behnam nikbakht created NUTCH-1375:
--------------------------------------

             Summary: extract main content of a html file
                 Key: NUTCH-1375
                 URL: https://issues.apache.org/jira/browse/NUTCH-1375
             Project: Nutch
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.4
            Reporter: behnam nikbakht
         Attachments: NUTCH-1375.patch

I have written code that extracts the main content of an HTML page (usually 
a weblog). This content usually appears in a <body><p> tag, but there is no 
guarantee of that. There may also be several <body><p> elements, of which 
only one holds the main content. The code first finds the body node, then 
computes a weight for each child node based on its text volume and its 
height in the tree. In this way it finds the lowest node that has the 
maximum text volume.
I hope that improving this code will lead to ways of detecting fake or 
duplicated pages.
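
To illustrate the idea described above, here is a minimal sketch. It is not 
the code in NUTCH-1375.patch, only one possible reading of the description. 
It assumes the page has already been turned into a well-formed DOM (here by 
the JDK XML parser, so the input must be XHTML-like). It measures "text 
volume" as the length of a node's whitespace-normalized text, and it treats 
"height" implicitly by walking down from the body node one level at a time. 
The class name MainContentSketch, the helper names, and the MIN_SHARE 
threshold of 0.5 are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class MainContentSketch {

    // Hypothetical threshold: the search only descends into a child that
    // holds at least this share of its parent's text.
    static final double MIN_SHARE = 0.5;

    /** Text volume of a node: length of its whitespace-normalized text. */
    static int textVolume(Node node) {
        String text = node.getTextContent();
        return text == null ? 0 : text.trim().replaceAll("\\s+", " ").length();
    }

    /** Walks down from root to the lowest element that still holds most of the text. */
    static Node findMainContent(Node root) {
        Node current = root;
        while (true) {
            int parentVolume = textVolume(current);
            Node best = null;
            int bestVolume = 0;
            NodeList children = current.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // only element children are candidates
                }
                int volume = textVolume(child);
                if (volume > bestVolume) {
                    best = child;
                    bestVolume = volume;
                }
            }
            // Stop when no single child dominates the text of the current node.
            if (best == null || bestVolume < MIN_SHARE * parentVolume) {
                return current;
            }
            current = best;
        }
    }

    /** Parses well-formed XHTML and returns its first body element. */
    static Node parseBody(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        return doc.getElementsByTagName("body").item(0);
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<div id=\"nav\"><p>Home</p><p>About</p></div>"
                + "<div id=\"post\">"
                + "<p>First paragraph of the weblog post, with some text.</p>"
                + "<p>Second paragraph of the weblog post, with more text.</p>"
                + "<p>Third paragraph of the weblog post, closing remarks.</p>"
                + "</div>"
                + "</body></html>";
        Element main = (Element) findMainContent(parseBody(page));
        System.out.println("main content node id: " + main.getAttribute("id"));
    }
}
```

With this sample page the search descends from body into the "post" div, 
because that div holds almost all of the text, and stops there, because none 
of its three paragraphs holds half of it. A different threshold or weight 
function would give different results, and a page with one dominant 
paragraph would resolve to that paragraph instead of its container.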


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
