[ https://issues.apache.org/jira/browse/NUTCH-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney updated NUTCH-1375:
----------------------------------------
    Patch Info: Patch Available
    Fix Version/s: 1.7

> extract main content of a html file
> -----------------------------------
>
>                 Key: NUTCH-1375
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1375
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: behnam nikbakht
>             Fix For: 1.7
>
>         Attachments: NUTCH-1375.patch
>
>
> I wrote code that extracts the main content of an HTML page (usually a weblog).
> This content usually appears under a <body><p> tag, but that is not guaranteed.
> There may also be multiple tags of the form <body><p>, of which only one holds
> the main content. The code first finds the body node, then computes a weight for
> each child node based on its text volume and height. It then selects the lowest
> node that has the maximum text volume.
> I hope that improving this code will lead to ways of detecting fake or
> duplicated pages.
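
The heuristic described in the issue can be illustrated with a small standalone sketch. This is not the code from NUTCH-1375.patch, whose weighting is not shown here. It is a simplified Python illustration that assumes "text volume" means the number of characters of visible text in a subtree and that "lowest node with maximum text volume" means descending from the body for as long as a single child holds most of its parent's text. The names `Node`, `TreeBuilder`, `main_content` and the `dominance` threshold are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, and elements whose text is not visible.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}


class Node:
    """One element of a minimal DOM-like tree (hypothetical helper)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs or []), parent
        self.items = []  # child Nodes and text strings, in document order

    def children(self):
        return [i for i in self.items if isinstance(i, Node)]

    def text(self):
        """Visible text of the subtree with whitespace collapsed."""
        if self.tag in SKIP_TAGS:
            return ""
        parts = [i if isinstance(i, str) else i.text() for i in self.items]
        return " ".join(" ".join(parts).split())

    def volume(self):
        """Assumed weight: number of characters of visible text."""
        return len(self.text())


class TreeBuilder(HTMLParser):
    """Builds a Node tree; tolerant of unclosed or stray tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.items.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.items.append(data)


def find(node, tag):
    """Depth-first search for the first element named `tag`."""
    if node.tag == tag:
        return node
    for child in node.children():
        hit = find(child, tag)
        if hit is not None:
            return hit
    return None


def main_content(html, dominance=0.5):
    """Return the lowest node that still holds most of the body text.

    Starting at <body>, step into the child with the largest text volume
    while that child holds more than `dominance` of its parent's text.
    """
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    node = find(builder.root, "body") or builder.root
    while True:
        total, kids = node.volume(), node.children()
        if not kids or total == 0:
            return node
        best = max(kids, key=Node.volume)
        if best.volume() <= dominance * total:
            return node
        node = best


SAMPLE = """<html><head><title>t</title><script>var x = 1;</script></head><body>
<div id="nav"><a href="/">Home</a> <a href="/about">About</a></div>
<div id="post"><p>First paragraph of the post, long enough to matter.</p>
<p>Second paragraph of the post, also reasonably long text.</p>
<p>Third paragraph of the post, closing the article nicely.</p></div>
<div id="footer"><p>Copyright</p></div>
</body></html>"""

node = main_content(SAMPLE)
print(node.tag, node.attrs, node.text())
```

The `dominance` threshold is a tunable assumption: a value of 0.5 stops the descent once no single child carries more than half of the text, which keeps a multi-paragraph post together instead of returning only its longest paragraph. A post with only two paragraphs of unequal length would still collapse to the longer one, so a production filter would likely need a stricter threshold or an additional signal such as node depth.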