[jira] [Updated] (NUTCH-1375) extract main content of a html file

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1375:


   Patch Info: Patch Available
Fix Version/s: 1.7

 extract main content of a html file
 ---

 Key: NUTCH-1375
 URL: https://issues.apache.org/jira/browse/NUTCH-1375
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: behnam nikbakht
 Fix For: 1.7

 Attachments: NUTCH-1375.patch


 I wrote code that extracts the main content of an HTML page (usually a
 weblog). This content usually appears in a bodyp tag, but there is no
 guarantee of that. There may also be multiple tags of the bodyp form, only
 one of which holds the main content. The code first finds the body node,
 then computes a weight for each of its child nodes based on text volume and
 height. It then selects the lowest node that has the maximum text volume.
 I hope that improving this code leads to ways of detecting fake or
 duplicated pages.
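
 The approach described above can be illustrated with a minimal sketch. It is
 not the code in NUTCH-1375.patch. It is one possible reading of the
 description, written in Python with only the standard library. The names
 Node, TreeBuilder, main_content_node, extract_main_node and the share
 threshold are all hypothetical, and "height" is interpreted here as how far
 down the tree the search may descend before the text stops being
 concentrated in a single child.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose text is not page content.
SKIP_TAGS = {"script", "style"}


class Node:
    """Hypothetical minimal DOM node: tag, attributes, children, own text size."""

    def __init__(self, tag, attrs=(), parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []
        self.own_text = 0  # characters of text directly inside this node

    def text_volume(self):
        # Text volume of a node = its own text plus that of all descendants.
        return self.own_text + sum(c.text_volume() for c in self.children)


class TreeBuilder(HTMLParser):
    """Builds a rough element tree; tolerant of unclosed or stray tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIP_TAGS:
            self.current.own_text += len(data.strip())


def find_first(node, tag):
    """Depth-first search for the first node with the given tag name."""
    if node.tag == tag:
        return node
    for child in node.children:
        found = find_first(child, tag)
        if found is not None:
            return found
    return None


def main_content_node(body, share=0.6):
    """Descend from body while one child holds at least `share` of the text.

    The node where the descent stops is the lowest node that still contains
    most of the page text. The 0.6 default is an arbitrary assumption.
    """
    node = body
    while node.children:
        total = node.text_volume()
        best = max(node.children, key=Node.text_volume)
        if total == 0 or best.text_volume() < share * total:
            break
        node = best
    return node


def extract_main_node(html, share=0.6):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    body = find_first(builder.root, "body") or builder.root
    return main_content_node(body, share)
```

 Calling extract_main_node(page_source) returns the selected node, whose tag
 and attrs can then be inspected. On a page with a short menu, a long post
 and a short footer, the descent would be expected to stop at the element
 wrapping the post, because no single paragraph inside it holds most of the
 text.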

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1375) extract main content of a html file

2012-05-22 Thread behnam nikbakht (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

behnam nikbakht updated NUTCH-1375:
---

Attachment: NUTCH-1375.patch

 extract main content of a html file
 ---

 Key: NUTCH-1375
 URL: https://issues.apache.org/jira/browse/NUTCH-1375
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: behnam nikbakht
 Attachments: NUTCH-1375.patch


