[jira] [Commented] (NUTCH-710) Support for rel=canonical attribute
[ https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910089#comment-13910089 ]

Sertac TURKEL commented on NUTCH-710:
-------------------------------------

Hi [~jnioche] [~lewismc],
I would like to work on this issue for the 2.x branch. What is the latest decision on it?

Support for rel=canonical attribute
-----------------------------------

Key: NUTCH-710
URL: https://issues.apache.org/jira/browse/NUTCH-710
Project: Nutch
Issue Type: New Feature
Affects Versions: 1.1
Reporter: Frank McCown
Priority: Minor
Fix For: 2.3, 1.8
Attachments: canonical.patch

There is a new rel=canonical attribute, which is now supported by Google, Yahoo, and Live:
http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
Adding support for this attribute value will potentially reduce the number of URLs crawled and indexed and reduce duplicate page content.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
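To illustrate what supporting rel=canonical could involve, here is a minimal sketch that pulls the canonical URL out of a page's markup. It is not the attached canonical.patch and does not use any Nutch API. The class name `CanonicalLinkSketch`, the method `findCanonical`, and the example URLs are assumptions made for illustration, and the regular expressions are a simplification that a real HTML parser would replace.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: finds <link rel="canonical" href="...">.
public class CanonicalLinkSketch {

    // Match a whole <link ...> tag first; rel and href are then checked
    // separately so the two attributes may appear in either order.
    private static final Pattern LINK_TAG =
        Pattern.compile("<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern REL_CANONICAL =
        Pattern.compile("\\brel\\s*=\\s*[\"']?canonical\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern HREF =
        Pattern.compile("\\bhref\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    /** Returns the href of the first link tag marked rel=canonical, if any. */
    public static Optional<String> findCanonical(String html) {
        Matcher tags = LINK_TAG.matcher(html);
        while (tags.find()) {
            String tag = tags.group();
            if (REL_CANONICAL.matcher(tag).find()) {
                Matcher href = HREF.matcher(tag);
                if (href.find()) {
                    return Optional.of(href.group(1));
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String page = "<html><head>"
            + "<link rel=\"stylesheet\" href=\"/style.css\">"
            + "<link rel=\"canonical\" href=\"http://example.com/product\">"
            + "</head><body>duplicate view of the product page</body></html>";
        System.out.println(findCanonical(page).orElse("(none)"));
    }
}
```

A crawler could then, for example, skip indexing a page whose canonical URL differs from its own URL and queue the canonical URL instead. Whether and where to do that in Nutch is the open question in this issue.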
[jira] [Commented] (NUTCH-1726) HeadingsFilter does not find nested nodes
[ https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910355#comment-13910355 ]

lufeng commented on NUTCH-1726:
-------------------------------

Hi Markus,

It seems that HeadingsFilter does not find nested nodes in my test code, but I cannot reproduce your test result when I use the following process to test our patch:

{code:bash}
svn checkout https://svn.apache.org/repos/asf/nutch/trunk nutch-svn2
cd nutch-svn2
patch -p0 < NUTCH-1726-trunk.patch
ant
cd src/plugin/headings/
ant test
{code}

Everything seems OK.

Yes, you are right, maybe someone wants to ignore long headers. But should we allow setting the headings.maxlength option to -1 to disable this check? Maybe someone wants to disable this feature.

Feng

HeadingsFilter does not find nested nodes
-----------------------------------------

Key: NUTCH-1726
URL: https://issues.apache.org/jira/browse/NUTCH-1726
Project: Nutch
Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.8
Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch, NUTCH-1726-trunk.patch

Filter won't find:
{code}
<h1><span>apache nutch</span></h1>
{code}

The getNodeValue() tries to read data from children but should traverse nodes instead.
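The difference between reading data from direct children and traversing the subtree can be sketched with the standard org.w3c.dom API. This is not the code from the attached patches or the actual HeadingsFilter implementation. The class `NestedHeadingSketch` and its methods are hypothetical names, and the heading is parsed as well-formed XML purely to obtain a DOM for the demonstration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical demo class, not part of Nutch.
public class NestedHeadingSketch {

    /** Reads only text nodes that are direct children; misses nested text. */
    public static String directChildText(Node node) {
        StringBuilder sb = new StringBuilder();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString().trim();
    }

    /** Walks the whole subtree depth-first and collects every text node. */
    public static String subtreeText(Node node) {
        StringBuilder sb = new StringBuilder();
        collect(node, sb);
        return sb.toString().trim();
    }

    private static void collect(Node node, StringBuilder sb) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append(node.getNodeValue());
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, sb);
        }
    }

    /** Parses a well-formed snippet and returns its root element. */
    public static Node parseRoot(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse snippet", e);
        }
    }

    public static void main(String[] args) {
        Node h1 = parseRoot("<h1><span>apache nutch</span></h1>");
        System.out.println("direct children: '" + directChildText(h1) + "'");
        System.out.println("full traversal:  '" + subtreeText(h1) + "'");
    }
}
```

For the nested heading from the issue, the direct-children reader yields an empty string because the only child of h1 is the span element, while the full traversal yields "apache nutch".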
[jira] [Comment Edited] (NUTCH-1726) HeadingsFilter does not find nested nodes
[ https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910355#comment-13910355 ]

lufeng edited comment on NUTCH-1726 at 2/24/14 2:41 PM:
--------------------------------------------------------

Hi Markus,

It seems that HeadingsFilter does not find nested nodes in my test code, but I cannot reproduce your test result when I use the following process to test our patch:

{code:java}
svn checkout https://svn.apache.org/repos/asf/nutch/trunk nutch-svn2
cd nutch-svn2
patch -p0 < NUTCH-1726-trunk.patch
ant
cd src/plugin/headings/
ant test
{code}

Everything seems OK.

Yes, you are right, maybe someone wants to ignore long headers. But should we allow setting the headings.maxlength option to -1 to disable this check? Maybe someone wants to disable this feature.

Feng

was (Author: amuseme.lu):

Hi Markus,

It seems that HeadingsFilter does not find nested nodes in my test code, but I cannot reproduce your test result when I use the following process to test our patch:

{code:bash}
svn checkout https://svn.apache.org/repos/asf/nutch/trunk nutch-svn2
cd nutch-svn2
patch -p0 < NUTCH-1726-trunk.patch
ant
cd src/plugin/headings/
ant test
{code}

Everything seems OK.

Yes, you are right, maybe someone wants to ignore long headers. But should we allow setting the headings.maxlength option to -1 to disable this check? Maybe someone wants to disable this feature.

Feng