[jira] [Commented] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's
[ https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291683#comment-13291683 ]

Hudson commented on NUTCH-1320:
-------------------------------

Integrated in Nutch-trunk #1865 (See [https://builds.apache.org/job/Nutch-trunk/1865/])
NUTCH-1320 IndexChecker and ParseChecker choke on IDN's (Revision 1347755)

Result = SUCCESS
markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1347755
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParserChecker.java
* /nutch/trunk/src/java/org/apache/nutch/util/URLUtil.java

IndexChecker and ParseChecker choke on IDN's
--------------------------------------------

Key: NUTCH-1320
URL: https://issues.apache.org/jira/browse/NUTCH-1320
Project: Nutch
Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.6
Attachments: NUTCH-1320-1.5-1.patch

These handy debug tools do not handle IDN's and throw an NPE

bin/nutch parsechecker http://例子.測試/%E9%A6%96%E9%A0%81

{code}
Exception in thread "main" java.lang.NullPointerException
        at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:71)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:116)
{code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's
[ https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291219#comment-13291219 ]

Hudson commented on NUTCH-1320:
-------------------------------

Integrated in nutch-trunk-maven #299 (See [https://builds.apache.org/job/nutch-trunk-maven/299/])
NUTCH-1320 IndexChecker and ParseChecker choke on IDN's (Revision 1347755)

Result = SUCCESS
markus :
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParserChecker.java
* /nutch/trunk/src/java/org/apache/nutch/util/URLUtil.java

IndexChecker and ParseChecker choke on IDN's
--------------------------------------------

Key: NUTCH-1320
URL: https://issues.apache.org/jira/browse/NUTCH-1320
Project: Nutch
Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.6
Attachments: NUTCH-1320-1.5-1.patch

These handy debug tools do not handle IDN's and throw an NPE

bin/nutch parsechecker http://例子.測試/%E9%A6%96%E9%A0%81

{code}
Exception in thread "main" java.lang.NullPointerException
        at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:71)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:116)
{code}
[jira] [Commented] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's
[ https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241212#comment-13241212 ]

Lewis John McGibbney commented on NUTCH-1320:
---------------------------------------------

Nice Markus. +1. Is there scope for this to be applied elsewhere, or is parserchecker the only instance (so far) where you've encountered the problem?

IndexChecker and ParseChecker choke on IDN's
--------------------------------------------

Key: NUTCH-1320
URL: https://issues.apache.org/jira/browse/NUTCH-1320
Project: Nutch
Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.5
Attachments: NUTCH-1320-1.5-1.patch

These handy debug tools do not handle IDN's and throw an NPE

bin/nutch parsechecker http://例子.測試/%E9%A6%96%E9%A0%81

{code}
Exception in thread "main" java.lang.NullPointerException
        at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:71)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:116)
{code}
[jira] [Commented] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's
[ https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241225#comment-13241225 ]

Markus Jelsma commented on NUTCH-1320:
--------------------------------------

Somewhere down the line IDN's enter the CrawlDB in ASCII, so there is no problem there, but these tools lack conversion. The filter and normalizer checker tools would also benefit.

This also suggests the need for an IDNNormalizer that does toUnicode when indexing; you don't want http://xn--*/ URL's in your index.

IndexChecker and ParseChecker choke on IDN's
--------------------------------------------

Key: NUTCH-1320
URL: https://issues.apache.org/jira/browse/NUTCH-1320
Project: Nutch
Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.5
Attachments: NUTCH-1320-1.5-1.patch

These handy debug tools do not handle IDN's and throw an NPE

bin/nutch parsechecker http://例子.測試/%E9%A6%96%E9%A0%81

{code}
Exception in thread "main" java.lang.NullPointerException
        at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:71)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:116)
{code}
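
For readers following the thread: the ASCII/Unicode host conversion Markus describes in the comment above can be sketched with the JDK's java.net.IDN class. This is only an illustration under assumptions, not the code from NUTCH-1320-1.5-1.patch and not Nutch's URLUtil API. The class name IdnUrlSketch and the methods toAsciiUrl, toUnicodeUrl and rebuild are hypothetical names made up for this example.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not part of Nutch: converts only the host part of a URL
// between its Unicode form and its ASCII (Punycode, "xn--") form.
final class IdnUrlSketch {

    /** Returns the URL with its host in ASCII form, as a checker tool might need before fetching. */
    static String toAsciiUrl(String url) {
        URL u = parse(url);
        return rebuild(u, IDN.toASCII(u.getHost()));
    }

    /** Returns the URL with its host in Unicode form, e.g. for display or indexing. */
    static String toUnicodeUrl(String url) {
        URL u = parse(url);
        return rebuild(u, IDN.toUnicode(u.getHost()));
    }

    private static URL parse(String url) {
        try {
            return new URL(url);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + url, e);
        }
    }

    // Reassembles the URL around the converted host; everything else is copied unchanged.
    private static String rebuild(URL u, String host) {
        StringBuilder sb = new StringBuilder();
        sb.append(u.getProtocol()).append("://");
        if (u.getUserInfo() != null) {
            sb.append(u.getUserInfo()).append('@');
        }
        sb.append(host);
        if (u.getPort() != -1) {
            sb.append(':').append(u.getPort());
        }
        sb.append(u.getFile()); // path plus query string, if any
        if (u.getRef() != null) {
            sb.append('#').append(u.getRef());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The IDN URL from the bug report, written with Unicode escapes for the host.
        String idn = "http://\u4f8b\u5b50.\u6e2c\u8a66/%E9%A6%96%E9%A0%81";
        String ascii = toAsciiUrl(idn);
        System.out.println(ascii);               // host now starts with "xn--"
        System.out.println(toUnicodeUrl(ascii)); // host back in Unicode form
    }
}
```

Only the host is touched here; the already percent-encoded path is passed through as-is. Whether the checker tools should convert before fetching, after indexing, or both is the design question raised in the comment, and this sketch does not settle it.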