[jira] [Commented] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's

2012-06-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291683#comment-13291683
 ] 

Hudson commented on NUTCH-1320:
---

Integrated in Nutch-trunk #1865 (See 
[https://builds.apache.org/job/Nutch-trunk/1865/])
NUTCH-1320 IndexChecker and ParseChecker choke on IDN's (Revision 1347755)

 Result = SUCCESS
markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1347755
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParserChecker.java
* /nutch/trunk/src/java/org/apache/nutch/util/URLUtil.java


 IndexChecker and ParseChecker choke on IDN's
 

 Key: NUTCH-1320
 URL: https://issues.apache.org/jira/browse/NUTCH-1320
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6

 Attachments: NUTCH-1320-1.5-1.patch


 These handy debug tools do not handle IDN's and throw an NPE
 bin/nutch parsechecker http://例子.測試/%E9%A6%96%E9%A0%81
 {code}
 Exception in thread "main" java.lang.NullPointerException
 at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:71)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:116)
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's

2012-06-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291219#comment-13291219
 ] 

Hudson commented on NUTCH-1320:
---

Integrated in nutch-trunk-maven #299 (See 
[https://builds.apache.org/job/nutch-trunk-maven/299/])
NUTCH-1320 IndexChecker and ParseChecker choke on IDN's (Revision 1347755)

 Result = SUCCESS
markus : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParserChecker.java
* /nutch/trunk/src/java/org/apache/nutch/util/URLUtil.java






[jira] [Commented] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's

2012-03-29 Thread Lewis John McGibbney (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241212#comment-13241212
 ] 

Lewis John McGibbney commented on NUTCH-1320:
-

Nice Markus. +1. Is there scope for this to be applied elsewhere, or is 
parserchecker the only instance (so far) where you've encountered the problem?

 IndexChecker and ParseChecker choke on IDN's
 

 Key: NUTCH-1320
 URL: https://issues.apache.org/jira/browse/NUTCH-1320
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1320-1.5-1.patch


 These handy debug tools do not handle IDN's and throw an NPE
 bin/nutch parsechecker http://例子.測試/%E9%A6%96%E9%A0%81
 {code}
 Exception in thread "main" java.lang.NullPointerException
 at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:71)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:116)
 {code}





[jira] [Commented] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's

2012-03-29 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241225#comment-13241225
 ] 

Markus Jelsma commented on NUTCH-1320:
--

Somewhere down the line IDNs enter the CrawlDB in ASCII, so there is no problem 
there, but these tools lack the conversion. The filter and normalizer checker 
tools would also benefit. This also suggests the need for an IDNNormalizer that 
does toUnicode when indexing; you don't want http://xn--*/ URLs in your index.
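
To illustrate the two conversions mentioned above, here is a minimal sketch 
built on the JDK's java.net.IDN class. It is not the committed patch: the 
class name IdnSketch and the helpers toAsciiUrl and toUnicodeHost are 
hypothetical, and the actual change in URLUtil.java may look different.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper class, for illustration only (not part of Nutch).
final class IdnSketch {

  /** Returns the URL with its host converted to the ASCII (Punycode) form. */
  static String toAsciiUrl(String url) {
    try {
      URL u = new URL(url);
      String asciiHost = IDN.toASCII(u.getHost());
      // getFile() is the path plus query; a #fragment is dropped in this sketch.
      return new URL(u.getProtocol(), asciiHost, u.getPort(), u.getFile()).toString();
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException("Not a valid URL: " + url, e);
    }
  }

  /** Converts an ASCII (xn--) host name back to Unicode, e.g. before indexing. */
  static String toUnicodeHost(String asciiHost) {
    return IDN.toUnicode(asciiHost);
  }

  public static void main(String[] args) {
    // The IDN from the issue report, written with Unicode escapes so the
    // source file stays ASCII-only.
    String url = "http://\u4f8b\u5b50.\u6e2c\u8a66/%E9%A6%96%E9%A0%81";
    // Prints the same URL with an xn-- encoded host and the path unchanged.
    System.out.println(toAsciiUrl(url));
  }
}
```

A checker tool could run its URL argument through something like toAsciiUrl 
before fetching, while an indexing-side normalizer would go the other way 
with toUnicodeHost.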
