[ https://issues.apache.org/jira/browse/NUTCH-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ferdy Galema closed NUTCH-1026.
-------------------------------

       Resolution: Fixed
    Fix Version/s:     (was: 2.1)
                   nutchgora

When indexing a huge dataset I ran into this issue too. The patch in NUTCH-1016 works fine. (Thanks Markus!) I verified and tested this. Committed at nutchgora.

Minor note: the patch checks for invalid chars ONLY on the "content" field of the NutchDocument. But since the problem is most likely to occur only on this field, it is okay for now.

> Strip UTF-8 non-character codepoints
> ------------------------------------
>
>                 Key: NUTCH-1026
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1026
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: nutchgora
>            Reporter: Markus Jelsma
>             Fix For: nutchgora
>
>
> During a very large crawl I found a few documents producing non-character codepoints. When indexing to Solr this will yield the following exception:
> {code}
> SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
>         at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
>         at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
> {code}
> Quite annoying! Here's a quick fix for SolrWriter that passes the value of the content field to a method that strips away the non-characters. I'm not too sure about this implementation, but the tests I've done locally with a huge dataset now pass correctly. Here's the list of codepoints to strip away:
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]
> Please comment!

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
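The actual NUTCH-1016 patch is not reproduced in this thread. As a minimal sketch of the stripping approach it describes: Unicode defines the Noncharacter_Code_Point set as U+FDD0..U+FDEF plus the last two code points of every plane (those ending in FFFE or FFFF), so a filter can walk the string by code point and drop anything matching those ranges. The class and method names below are hypothetical, not part of Nutch's actual API.

```java
public class NonCharStripper {

    /**
     * Returns a copy of the input with Unicode noncharacter code points
     * removed: U+FDD0..U+FDEF and any code point whose low 16 bits are
     * FFFE or FFFF (e.g. U+FFFF, U+1FFFE), per the Noncharacter_Code_Point
     * property. Iterates by code point so supplementary characters
     * (surrogate pairs) are handled correctly.
     */
    public static String stripNonCharacters(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);
            boolean nonChar = (cp >= 0xFDD0 && cp <= 0xFDEF)
                           || (cp & 0xFFFE) == 0xFFFE;
            if (!nonChar) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // advance 1 or 2 chars
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+FFFF is the code point from the stack trace above.
        System.out.println(stripNonCharacters("bad\uFFFFvalue"));
    }
}
```

A writer could then call this on the "content" field value before serializing the document to Solr, which is the spot where the Woodstox exception above is raised.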