[ https://issues.apache.org/jira/browse/NUTCH-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ferdy Galema closed NUTCH-1026.
-------------------------------

       Resolution: Fixed
    Fix Version/s:     (was: 2.1)
                   nutchgora

When indexing a huge dataset I ran into this issue too. The patch in NUTCH-1016 works fine. (Thanks Markus!) I verified and tested this. Committed at nutchgora.

Minor note: the patch checks for invalid chars ONLY on the "content" field of the NutchDocument. But since the problem is most likely to occur only on this field, it is okay for now.

> Strip UTF-8 non-character codepoints
> ------------------------------------
>
>                 Key: NUTCH-1026
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1026
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: nutchgora
>            Reporter: Markus Jelsma
>             Fix For: nutchgora
>
>
> During a very large crawl I found a few documents producing non-character codepoints. When indexing to Solr this will yield the following exception:
> {code}
> SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
>         at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
>         at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
> {code}
> Quite annoying! Here's a quick fix for SolrWriter that passes the value of the content field to a method that strips away the non-characters. I'm not too sure about this implementation, but the tests I've done locally with a huge dataset now pass correctly. Here's the list of codepoints to strip away:
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]
> Please comment!

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
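The actual NUTCH-1016 patch is not reproduced in this thread. As a minimal sketch of the stripping approach it describes: Unicode defines the Noncharacter_Code_Point set as U+FDD0..U+FDEF plus the last two code points of every plane (those ending in FFFE or FFFF), so a filter can walk the string by code point and drop anything matching those ranges. The class and method names below are hypothetical, not part of Nutch's actual API.

```java
public class NonCharStripper {

    /**
     * Returns a copy of the input with Unicode noncharacter code points
     * removed: U+FDD0..U+FDEF and any code point whose low 16 bits are
     * FFFE or FFFF (e.g. U+FFFF, U+1FFFE), per the Noncharacter_Code_Point
     * property. Iterates by code point so supplementary characters
     * (surrogate pairs) are handled correctly.
     */
    public static String stripNonCharacters(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);
            boolean nonChar = (cp >= 0xFDD0 && cp <= 0xFDEF)
                           || (cp & 0xFFFE) == 0xFFFE;
            if (!nonChar) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // advance 1 or 2 chars
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+FFFF is the code point from the stack trace above.
        System.out.println(stripNonCharacters("bad\uFFFFvalue"));
    }
}
```

A writer could then call this on the "content" field value before serializing the document to Solr, which is the spot where the Woodstox exception above is raised.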