[ 
https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1016:
---------------------------------

    Attachment: NUTCH-1016-1.4-2.patch

Silly me again, the patch was wrong. Changed ORs to ANDs!

This patch also includes more verbose output of the SolrWriter class. Handy for 
batches of many thousands of documents. This patch doesn't include a change to 
log4j.properties though.

Should I get rid of the logging? Keep it?

> Strip UTF-8 non-character codepoints
> ------------------------------------
>
>                 Key: NUTCH-1016
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1016
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1016-1.4-2.patch
>
>
> During a very large crawl I found a few documents producing non-character 
> codepoints. When indexing to Solr this will yield the following exception:
> {code}
> SEVERE: java.lang.RuntimeException: [was class 
> java.io.CharConversionException] Invalid UTF-8 character 0xffff at char 
> #1142033, byte #1155068)
>         at 
> com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
>         at 
> com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
> {code}
> Quite annoying! Here's a quick fix for SolrWriter that'll pass the value of the 
> content field to a method to strip away non-characters. I'm not too sure 
> about this implementation but the tests I've done locally with a huge dataset 
> now pass correctly. Here's a list of codepoints to strip away: 
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]
> Please comment!
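
For illustration, the stripping described above could be sketched roughly as
follows. This is not the code from the attached patch: the class and method
names (NonCharacterStripper, isNonCharacter, stripNonCharacters) are made up
for this example, and it assumes the set to strip is exactly the Unicode
noncharacters, i.e. U+FDD0..U+FDEF plus every code point whose low 16 bits are
0xFFFE or 0xFFFF.

```java
public final class NonCharacterStripper {

    private NonCharacterStripper() {}

    // True for Unicode noncharacters: U+FDD0..U+FDEF, and the last two
    // code points of every plane (U+FFFE, U+FFFF, U+1FFFE, ... U+10FFFF).
    static boolean isNonCharacter(int codePoint) {
        return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
                || (codePoint & 0xFFFE) == 0xFFFE;
    }

    // Returns a copy of the input with all noncharacter code points removed.
    // Iterates by code point so supplementary characters (surrogate pairs)
    // are checked as a whole; everything else is copied through unchanged.
    public static String stripNonCharacters(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            if (!isNonCharacter(codePoint)) {
                out.appendCodePoint(codePoint);
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }
}
```

A caller would run the field value through it before handing the document to
Solr, e.g. NonCharacterStripper.stripNonCharacters(content). Note that when
the check is written as a "keep this character" condition instead, the negated
range tests have to be combined with AND, not OR.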

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
