Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "NutchGotchas" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/NutchGotchas?action=diff&rev1=1&rev2=2

- The following list is meant to document "Gotchas" that exist in the Nutch Codebase and in its usage.
+ The following list documents Nutch "Gotchas". It is meant as prerequisite reading that makes explicit the implicit information currently existing in the Nutch codebase and in its general usage.

  == Developing Nutch: Gotchas ==
+ The development of this Gotchas list should be driven purely by community movement and consensus that it is necessary to make implicit information explicit, in an attempt to create an easier working environment for Nutch users at all levels. The list below has been compiled as a repository of information which emerged during discussions on the user@ list. As with many areas of the Nutch wiki, this list is a non-static resource and all Nutch users are invited to edit it based upon experience and community consensus.

+ == Current Gotchas and using them ==
- == Using Nutch: Gotchas ==
+ NUTCH-1016: Strip UTF-8 non-character codepoints
+ This JIRA issue affects the indexer. It concerns UTF-8 non-character codepoints which exist within some documents, and it was initially discovered during large crawls. When such a document is indexed to Solr, the following exception is thrown:
+
+ {{{
+ SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
+ at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
+ at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
+ }}}
+
+ The fix (committed by Markus) for the SolrWriter class passes the value of the content field to a method that strips away non-characters, effectively avoiding the runtime exception. Various patches are available [[https://issues.apache.org/jira/browse/NUTCH-1016|here]]
+
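As a rough illustration of the kind of filtering the NUTCH-1016 fix describes, the following Java sketch removes Unicode non-characters from a string before it would be handed to an indexer. The class and method names (NonCharStripper, stripNonCharacters, isNonCharacter) are hypothetical and are not taken from the committed patch. The set of code points removed here (U+FDD0 to U+FDEF, plus the last two code points of every plane, such as U+FFFE and U+FFFF) follows the Unicode definition of non-characters and is an assumption; the actual patch may filter a different set.

```java
// Hypothetical helper, not the code committed for NUTCH-1016.
final class NonCharStripper {

    private NonCharStripper() {
    }

    // True for the 66 code points that Unicode reserves as non-characters:
    // U+FDD0 to U+FDEF, and the last two code points of each plane
    // (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF).
    static boolean isNonCharacter(int codePoint) {
        return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
                || (codePoint & 0xFFFE) == 0xFFFE;
    }

    // Returns a copy of the input with every non-character code point removed.
    // Iterates by code point so that supplementary characters encoded as
    // surrogate pairs are tested as a whole.
    static String stripNonCharacters(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            int codePoint = input.codePointAt(i);
            if (!isNonCharacter(codePoint)) {
                out.appendCodePoint(codePoint);
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Build a string that contains U+FFFF, the code point named in the
        // exception message, then strip it.
        String dirty = "content" + (char) 0xFFFF + " field";
        System.out.println(stripNonCharacters(dirty));
    }
}
```

In a writer class, a call such as NonCharStripper.stripNonCharacters(value) would be applied to the content field value before the document is sent to Solr. Where exactly that call belongs in Nutch is not shown here; the patches attached to the JIRA issue are the authoritative source.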

