[ https://issues.apache.org/jira/browse/NUTCH-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-2198:
-----------------------------------
    Description:
(reported by [~kalanya] in NUTCH-2168)
If raw binary is indexed using the plugin index-html this may cause an exception in Solr:
{noformat}
2016-01-05 12:28:00,152 INFO html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/com/investigacio/img/ciencia11.jpg
2016-01-05 12:28:00,163 INFO html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/serveis/cd/bib/reservori/2015/e-llibres/
2016-01-05 12:28:00,164 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,531 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,842 WARN mapred.LocalJobRunner - job_local1207147570_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char #137317, byte #139263)
{noformat}
The index-html plugin tries to treat any raw content as readable content converting it to a String based on the platform-dependent charset (cf.
[Scanner API docs|http://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html]):
{code:title=HtmlIndexingFilter.java}
    Scanner scanner = new Scanner(arrayInputStream);
    scanner.useDelimiter("\\Z");//To read all scanner content in one String
    String data = "";
    if (scanner.hasNext()) {
      data = scanner.next();
    }
    doc.add("rawcontent", StringUtil.cleanField(data));
{code}
The field "rawcontent" is of type "string":
{code:xml|title=conf/schema.xml}
<!-- fields for index-html plugin
     Note: although raw document content may be binary,
     index-html adds a String to the index field -->
<field name="rawcontent" type="string" stored="true" indexed="false"/>
{code}

  was:
(reported by [~kalanya] in NUTCH-2168)
If raw binary is indexed using the plugin index-html this may cause an exception in Solr:
{noformat}
2016-01-05 12:28:00,152 INFO html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/com/investigacio/img/ciencia11.jpg
2016-01-05 12:28:00,163 INFO html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/serveis/cd/bib/reservori/2015/e-llibres/
2016-01-05 12:28:00,164 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,531 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,842 WARN mapred.LocalJobRunner - job_local1207147570_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char #137317, byte #139263)
{noformat}
The index-html plugin tries to treat any raw content as readable content converting it to a String based on the platform-dependent charset (cf.
[Scanner API docus|http://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html]):
{code:title=HtmlIndexingFilter.java}
    Scanner scanner = new Scanner(arrayInputStream);
    scanner.useDelimiter("\\Z");//To read all scanner content in one String
    String data = "";
    if (scanner.hasNext()) {
      data = scanner.next();
    }
    doc.add("rawcontent", StringUtil.cleanField(data));
{code}
The field "rawcontent" is of type "string":
{code:xml|title=conf/schema.xml}
<!-- fields for index-html plugin
     Note: although raw document content may be binary,
     index-html adds a String to the index field -->
<field name="rawcontent" type="string" stored="true" indexed="false"/>
{code}


> Indexing binary content by index-html causes Solr Exception
> -----------------------------------------------------------
>
>                 Key: NUTCH-2198
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2198
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 2.3.1
>            Reporter: Sebastian Nagel
>             Fix For: 2.4
>
>
> (reported by [~kalanya] in NUTCH-2168)
> If raw binary is indexed using the plugin index-html this may cause an
> exception in Solr:
> {noformat}
> 2016-01-05 12:28:00,152 INFO html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/com/investigacio/img/ciencia11.jpg
> 2016-01-05 12:28:00,163 INFO html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/serveis/cd/bib/reservori/2015/e-llibres/
> 2016-01-05 12:28:00,164 INFO solr.SolrIndexWriter - Adding 250 documents
> 2016-01-05 12:28:00,531 INFO solr.SolrIndexWriter - Adding 250 documents
> 2016-01-05 12:28:00,842 WARN mapred.LocalJobRunner - job_local1207147570_0001
> java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char #137317, byte #139263)
> {noformat}
> The index-html plugin tries to treat any raw content as readable content
> converting it to a String based on the platform-dependent
> charset (cf. [Scanner API docs|http://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html]):
> {code:title=HtmlIndexingFilter.java}
>     Scanner scanner = new Scanner(arrayInputStream);
>     scanner.useDelimiter("\\Z");//To read all scanner content in one String
>     String data = "";
>     if (scanner.hasNext()) {
>       data = scanner.next();
>     }
>     doc.add("rawcontent", StringUtil.cleanField(data));
> {code}
> The field "rawcontent" is of type "string":
> {code:xml|title=conf/schema.xml}
> <!-- fields for index-html plugin
>      Note: although raw document content may be binary,
>      index-html adds a String to the index field -->
> <field name="rawcontent" type="string" stored="true" indexed="false"/>
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
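
For illustration only, and not the fix adopted for NUTCH-2198: the description notes that the raw bytes are turned into a String using the platform-dependent charset. A minimal sketch of one way to avoid that is to decode with an explicit charset and then drop code points that XML 1.0 does not allow (such as the U+FFFE reported in the Solr exception). The class name `RawContentDecoder` and its methods are hypothetical and not part of Nutch; choosing UTF-8 is an assumption, since the real charset of a fetched document may differ.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch or the index-html plugin.
class RawContentDecoder {

    /**
     * Decode with an explicit charset instead of the platform default.
     * UTF-8 is an assumption here; malformed byte sequences are replaced
     * by the decoder rather than raising an exception.
     */
    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    /**
     * Keep only code points permitted by the XML 1.0 "Char" production.
     * This drops U+FFFE, U+FFFF, unpaired surrogates, and control
     * characters other than tab, line feed, and carriage return.
     */
    static String stripInvalid(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            boolean valid = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (valid) {
                sb.appendCodePoint(cp);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A few bytes resembling the start of a JPEG, followed by text and a NUL.
        byte[] raw = {(byte) 0xFF, (byte) 0xD8, 'o', 'k', 0x00};
        String cleaned = stripInvalid(decode(raw));
        System.out.println(cleaned.endsWith("ok"));
    }
}
```

In the filter, the cleaned value would stand in for `data` before it is passed to `doc.add("rawcontent", ...)`. This only keeps the indexing request well-formed; binary documents such as the JPEG in the log would still produce meaningless text, so skipping non-text content types may be the better choice.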