[ https://issues.apache.org/jira/browse/NUTCH-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-2198:
-----------------------------------
Description:
(reported by [~kalanya] in NUTCH-2168)
If raw binary content is indexed with the index-html plugin, Solr may
throw an exception:
{noformat}
2016-01-05 12:28:00,152 INFO html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/com/investigacio/img/ciencia11.jpg
2016-01-05 12:28:00,163 INFO html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/serveis/cd/bib/reservori/2015/e-llibres/
2016-01-05 12:28:00,164 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,531 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,842 WARN mapred.LocalJobRunner - job_local1207147570_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char #137317, byte #139263)
{noformat}
The index-html plugin treats any raw content as readable text and converts
it to a String using the platform's default charset (cf.
[Scanner API docs|http://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html]):
{code:title=HtmlIndexingFilter.java}
Scanner scanner = new Scanner(arrayInputStream);
scanner.useDelimiter("\\Z"); // To read all scanner content in one String
String data = "";
if (scanner.hasNext()) {
  data = scanner.next();
}
doc.add("rawcontent", StringUtil.cleanField(data));
{code}
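To illustrate the charset problem in the snippet above, here is a minimal sketch of strict decoding with an explicit charset. The class {{StrictUtf8}} and the method {{decodeOrNull}} are hypothetical names, not part of Nutch, and the choice of UTF-8 is an assumption. The check only rejects byte sequences that are not well-formed UTF-8 (typical for images). It does not reject well-formed sequences that encode code points such as U+FFFE.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not part of Nutch. */
final class StrictUtf8 {

  /**
   * Decodes raw bytes as UTF-8 (an assumed charset).
   * Returns null if the bytes are not well-formed UTF-8.
   */
  static String decodeOrNull(byte[] raw) {
    // REPORT makes the decoder throw instead of silently substituting U+FFFD.
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
      return decoder.decode(ByteBuffer.wrap(raw)).toString();
    } catch (CharacterCodingException e) {
      return null; // most likely binary content, or text in another charset
    }
  }

  public static void main(String[] args) {
    byte[] html = "<html>ok</html>".getBytes(StandardCharsets.UTF_8);
    // First bytes of a typical JPEG file; 0xFF can never start a UTF-8 sequence.
    byte[] jpegStart = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
    System.out.println(decodeOrNull(html));
    System.out.println(decodeOrNull(jpegStart));
  }
}
```

A filter could skip the rawcontent field whenever such a check returns null. Whether that is the right fix for this issue is left open here.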
The field "rawcontent" is of type "string":
{code:xml|title=conf/schema.xml}
<!-- fields for index-html plugin
Note: although raw document content may be binary,
index-html adds a String to the index field -->
<field name="rawcontent" type="string" stored="true" indexed="false"/>
{code}
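Since the field value is sent to Solr as a string, one possible mitigation is to drop characters that XML 1.0 does not allow (U+FFFE is among them) before adding the field. The sketch below assumes that the documents reach Solr through an XML parser, which the log above suggests but does not prove. The class {{XmlCharFilter}} and its methods are hypothetical and not part of Nutch.

```java
/** Hypothetical helper for illustration only; not part of Nutch. */
final class XmlCharFilter {

  /** True if the code point matches the Char production of XML 1.0. */
  static boolean isXml10Char(int cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
  }

  /** Returns a copy of s without code points that XML 1.0 forbids. */
  static String stripInvalid(String s) {
    StringBuilder out = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); ) {
      // codePointAt returns the lone surrogate value for unpaired surrogates,
      // which falls outside the allowed ranges and is therefore dropped.
      int cp = s.codePointAt(i);
      if (isXml10Char(cp)) {
        out.appendCodePoint(cp);
      }
      i += Character.charCount(cp);
    }
    return out.toString();
  }

  public static void main(String[] args) {
    String dirty = "raw\uFFFEcontent";
    System.out.println(stripInvalid(dirty).length());
  }
}
```

This only makes the string acceptable to an XML parser. It does not turn binary data into meaningful text, so skipping non-text content may still be the better option.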
> Indexing binary content by index-html causes Solr Exception
> -----------------------------------------------------------
>
> Key: NUTCH-2198
> URL: https://issues.apache.org/jira/browse/NUTCH-2198
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 2.3.1
> Reporter: Sebastian Nagel
> Fix For: 2.4
--
This message was sent by Atlassian JIRA (v6.3.4#6332)