[jira] [Commented] (NUTCH-2198) Indexing binary content by index-html causes Solr Exception

2016-01-09 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090625#comment-15090625
 ] 

Sebastian Nagel commented on NUTCH-2198:


Tried to reproduce the Solr exception by indexing one of the JPEGs shown in the 
log snippet (ciencia11.jpg).
* the Solr exception is not caused by this image (or Solr 4.10.4 is safe)
* however, the indexed rawcontent is modified. E.g., the 4 leading bytes are 
stripped:
{noformat}
% od -tcx1 ciencia11.jpg | head -2
000 377 330 377 341  \v   /   E   x   i   f  \0  \0   M   M  \0   *
 ff  d8  ff  e1  0b  2f  45  78  69  66  00  00  4d  4d  00  2a
{noformat}
vs.
{noformat}
% curl -s 'http://localhost:8983/solr/collection1/select?q=url%3A%22http%3A%2F%2Flocalhost%2Fnutch%2Ftest%2Fciencia11.jpg%22&wt=json&indent=true'
{
  "responseHeader":{
"status":0,
"QTime":0,
"params":{
  "q":"url:\"http://localhost/nutch/test/ciencia11.jpg\"",
  "indent":"true",
  "wt":"json"}},
  "response":{"numFound":1,"start":0,"docs":[
  {
"tstamp":"1970-01-01T00:00:00Z",
"rawcontent":"#11;/Exif#0;#0;MM#0;*#0;#0;#0;#8;#0; ...
{noformat}
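
To illustrate why the leading bytes cannot survive, here is a minimal sketch that is not taken from the plugin. It assumes UTF-8 as the platform charset and decodes the first bytes of the od dump above to a String and back. The byte 0xff is never valid in UTF-8, so the decoder substitutes a replacement character and the original bytes are lost even before any field cleaning happens. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyDecodeDemo {

  // First ten bytes of ciencia11.jpg as shown in the od output above.
  static final byte[] JPEG_HEAD = {
      (byte) 0xff, (byte) 0xd8, (byte) 0xff, (byte) 0xe1,
      0x0b, 0x2f, 0x45, 0x78, 0x69, 0x66 };

  // Decode the bytes as UTF-8 text and encode the text again.
  // Assumption: UTF-8 stands in for the platform-dependent charset.
  static byte[] roundTrip(byte[] raw) {
    String decoded = new String(raw, StandardCharsets.UTF_8);
    return decoded.getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    byte[] back = roundTrip(JPEG_HEAD);
    System.out.println("identical after round trip: "
        + Arrays.equals(JPEG_HEAD, back));
  }
}
```

With a single-byte platform charset such as ISO-8859-1 the decoding step itself would be reversible, which is one reason the behaviour differs from machine to machine.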

We need a different mechanism to index HTML or binary content -- as binary 
field, converting it to Base64, etc. Forcing a string conversion by a 
platform-dependent charset and then stripping some (but not all!) binary 
characters away is surely no proper solution.
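
As a rough sketch of the Base64 option (only an illustration, not a patch): the raw bytes are encoded to an ASCII-only string before being added to the document and decoded again on retrieval. It assumes Java 8 or later for java.util.Base64. The class and method names are hypothetical, and wiring this into HtmlIndexingFilter and the Solr schema is left open.

```java
import java.util.Arrays;
import java.util.Base64;

public class Base64RawContent {

  // Encode raw bytes into an ASCII-only string that is safe for a text field.
  static String encode(byte[] raw) {
    return Base64.getEncoder().encodeToString(raw);
  }

  // Recover the exact original bytes from the stored field value.
  static byte[] decode(String fieldValue) {
    return Base64.getDecoder().decode(fieldValue);
  }

  public static void main(String[] args) {
    // The four leading bytes of the JPEG from the od output.
    byte[] raw = { (byte) 0xff, (byte) 0xd8, (byte) 0xff, (byte) 0xe1 };
    String stored = encode(raw);
    System.out.println(stored);
    System.out.println("round trip ok: " + Arrays.equals(raw, decode(stored)));
  }
}
```

The stored value grows by roughly one third, but no charset is involved and no bytes need to be stripped.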

> Indexing binary content by index-html causes Solr Exception
> ---
>
> Key: NUTCH-2198
> URL: https://issues.apache.org/jira/browse/NUTCH-2198
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 2.3.1
>Reporter: Sebastian Nagel
> Fix For: 2.4
>
>
> (reported by [~kalanya] in NUTCH-2168)
> If raw binary is indexed using the plugin index-html this may cause an 
> exception in Solr:
> {noformat}
> 2016-01-05 12:28:00,152 INFO html.HtmlIndexingFilter - Html indexing for: 
> http://ujiapps.uji.es/com/investigacio/img/ciencia11.jpg
> 2016-01-05 12:28:00,163 INFO html.HtmlIndexingFilter - Html indexing for: 
> http://ujiapps.uji.es/serveis/cd/bib/reservori/2015/e-llibres/
> 2016-01-05 12:28:00,164 INFO solr.SolrIndexWriter - Adding 250 documents
> 2016-01-05 12:28:00,531 INFO solr.SolrIndexWriter - Adding 250 documents
> 2016-01-05 12:28:00,842 WARN mapred.LocalJobRunner - job_local1207147570_0001
> java.lang.Exception: 
> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was 
> class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char 
> #137317, byte #139263)
> {noformat}
> The index-html plugin tries to treat any raw content as readable content 
> converting it to a String based on the platform-dependent charset (cf. 
> [Scanner API 
> docs|http://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html]):
> {code:title=HtmlIndexingFilter.java}
> Scanner scanner = new Scanner(arrayInputStream);
> scanner.useDelimiter("\\Z");//To read all scanner content in one String
> String data = "";
> if (scanner.hasNext()) {
> data = scanner.next();
> }
> doc.add("rawcontent", StringUtil.cleanField(data));
> {code}
> The field "rawcontent" is of type "string":
> {code:xml|title=conf/schema.xml}
> 
> 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2168) Parse-tika fails to retrieve parser

2016-01-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090622#comment-15090622
 ] 

Hudson commented on NUTCH-2168:
---

SUCCESS: Integrated in Nutch-nutchgora #1545 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1545/])
NUTCH-2168 Parse-tika fails to retrieve parser (snagel: 
[http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1723851])
* 2.x/CHANGES.txt
* 2.x/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java


> Parse-tika fails to retrieve parser
> ---
>
> Key: NUTCH-2168
> URL: https://issues.apache.org/jira/browse/NUTCH-2168
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.3.1
>Reporter: Sebastian Nagel
> Fix For: 2.3.1
>
> Attachments: NUTCH-2168.patch
>
>
> The plugin parse-tika fails to parse most (all?) kinds of document types 
> (PDF, xlsx, ...) when run via ParserChecker or ParserJob:
> {noformat}
> 2015-11-12 19:14:30,903 INFO  parse.ParserJob - Parsing 
> http://localhost/pdftest.pdf
> 2015-11-12 19:14:30,905 INFO  parse.ParserFactory - ...
> 2015-11-12 19:14:30,907 ERROR tika.TikaParser - Can't retrieve Tika parser 
> for mime-type application/pdf
> 2015-11-12 19:14:30,913 WARN  parse.ParseUtil - Unable to successfully parse 
> content http://localhost/pdftest.pdf of type application/pdf
> {noformat}
> The same document is successfully parsed by TestPdfParser.





[jira] [Updated] (NUTCH-2198) Indexing binary content by index-html causes Solr Exception

2016-01-09 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2198:
---
Description: 
(reported by [~kalanya] in NUTCH-2168)
If raw binary is indexed using the plugin index-html this may cause an 
exception in Solr:
{noformat}
2016-01-05 12:28:00,152 INFO html.HtmlIndexingFilter - Html indexing for: 
http://ujiapps.uji.es/com/investigacio/img/ciencia11.jpg
2016-01-05 12:28:00,163 INFO html.HtmlIndexingFilter - Html indexing for: 
http://ujiapps.uji.es/serveis/cd/bib/reservori/2015/e-llibres/
2016-01-05 12:28:00,164 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,531 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,842 WARN mapred.LocalJobRunner - job_local1207147570_0001
java.lang.Exception: 
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was 
class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char 
#137317, byte #139263)
{noformat}

The index-html plugin tries to treat any raw content as readable content 
converting it to a String based on the platform-dependent charset (cf. [Scanner 
API docs|http://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html]):
{code:title=HtmlIndexingFilter.java}
Scanner scanner = new Scanner(arrayInputStream);
scanner.useDelimiter("\\Z");//To read all scanner content in one String
String data = "";
if (scanner.hasNext()) {
data = scanner.next();
}
doc.add("rawcontent", StringUtil.cleanField(data));
{code}

The field "rawcontent" is of type "string":
{code:xml|title=conf/schema.xml}


{code}

  was:
(reported by [~kalanya] in NUTCH-2168)
If raw binary is indexed using the plugin index-html this may cause an 
exception in Solr:
{noformat}
2016-01-05 12:28:00,152 INFO html.HtmlIndexingFilter - Html indexing for: 
http://ujiapps.uji.es/com/investigacio/img/ciencia11.jpg
2016-01-05 12:28:00,163 INFO html.HtmlIndexingFilter - Html indexing for: 
http://ujiapps.uji.es/serveis/cd/bib/reservori/2015/e-llibres/
2016-01-05 12:28:00,164 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,531 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,842 WARN mapred.LocalJobRunner - job_local1207147570_0001
java.lang.Exception: 
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was 
class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char 
#137317, byte #139263)
{noformat}

The index-html plugin tries to treat any raw content as readable content 
converting it to a String based on the platform-dependent charset (cf. [Scanner 
API docus|http://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html]):
{code:title=HtmlIndexingFilter.java}
Scanner scanner = new Scanner(arrayInputStream);
scanner.useDelimiter("\\Z");//To read all scanner content in one String
String data = "";
if (scanner.hasNext()) {
data = scanner.next();
}
doc.add("rawcontent", StringUtil.cleanField(data));
{code}

The field "rawcontent" is of type "string":
{code:xml|title=conf/schema.xml}


{code}


> Indexing binary content by index-html causes Solr Exception
> ---
>
> Key: NUTCH-2198
> URL: https://issues.apache.org/jira/browse/NUTCH-2198
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 2.3.1
>Reporter: Sebastian Nagel
> Fix For: 2.4
>
>
> (reported by [~kalanya] in NUTCH-2168)
> If raw binary is indexed using the plugin index-html this may cause an 
> exception in Solr:
> {noformat}
> 2016-01-05 12:28:00,152 INFO html.HtmlIndexingFilter - Html indexing for: 
> http://ujiapps.uji.es/com/investigacio/img/ciencia11.jpg
> 2016-01-05 12:28:00,163 INFO html.HtmlIndexingFilter - Html indexing for: 
> http://ujiapps.uji.es/serveis/cd/bib/reservori/2015/e-llibres/
> 2016-01-05 12:28:00,164 INFO solr.SolrIndexWriter - Adding 250 documents
> 2016-01-05 12:28:00,531 INFO solr.SolrIndexWriter - Adding 250 documents
> 2016-01-05 12:28:00,842 WARN mapred.LocalJobRunner - job_local1207147570_0001
> java.lang.Exception: 
> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was 
> class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char 
> #137317, byte #139263)
> {noformat}
> The index-html plugin tries to treat any raw content as readable content 
> converting it to a String based on the platform-dependent charset (cf. 
> [Scanner API 
> docs|http://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html]):
> {code:title=HtmlIndexingFilter.java}
> Scanner scanner = new Scanner(arrayInputStream);
> scanner.useDelimiter("\\

[jira] [Resolved] (NUTCH-2168) Parse-tika fails to retrieve parser

2016-01-09 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2168.

Resolution: Fixed

Committed to 2.x, r1723851. Opened NUTCH-2198 to track the problem when 
indexing the raw binary content using the plugin index-html. Thanks, [~lewismc] 
and [~kalanya], for the review!

> Parse-tika fails to retrieve parser
> ---
>
> Key: NUTCH-2168
> URL: https://issues.apache.org/jira/browse/NUTCH-2168
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.3.1
>Reporter: Sebastian Nagel
> Fix For: 2.3.1
>
> Attachments: NUTCH-2168.patch
>
>
> The plugin parse-tika fails to parse most (all?) kinds of document types 
> (PDF, xlsx, ...) when run via ParserChecker or ParserJob:
> {noformat}
> 2015-11-12 19:14:30,903 INFO  parse.ParserJob - Parsing 
> http://localhost/pdftest.pdf
> 2015-11-12 19:14:30,905 INFO  parse.ParserFactory - ...
> 2015-11-12 19:14:30,907 ERROR tika.TikaParser - Can't retrieve Tika parser 
> for mime-type application/pdf
> 2015-11-12 19:14:30,913 WARN  parse.ParseUtil - Unable to successfully parse 
> content http://localhost/pdftest.pdf of type application/pdf
> {noformat}
> The same document is successfully parsed by TestPdfParser.





[jira] [Created] (NUTCH-2198) Indexing binary content by index-html causes Solr Exception

2016-01-09 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2198:
--

 Summary: Indexing binary content by index-html causes Solr 
Exception
 Key: NUTCH-2198
 URL: https://issues.apache.org/jira/browse/NUTCH-2198
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 2.3.1
Reporter: Sebastian Nagel
 Fix For: 2.4


(reported by [~kalanya] in NUTCH-2168)
If raw binary is indexed using the plugin index-html this may cause an 
exception in Solr:
{noformat}
2016-01-05 12:28:00,152 INFO html.HtmlIndexingFilter - Html indexing for: 
http://ujiapps.uji.es/com/investigacio/img/ciencia11.jpg
2016-01-05 12:28:00,163 INFO html.HtmlIndexingFilter - Html indexing for: 
http://ujiapps.uji.es/serveis/cd/bib/reservori/2015/e-llibres/
2016-01-05 12:28:00,164 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,531 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,842 WARN mapred.LocalJobRunner - job_local1207147570_0001
java.lang.Exception: 
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was 
class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char 
#137317, byte #139263)
{noformat}

The index-html plugin tries to treat any raw content as readable content 
converting it to a String based on the platform-dependent charset (cf. [Scanner 
API docus|http://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html]):
{code:title=HtmlIndexingFilter.java}
Scanner scanner = new Scanner(arrayInputStream);
scanner.useDelimiter("\\Z");//To read all scanner content in one String
String data = "";
if (scanner.hasNext()) {
data = scanner.next();
}
doc.add("rawcontent", StringUtil.cleanField(data));
{code}

The field "rawcontent" is of type "string":
{code:xml|title=conf/schema.xml}


{code}


