[jira] [Commented] (NUTCH-2168) Parse-tika fails to retrieve parser

2016-01-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090622#comment-15090622
 ] 

Hudson commented on NUTCH-2168:
---

SUCCESS: Integrated in Nutch-nutchgora #1545 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1545/])
NUTCH-2168 Parse-tika fails to retrieve parser (snagel: 
[http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1723851])
* 2.x/CHANGES.txt
* 2.x/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java


> Parse-tika fails to retrieve parser
> ---
>
> Key: NUTCH-2168
> URL: https://issues.apache.org/jira/browse/NUTCH-2168
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.3.1
> Reporter: Sebastian Nagel
> Fix For: 2.3.1
>
> Attachments: NUTCH-2168.patch
>
>
> The plugin parse-tika fails to parse most (all?) kinds of document types 
> (PDF, xlsx, ...) when run via ParserChecker or ParserJob:
> {noformat}
> 2015-11-12 19:14:30,903 INFO  parse.ParserJob - Parsing http://localhost/pdftest.pdf
> 2015-11-12 19:14:30,905 INFO  parse.ParserFactory - ...
> 2015-11-12 19:14:30,907 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/pdf
> 2015-11-12 19:14:30,913 WARN  parse.ParseUtil - Unable to successfully parse content http://localhost/pdftest.pdf of type application/pdf
> {noformat}
> The same document is successfully parsed by TestPdfParser.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2168) Parse-tika fails to retrieve parser

2016-01-08 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090337#comment-15090337
 ] 

Lewis John McGibbney commented on NUTCH-2168:
-

+1 for commit [~wastl-nagel] nice catch and debugging!



[jira] [Commented] (NUTCH-2168) Parse-tika fails to retrieve parser

2016-01-05 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083285#comment-15083285
 ] 

Sebastian Nagel commented on NUTCH-2168:


Hi [~kalanya], it looks like the indexed raw content of the JPEGs is causing
the invalid UTF-8 character. The index-html plugin treats any raw content as
readable text, converting it to a String using the platform-dependent
charset. What happens if the field content is specified as "binary" in
schema.xml (cf. patch for NUTCH-2130)?

Without the patch applied, non-HTML documents simply fail to parse and are
never indexed. That's probably why the fix for this issue triggers the
problem with index-html. I would suggest opening a separate issue to address
the indexer-solr problem with raw content from index-html.
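
To illustrate the conversion described above, here is a minimal sketch (not
the index-html code; the class and method names are hypothetical) showing
that the first bytes of a JPEG are not well-formed UTF-8. A strict decoder
rejects them, while the lenient String constructor silently substitutes
U+FFFD replacement characters. The sketch assumes UTF-8; the plugin uses the
platform charset, so the exact result can differ per machine.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Nutch.
final class RawContentCheck {

  /** Returns true only if the bytes form well-formed UTF-8. */
  static boolean isWellFormedUtf8(byte[] raw) {
    try {
      StandardCharsets.UTF_8.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(raw));
      return true;
    } catch (CharacterCodingException e) {
      // Strict decoding failed: the content is binary or in another charset.
      return false;
    }
  }

  /** Lenient decoding, as done by new String(bytes, charset). */
  static String decodeLeniently(byte[] raw) {
    return new String(raw, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // JPEG files start with the marker bytes FF D8 FF.
    byte[] jpegStart = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
    byte[] html = "<html>ok</html>".getBytes(StandardCharsets.UTF_8);

    System.out.println("JPEG well-formed: " + isWellFormedUtf8(jpegStart));
    System.out.println("HTML well-formed: " + isWellFormedUtf8(html));
    System.out.println("Lenient decode has U+FFFD: "
        + (decodeLeniently(jpegStart).indexOf('\uFFFD') >= 0));
  }
}
```

A check of this kind could be used to skip raw binary content before it is
added as a text field; whether that is the right fix is a question for the
separate issue.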



[jira] [Commented] (NUTCH-2168) Parse-tika fails to retrieve parser

2016-01-05 Thread Auro Miralles (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082904#comment-15082904
 ] 

Auro Miralles commented on NUTCH-2168:
--

Hello. I have no idea which document fails... I can crawl without problems
with the index-html plugin disabled, but Nutch fails at the third iteration
when I enable the plugin. There are only two URLs in my seed.txt, and
ignore.external.links is set to true.

http://ujiapps.uji.es/
https://wiki.apache.org/nutch/

:~/ /bin/crawl urls/ testCrawl http://localhost:8983/solr/ 3



Parsing https://wiki.apache.org/nutch/bin/nutch%20mergelinkdb
Parsing https://wiki.apache.org/nutch/GettingNutchRunningWithDebian
Parsing http://ujiapps.uji.es/serveis/scp/accp/carnetUJI/avantatges/formacio/index.jpg
Parsing http://ujiapps.uji.es/serveis/scp/accp/carnetUJI/avantatges/compres/index.jpg
Parsing http://ujiapps.uji.es/serveis/scp/accp/carnetUJI/avantatges/productesfinancers/index.jpg
ParserJob: success
ParserJob: finished at 2016-01-05 12:27:42, time elapsed: 00:00:14
CrawlDB update for testCrawl
/home/kalanya/apache-nutch-2.3.1/runtime/local/bin/nutch updatedb -D 
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D 
mapred.reduce.tasks.speculative.execution=false -D 
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 
-all -crawlId testCrawl
DbUpdaterJob: starting at 2016-01-05 12:27:42
DbUpdaterJob: updatinging all
DbUpdaterJob: finished at 2016-01-05 12:27:49, time elapsed: 00:00:06
Indexing testCrawl on SOLR index -> http://localhost:8983/solr/
/home/kalanya/apache-nutch-2.3.1/runtime/local/bin/nutch index -D 
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D 
mapred.reduce.tasks.speculative.execution=false -D 
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 
-D solr.server.url=http://localhost:8983/solr/ -all -crawlId testCrawl
IndexingJob: starting
SolrIndexerJob: java.lang.RuntimeException: job failed: name=[testCrawl]Indexer, jobid=job_local1207147570_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:119)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)

Error running:
  /home/kalanya/apache-nutch-2.3.1/runtime/local/bin/nutch index -D 
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D 
mapred.reduce.tasks.speculative.execution=false -D 
mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 
-D solr.server.url=http://localhost:8983/solr/ -all -crawlId testCrawl
Failed with exit value 255.



HADOOP.LOG



2016-01-05 12:28:00,151 INFO  html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/serveis/alumnisauji/jasoc/alumnisaujipremium/instal_lacions/se/
2016-01-05 12:28:00,152 INFO  html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/com/investigacio/img/ciencia11.jpg
2016-01-05 12:28:00,163 INFO  html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/serveis/cd/bib/reservori/2015/e-llibres/
2016-01-05 12:28:00,164 INFO  solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,531 INFO  solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,842 WARN  mapred.LocalJobRunner - job_local1207147570_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char #137317, byte #139263)
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char #137317, byte #139263)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
at 
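
The character 0xfffe named in the exception above is U+FFFE, which lies
outside the "Char" range that XML 1.0 permits. Assuming the update request
is sent to Solr as XML (an assumption; the log does not show the request
format), a field value containing it would be rejected by the XML parser.
The following is a minimal sketch of how such a character could be located
in a field value before indexing; the class and method names are
hypothetical and are not Nutch or Solr APIs.

```java
// Hypothetical helper for illustration only; not part of Nutch or Solr.
final class XmlCharCheck {

  /** True if the code point matches the XML 1.0 "Char" production. */
  static boolean isXmlChar(int cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
  }

  /** Index of the first char that XML 1.0 forbids, or -1 if there is none. */
  static int firstIllegalIndex(String text) {
    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);
      if (!isXmlChar(cp)) {
        return i;
      }
      // Advance by one or two chars depending on the code point width.
      i += Character.charCount(cp);
    }
    return -1;
  }

  public static void main(String[] args) {
    // A made-up field value containing U+FFFE at index 2.
    String field = "ab" + (char) 0xFFFE + "c";
    System.out.println("U+FFFE allowed in XML 1.0: " + isXmlChar(0xFFFE));
    System.out.println("First illegal index: " + firstIllegalIndex(field));
  }
}
```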

[jira] [Commented] (NUTCH-2168) Parse-tika fails to retrieve parser

2016-01-04 Thread Auro Miralles (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15080972#comment-15080972
 ] 

Auro Miralles commented on NUTCH-2168:
--

Hi all! I'm pretty sure solrindex fails after applying this patch on Nutch
2.3.1 with the index-html plugin enabled. The error was... "Illegal UTF-8
character".
