[ https://issues.apache.org/jira/browse/NUTCH-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082904#comment-15082904 ]
Auro Miralles commented on NUTCH-2168:
--------------------------------------
Hello. I have no idea which document fails. I can crawl without problems with the
index-html plugin disabled, but Nutch fails at the third iteration when I enable
the plugin. There are only two URLs in my seed.txt, and ignore.external.links is
set to true (a configuration sketch follows the two seed URLs):
http://ujiapps.uji.es/
https://wiki.apache.org/nutch/
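For reference, this is roughly what that setup looks like in conf/nutch-site.xml.
The plugin.includes value below is only an assumption apart from index-html being
added to it, and "ignore.external.links on true" corresponds to the
db.ignore.external.links property:
{noformat}
<!-- conf/nutch-site.xml (sketch; everything in plugin.includes except index-html is assumed) -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|html)|indexer-solr|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
{noformat}
The crawl itself was started with: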
$ bin/crawl urls/ testCrawl http://localhost:8983/solr/ 3
....
....
....
Parsing https://wiki.apache.org/nutch/bin/nutch%20mergelinkdb
Parsing https://wiki.apache.org/nutch/GettingNutchRunningWithDebian
Parsing http://ujiapps.uji.es/serveis/scp/accp/carnetUJI/avantatges/formacio/index.jpg
Parsing http://ujiapps.uji.es/serveis/scp/accp/carnetUJI/avantatges/compres/index.jpg
Parsing http://ujiapps.uji.es/serveis/scp/accp/carnetUJI/avantatges/productesfinancers/index.jpg
ParserJob: success
ParserJob: finished at 2016-01-05 12:27:42, time elapsed: 00:00:14
CrawlDB update for testCrawl
/home/kalanya/apache-nutch-2.3.1/runtime/local/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -all -crawlId testCrawl
DbUpdaterJob: starting at 2016-01-05 12:27:42
DbUpdaterJob: updatinging all
DbUpdaterJob: finished at 2016-01-05 12:27:49, time elapsed: 00:00:06
Indexing testCrawl on SOLR index -> http://localhost:8983/solr/
/home/kalanya/apache-nutch-2.3.1/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://localhost:8983/solr/ -all -crawlId testCrawl
IndexingJob: starting
SolrIndexerJob: java.lang.RuntimeException: job failed: name=[testCrawl]Indexer, jobid=job_local1207147570_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:119)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
Error running:
/home/kalanya/apache-nutch-2.3.1/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://localhost:8983/solr/ -all -crawlId testCrawl
Failed with exit value 255.
HADOOP.LOG
....
....
....
2016-01-05 12:28:00,151 INFO html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/serveis/alumnisauji/jasoc/alumnisaujipremium/instal_lacions/se/
2016-01-05 12:28:00,152 INFO html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/com/investigacio/img/ciencia11.jpg
2016-01-05 12:28:00,163 INFO html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/serveis/cd/bib/reservori/2015/e-llibres/
2016-01-05 12:28:00,164 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,531 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,842 WARN mapred.LocalJobRunner - job_local1207147570_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char #137317, byte #139263)
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char #137317, byte #139263)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
at org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
at org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2016-01-05 12:28:01,605 ERROR indexer.IndexingJob - SolrIndexerJob: java.lang.RuntimeException: job failed: name=[testCrawl]Indexer, jobid=job_local1207147570_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:119)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
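As a note on the failure itself: 0xfffe is the noncharacter U+FFFE, which is not a
legal XML 1.0 character, so the update fails on the Solr side as soon as one crawled
page carries that code point into a field value. Purely as an illustration (this is
not Nutch or Solr code; the class and method names are made up), a filter like the
following, applied to field values before they are added to the Solr document, would
drop such code points:
{noformat}
// Illustrative sketch only -- not part of Nutch. Removes code points that
// are not valid XML 1.0 characters (U+FFFE, U+FFFF, unpaired surrogates,
// most control characters) so the Solr XML update request stays well-formed.
public class XmlCharSanitizer {

  public static String stripInvalidXmlChars(String value) {
    StringBuilder out = new StringBuilder(value.length());
    for (int i = 0; i < value.length(); ) {
      int cp = value.codePointAt(i);
      i += Character.charCount(cp);
      // XML 1.0 Char production: #x9 | #xA | #xD | [#x20-#xD7FF]
      //                          | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
      boolean valid = cp == 0x9 || cp == 0xA || cp == 0xD
          || (cp >= 0x20 && cp <= 0xD7FF)
          || (cp >= 0xE000 && cp <= 0xFFFD)
          || (cp >= 0x10000 && cp <= 0x10FFFF);
      if (valid) {
        out.appendCodePoint(cp);
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    String dirty = "title\uFFFE with a stray noncharacter";
    // Prints "title with a stray noncharacter" -- U+FFFE is dropped.
    System.out.println(stripInvalidXmlChars(dirty));
  }
}
{noformat}
In Nutch this kind of filtering would presumably belong somewhere in the indexing
path (for example before the document fields are handed to SolrIndexWriter); the
sketch is only meant to show which code points the XML update format accepts.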
> Parse-tika fails to retrieve parser
> -----------------------------------
>
> Key: NUTCH-2168
> URL: https://issues.apache.org/jira/browse/NUTCH-2168
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.3.1
> Reporter: Sebastian Nagel
> Fix For: 2.3.1
>
> Attachments: NUTCH-2168.patch
>
>
> The plugin parse-tika fails to parse most (all?) kinds of document types
> (PDF, xlsx, ...) when run via ParserChecker or ParserJob:
> {noformat}
> 2015-11-12 19:14:30,903 INFO parse.ParserJob - Parsing http://localhost/pdftest.pdf
> 2015-11-12 19:14:30,905 INFO parse.ParserFactory - ...
> 2015-11-12 19:14:30,907 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/pdf
> 2015-11-12 19:14:30,913 WARN parse.ParseUtil - Unable to successfully parse content http://localhost/pdftest.pdf of type application/pdf
> {noformat}
> The same document is successfully parsed by TestPdfParser.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)