[
https://issues.apache.org/jira/browse/NUTCH-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15423492#comment-15423492
]
Jose-Marcio Martins commented on NUTCH-2269:
--------------------------------------------
Hello. This is from a message I posted to the nutch-users discussion list on
Jun 07, 2016. Nobody answered.
I tried older Solr releases, but the problem remains. So I tried to rebuild
the crawl data (and the Solr data too) from scratch, incrementally, to see at
what point the problem appears.
Here is the content of my message to the list:
To find out what could trigger the problem on "clean", I worked
incrementally, and I found that the problem is triggered when Nutch tries to
clean the following URLs from Solr:
********************************************************************************************
[nutch@crawler crawldb]$ ../../../../devel/show-urls part-00000 | grep gone
db_gone http://www.armines.net/0.85
db_gone http://www.armines.net/1.8
db_gone http://www.armines.net/agenda/3%C3%A8me-a%C3%A9rogels
db_gone http://www.armines.net/agenda/chercheurs-3d
db_gone http://www.armines.net/agenda/rencontres-2016
db_gone http://www.armines.net/association-armines/chiffres-dactivit%C3%A9
db_gone http://www.armines.net/associations-reseaux
db_gone http://www.armines.net/carnot-mines-tv/sciences-mat%C3%A9riaux/extinguo
db_gone http://www.armines.net/centres-thematiques/%C3%A9conomie-management-soci%C3%A9t%C3%A9
db_gone http://www.armines.net/centres-thematiques/%C3%A9nerg%C3%A9tique-proc%C3%A9d%C3%A9s
db_gone http://www.armines.net/centres-thematiques/math%C3%A9matiques-9
db_gone http://www.armines.net/centres-thematiques/sciences-lenvironnement
db_gone http://www.armines.net/centres-thematiques/sciences-mat%C3%A9riaux
db_gone http://www.armines.net/domaines-dapplication/energie-durable
db_gone http://www.armines.net/domaines-dapplication/transformation-mati%C3%A8re
db_gone http://www.armines.net/fr/grid4eu-solutions
db_gone http://www.armines.net/text/javascript
[nutch@crawler crawldb]$
Is it possible that the problem comes from the percent-encoded URLs (with %XY)?
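One way to narrow this down would be to separate the percent-encoded entries from the plain ones and clean each group on its own. The following is only a minimal sketch under stated assumptions: it expects the "db_gone <url>" text format shown in the listing above, and `split_gone_urls` is a hypothetical helper name, not part of Nutch or of the show-urls script.

```python
import re
from urllib.parse import unquote

# Matches one percent-encoded byte such as %C3 or %A9.
PERCENT_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def split_gone_urls(listing):
    """Hypothetical helper: split 'db_gone <url>' entries into
    (percent-encoded URLs, plain URLs)."""
    encoded, plain = [], []
    # Splitting on any whitespace also handles a URL that was wrapped
    # onto the line after its db_gone status.
    tokens = listing.split()
    for i, token in enumerate(tokens[:-1]):
        if token != "db_gone":
            continue
        url = tokens[i + 1]
        if PERCENT_ESCAPE.search(url):
            encoded.append(url)
        else:
            plain.append(url)
    return encoded, plain


# A few entries copied from the listing above.
sample = """\
db_gone http://www.armines.net/0.85
db_gone http://www.armines.net/agenda/3%C3%A8me-a%C3%A9rogels
db_gone
http://www.armines.net/carnot-mines-tv/sciences-mat%C3%A9riaux/extinguo
db_gone http://www.armines.net/text/javascript
"""

encoded, plain = split_gone_urls(sample)
for url in encoded:
    # unquote() decodes the UTF-8 percent escapes for easier reading.
    print(url, "->", unquote(url))
```

Note that the listing contains both kinds of URL (for example http://www.armines.net/0.85 has no %XY sequence), so the listing alone does not show whether the encoded entries are the ones that matter.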
> Clean not working after crawl
> -----------------------------
>
> Key: NUTCH-2269
> URL: https://issues.apache.org/jira/browse/NUTCH-2269
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 1.12
> Environment: Vagrant, Ubuntu, Java 8, Solr 4.10
> Reporter: Francesco Capponi
> Fix For: 1.13
>
>
> I have been having this problem for a while, and I had to roll back to the
> old solr clean instead of the newer version.
> Once it has correctly inserted/updated every document in Nutch, it tries to
> clean and returns error 255:
> {quote}
> 2016-05-30 10:13:04,992 WARN output.FileOutputCommitter - Output Path is null in setupJob()
> 2016-05-30 10:13:07,284 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
> 2016-05-30 10:13:08,114 INFO solr.SolrMappingReader - source: content dest: content
> 2016-05-30 10:13:08,114 INFO solr.SolrMappingReader - source: title dest: title
> 2016-05-30 10:13:08,114 INFO solr.SolrMappingReader - source: host dest: host
> 2016-05-30 10:13:08,114 INFO solr.SolrMappingReader - source: segment dest: segment
> 2016-05-30 10:13:08,114 INFO solr.SolrMappingReader - source: boost dest: boost
> 2016-05-30 10:13:08,114 INFO solr.SolrMappingReader - source: digest dest: digest
> 2016-05-30 10:13:08,114 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
> 2016-05-30 10:13:08,133 INFO solr.SolrIndexWriter - SolrIndexer: deleting 15/15 documents
> 2016-05-30 10:13:08,919 WARN output.FileOutputCommitter - Output Path is null in cleanupJob()
> 2016-05-30 10:13:08,937 WARN mapred.LocalJobRunner - job_local662730477_0001
> java.lang.Exception: java.lang.IllegalStateException: Connection pool shut down
> at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: java.lang.IllegalStateException: Connection pool shut down
> at org.apache.http.util.Asserts.check(Asserts.java:34)
> at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:169)
> at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:202)
> at org.apache.http.impl.conn.PoolingClientConnectionManager.requestConnection(PoolingClientConnectionManager.java:184)
> at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:415)
> at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
> at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
> at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
> at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:480)
> at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241)
> at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230)
> at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:150)
> at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:483)
> at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:464)
> at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:190)
> at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:178)
> at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
> at org.apache.nutch.indexer.CleaningJob$DeleterReducer.close(CleaningJob.java:120)
> at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:237)
> at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
> at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> 2016-05-30 10:13:09,299 ERROR indexer.CleaningJob - CleaningJob: java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
> at org.apache.nutch.indexer.CleaningJob.delete(CleaningJob.java:172)
> at org.apache.nutch.indexer.CleaningJob.run(CleaningJob.java:195)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.nutch.indexer.CleaningJob.main(CleaningJob.java:206)
> {quote}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)