Abe-san,

Thank you for the info. That's a good idea. I hope I can avoid the job
interruption this way.

Regards,
Shigeki

2012/3/19 Shinichiro Abe <[email protected]>

> Hi,
>
> Currently MCF can't ignore a 500 server error caused by Solr.
> If you can upgrade to Solr 3.2, you can specify ignoreTikaException.
> https://issues.apache.org/jira/browse/SOLR-2480
> Hope that helps.
>
> Regards,
> Shinichiro Abe
>
> On 2012/03/19, at 12:55, Shigeki Kobayashi wrote:
>
> > Karl,
> >
> > Thanks for your reply.
> >
> > It seems that Tika failed to extract text from PDF files while the
> > job was crawling web links. I confirmed that Tika exceptions
> > followed the Solr exceptions.
> >
> > So Solr, on detecting the Tika exception, returns status code 500,
> > and MCF then retries the ingestion a certain number of times:
> >
> > "500 from ingestion request; ingestion will be retried again later"
> >
> > In the end, MCF shuts down the entire job.
> >
> > I know I should upgrade Solr (including Tika) to improve document
> > extraction. But since the current version of Tika still sometimes
> > fails to extract documents anyway, I feel it would make more sense
> > for MCF to ignore such Tika-caused ingestion errors and proceed.
> >
> > Have users requested that MCF ignore document ingestion failures
> > caused by Tika and proceed, perhaps for the next release?
> >
> > Is there any option that lets users have MCF ignore such ingestion
> > errors and proceed?
> >
> > regards,
> >
> > Shigeki
> >
> > 2012/3/16 Karl Wright <[email protected]>
> > Hi Shigeki,
> >
> > A "service interruption" means that a connector (either a repository
> > connector like the web connector or an output connector like the Solr
> > connector) could not communicate with the configured service.
> >
> > "Repeated service interruptions" means that certain URLs failed to
> > fetch properly even after a pattern of retries which lasted many
> > hours.
> > ManifoldCF connectors deal with such errors in one of several
> > ways, depending on the exact details of the error:
> >
> > - ignore it and proceed
> > - retry periodically for some time interval, and then give up and proceed
> > - retry periodically for some time interval, and then shut down the job
> >
> > It sounds like your job has encountered one of the latter errors. The
> > "Error: Repeated service interruptions - failure processing document:
> > Ingestion HTTP error code 500" indicates that the problem is due to
> > communication with Solr. Apparently certain documents you are
> > indexing are causing Solr to return an error code 500, which is an
> > "internal server error", and is usually associated with a Solr
> > exception. You will need to diagnose why this is, and take corrective
> > steps, in order for your ManifoldCF job to complete successfully.
> >
> > "Job no longer active" is harmless - it's a side effect of the job
> > shutting down. When a job is shutting down, active document
> > processing cannot always be interrupted within a connector, but the
> > framework helps it to stop quickly by throwing this exception.
> >
> > Thanks,
> > Karl
> >
> >
> > 2012/3/16 Shigeki Kobayashi (Information Systems Division / Service Planning Department) <[email protected]>:
> > >
> > > I was crawling web sites with links to HTML and PDF files using the
> > > provided multiprocess-example agent for a few hours, when Simple
> > > History started showing a -104 result code with the message
> > > "Interrupted: Job no longer active".
> > >
> > > After the same error occurred around 40 times, the job status
> > > became "Aborting" and then ended up as "Error: Repeated service
> > > interruptions - failure processing document: Ingestion HTTP error
> > > code 500".
> > >
> > > The job was interrupted and stopped.
> > >
> > > Does anyone know what situation causes "Repeated service
> > > interruptions" and makes jobs stop?
> > > Also, under what circumstances does error status code -104 occur?
> > > What is the meaning of the code -104?
> > >
> > > If you have any ideas, please advise me on how to avoid this error.
> > >
> > >
> > > I am using the following:
> > >
> > > Solr 1.4 (Extracting Request Handler is set)
> > > ManifoldCF 0.4 (multiprocess-example)
> > >  - Repository connector: WEB
> > >  - Output connector: Solr
> > > Tomcat 6.0.29
> > > PostgreSQL 9.1.3
> > >
> > >
> > > Here is MCF's debug log right before the job was interrupted:
> > >
> > > DEBUG 2012-03-15 20:04:16,325 (Worker thread '4') - WEB: Attempting to get connection to http://xx.xx.xx.xx:80 (95697 ms)
> > > DEBUG 2012-03-15 20:04:16,325 (Worker thread '4') - WEB: Waiting 3895 ms before starting fetch on http://xx.xx.xx.xx:80
> > > DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Attempting to get connection to http://xx.xx.xx.xx:80 (99593 ms)
> > > DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Successfully got connection to http://xx.xx.xx.xx:80 (99593 ms)
> > > DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Waiting for an HttpClient object
> > > DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Got an HttpClient object after 0 ms.
> > > DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Get method for '/xx/xx.pdf'
> > > DEBUG 2012-03-15 20:04:20,222 (Worker thread '4') - WEB: For http://xx.xx/xx/xx.pdf, setting virtual host to xx.xx
> > > DEBUG 2012-03-15 20:04:20,315 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 128 ms.
> > > DEBUG 2012-03-15 20:04:20,445 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > > DEBUG 2012-03-15 20:04:20,509 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > > DEBUG 2012-03-15 20:04:20,573 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > > DEBUG 2012-03-15 20:04:20,637 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > > DEBUG 2012-03-15 20:04:20,701 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > > DEBUG 2012-03-15 20:04:20,765 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > > DEBUG 2012-03-15 20:04:20,829 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > > DEBUG 2012-03-15 20:04:20,893 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > > DEBUG 2012-03-15 20:04:20,957 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > > DEBUG 2012-03-15 20:04:21,021 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > > DEBUG 2012-03-15 20:04:21,085 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > > DEBUG 2012-03-15 20:04:21,149 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > > DEBUG 2012-03-15 20:04:21,213 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > > DEBUG 2012-03-15 20:04:21,277 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
> > > INFO 2012-03-15 20:04:21,344 (Worker thread '4') - WEB: FETCH URL|http://xx.xx/xx/xx.pdf|1331809460221+1122|-104|65536|org.apache.manifoldcf.core.interfaces.ManifoldCFException|Interrupted: Job no longer active
> > > DEBUG 2012-03-15 20:04:21,344 (Worker thread '4') - WEB: Fetch exception for 'http://xx.xx/xx/xx.pdf'
> > > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Interrupted: Job no longer active
> > >     at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection.noteInterrupted(ThrottledFetcher.java:1735)
> > >     at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:743)
> > >     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:318)
> > > Caused by: org.apache.manifoldcf.agents.interfaces.ServiceInterruption: Job no longer active
> > >     at org.apache.manifoldcf.crawler.system.WorkerThread$VersionActivity.checkJobStillActive(WorkerThread.java:1223)
> > >     at org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:135)
> > >     at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:713)
> > >     ... 1 more
> > > WARN 2012-03-15 20:04:21,345 (Worker thread '4') - Pre-ingest service interruption reported for job 1331716457096 connection 'web': Job no longer active
> > > DEBUG 2012-03-15 20:04:23,871 (Job reset thread) - Stopped job 1331716457096
> > > DEBUG 2012-03-15 20:04:24,236 (Job notification thread) - Found job 1331716457096 in need of notification
> > >
> >
> > --
> > ~~~~~~~~~~~~~~~~~~~~~~~~
> > SoftBank Mobile Corp.
> > Information Systems Division
> > System Services Business Management Department
> > Service Planning Department
> >
> > Shigeki Kobayashi
> > [email protected]
> > ~~~~~~~~~~~~~~~~~~~~~~~~
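
The three ways Karl says a connector can handle a failing document (ignore and proceed; retry, then give up and proceed; retry, then shut down the job) can be illustrated with a small Java sketch. This is only an illustration of the idea, not ManifoldCF code: the class, the enums, the `ingest` method, and the rule that 4xx responses are skipped immediately are all assumptions made up for this example.

```java
import java.util.function.IntSupplier;

/** Sketch only: hypothetical names, not ManifoldCF's real API. */
class IngestionRetrySketch {

    /** What finally happened to one document. */
    enum Outcome { INGESTED, SKIPPED, JOB_ABORTED }

    /** Hypothetical setting: what to do when every retry has failed. */
    enum OnExhausted { SKIP_DOCUMENT, ABORT_JOB }

    /**
     * Tries to ingest one document.
     * sendToSolr stands in for the HTTP request and returns its status code.
     */
    static Outcome ingest(IntSupplier sendToSolr, int maxAttempts, OnExhausted policy) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            int status = sendToSolr.getAsInt();
            if (status >= 200 && status < 300) {
                return Outcome.INGESTED;
            }
            if (status >= 400 && status < 500) {
                // Assumption for this sketch: client errors will not improve
                // on retry, so "ignore it and proceed".
                return Outcome.SKIPPED;
            }
            // Anything else (such as 500) is retried. A real crawler would
            // wait between attempts instead of looping immediately.
        }
        // Retries exhausted: give up on the document, or shut down the job.
        return policy == OnExhausted.SKIP_DOCUMENT ? Outcome.SKIPPED : Outcome.JOB_ABORTED;
    }

    public static void main(String[] args) {
        // A document that always triggers a 500, as in this thread.
        System.out.println(ingest(() -> 500, 3, OnExhausted.ABORT_JOB));
        System.out.println(ingest(() -> 500, 3, OnExhausted.SKIP_DOCUMENT));
    }
}
```

In the terms of this sketch, the behaviour reported in the thread corresponds to `ABORT_JOB` for a persistent 500 from Solr, while the behaviour Shigeki asks for corresponds to `SKIP_DOCUMENT`.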
