I've committed a fix to how the WorkerThread handles service interruptions. This should eliminate the "Unexpected jobqueue status" exception. Could you confirm that it does?
After that, I believe you will have to look at your Tika setup on Solr to figure out how to avoid having PDFs blow up the pipeline. You should confirm first that Tika is indeed throwing an exception when a PDF is sent to it, of course, and that Solr is closing the http connection under those conditions.

Thanks,
Karl

On Tue, Aug 14, 2012 at 1:28 AM, Karl Wright <[email protected]> wrote:
> There are two different issues here. The first one is that you are
> having a connection close on you; not sure the reason why, but could
> potentially be caused by a Tika exception in Solr. The second is that
> the refactored WorkerThread code I checked in Sunday might have a bug
> in handling exceptions of this kind.
>
> I'll have a look at these and get back to you shortly.
>
> Karl
>
> On Mon, Aug 13, 2012 at 10:28 PM, Ahmet Arslan <[email protected]> wrote:
>> If I modify my Path Rules to index only *.doc and *.docx files, I can
>> re-index over and over without restarting anything. Everything works fine.
>> It seems that there is a problem with non text extractable files.
>>
>> /Documents/*.doc file include
>> /Documents/*.docx file include
>>
>> --- On Tue, 8/14/12, Ahmet Arslan <[email protected]> wrote:
>>
>>> From: Ahmet Arslan <[email protected]>
>>> Subject: Re: SharePoint: Error closing connection to file
>>> To: [email protected]
>>> Date: Tuesday, August 14, 2012, 5:20 AM
>>>
>>> Also after this, when i hit "View Repository Connection
>>> Status" i get :
>>>
>>> Got an unknown remote exception accessing site - axis fault
>>> = Server.userException, detail =
>>> java.net.UnknownHostException: null
>>>
>>> I restart mcf, I get "Connection status: Connection working"
>>> at "View Repository Connection Status" page.
>>>
>>> --- On Tue, 8/14/12, Ahmet Arslan <[email protected]> wrote:
>>>
>>> > From: Ahmet Arslan <[email protected]>
>>> > Subject: SharePoint: Error closing connection to file
>>> > To: [email protected]
>>> > Date: Tuesday, August 14, 2012, 5:18 AM
>>> > Hello,
>>> >
>>> > Using solr output connector and SP2010 Repository connector,
>>> > I am indexing a document library named Documents. This
>>> > library has some scanned pdf documents. Very First crawl
>>> > indexes all 91 docs.
>>> > When I hit "Re-ingest all associated documents" and start
>>> > second crawl, I get : "Error: Unexpected jobqueue status -
>>> > record id 1344907007021, expecting active status, saw 3"
>>> >
>>> > Here is the stack trace:
>>> > When i look at
>>> > http://iknowtest/Documents/ik_docs/vize_evraklari/ticaret_sicil_gazetesi.pdf,
>>> > it is an image (scanned) pdf.
>>> >
>>> > WARN 2012-08-14 05:13:22,068 (Worker thread '39') - SharePoint: Error closing connection to file
>>> > 'http://iknowtest/Documents/ik_docs/vize_evraklari/ticaret_sicil_gazetesi.pdf': Connection reset
>>> > java.net.SocketException: Connection reset
>>> >     at java.net.SocketInputStream.read(SocketInputStream.java:113)
>>> >     at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>> >     at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>>> >     at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>>> >     at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
>>> >     at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
>>> >     at org.apache.commons.httpclient.ChunkedInputStream.exhaustInputStream(Unknown Source)
>>> >     at org.apache.commons.httpclient.ContentLengthInputStream.close(Unknown Source)
>>> >     at java.io.FilterInputStream.close(FilterInputStream.java:155)
>>> >     at org.apache.commons.httpclient.AutoCloseInputStream.notifyWatcher(Unknown Source)
>>> >     at org.apache.commons.httpclient.AutoCloseInputStream.close(Unknown Source)
>>> >     at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1457)
>>> >     at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>>> >     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:549)
>>> > DEBUG 2012-08-14 05:13:22,072 (Worker thread '42') - SharePoint: Path attribute name is null
>>> > WARN 2012-08-14 05:13:22,081 (Worker thread '39') - SharePoint: IOException thrown: Connection reset
>>> > java.net.SocketException: Connection reset
>>> >     at java.net.SocketInputStream.read(SocketInputStream.java:168)
>>> >     at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>>> >     at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>>> >     at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
>>> >     at java.io.FilterInputStream.read(FilterInputStream.java:116)
>>> >     at org.apache.commons.httpclient.AutoCloseInputStream.read(Unknown Source)
>>> >     at java.io.FilterInputStream.read(FilterInputStream.java:90)
>>> >     at org.apache.commons.httpclient.AutoCloseInputStream.read(Unknown Source)
>>> >     at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1447)
>>> >     at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
>>> >     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:549)
>>> > WARN 2012-08-14 05:13:22,186 (Worker thread '39') - Service interruption reported for job 1344906886879 connection
>>> > 'SP2010': SharePoint is down attempting to read
>>> > 'http://iknowtest/Documents/ik_docs/vize_evraklari/ticaret_sicil_gazetesi.pdf',
>>> > retrying: Connection reset
>>> > ERROR 2012-08-14 05:13:22,230 (Worker thread '39') - Exception tossed: Unexpected jobqueue status - record id
>>> > 1344907007021, expecting active status, saw 3
>>> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unexpected jobqueue status - record id 1344907007021,
>>> > expecting active status, saw 3
>>> >     at org.apache.manifoldcf.crawler.jobs.JobQueue.updateCompletedRecord(JobQueue.java:711)
>>> >     at org.apache.manifoldcf.crawler.jobs.JobManager.markDocumentCompletedMultiple(JobManager.java:2435)
>>> >     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:745)
>>> >
>>>
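
One way to run the Tika check suggested at the top of this message is to post a single scanned PDF straight to Solr, outside of ManifoldCF, and see whether Solr answers with an error status or drops the connection. The following is only a minimal sketch under assumptions: the class name `SolrExtractProbe`, the default URL (host, port, and the `/update/extract` handler path with a `literal.id` parameter), and the `probe-doc` id are placeholders for illustration and must be adjusted to match your Solr configuration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

class SolrExtractProbe {
    // Assumed default; replace with the extract handler URL of your own Solr instance.
    static final String DEFAULT_URL =
        "http://localhost:8983/solr/update/extract?literal.id=probe-doc";

    /** Maps an HTTP status code to a coarse outcome label. */
    static String classify(int status) {
        if (status >= 200 && status < 300) return "OK";
        if (status >= 500) return "SERVER_ERROR";
        return "OTHER";
    }

    /**
     * POSTs the given bytes as application/pdf and reports what happened.
     * An IOException (for example "Connection reset") is reported as
     * CONNECTION_FAILED instead of being rethrown.
     */
    static String probe(String extractUrl, byte[] body) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) new URL(extractUrl).openConnection();
            conn.setConnectTimeout(10000);
            conn.setReadTimeout(60000);
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/pdf");
            conn.setFixedLengthStreamingMode(body.length);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body);
            }
            int status = conn.getResponseCode();
            return classify(status) + " (HTTP " + status + ")";
        } catch (IOException e) {
            return "CONNECTION_FAILED: " + e;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java SolrExtractProbe <file.pdf> [extract-url]");
            return;
        }
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        String url = args.length > 1 ? args[1] : DEFAULT_URL;
        System.out.println(probe(url, pdf));
    }
}
```

If the probe reports CONNECTION_FAILED for the scanned PDF but OK for a .doc file, that would be consistent with Solr closing the connection on that document; the Solr log should then show whether a Tika exception is the cause.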
