Hi Karl,

Somehow those scanned PDF files do not throw an exception. I tried sending them using curl:

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@ticaret_sicil_gazetesi.pdf"

There is no exception in the Solr logs, and the file is indexed. But when I do this, a Java coffee icon appears in the Dock. I don't know what this is. I will investigate further on the Tika/Solr side. Thanks for your support on this.

Anyway, I still sometimes get:

"Got an unknown remote exception accessing site - axis fault = Server.userException, detail = java.net.UnknownHostException: null"

I see the following entries in manifoldcf.log:

WARN 2012-08-14 17:39:41,099 (Thread-10418) - Cookie rejected: "$Version=0; http%3A%2F%2Fiknowtest%2FDiscovery=WorkspaceSiteName=SUtOb3c=&WorkspaceSiteUrl=aHR0cDovL2lrbm93dGVzdA==&WorkspaceSiteTime=MjAxMi0wOC0xNFQxNDozOTo0MQ==; $Path=/_vti_bin/Discovery.asmx". Illegal path attribute "/_vti_bin/Discovery.asmx". Path of origin: "/Pages/denemeIkGeneralPage0712-6740.aspx"
FATAL 2012-08-14 17:55:55,096 (Startup thread) - Error tossed: null
java.lang.NullPointerException
        at org.apache.manifoldcf.crawler.interfaces.QueueTracker$PriorityKey.hashCode(QueueTracker.java:726)
        at java.util.HashMap.get(HashMap.java:300)
        at org.apache.manifoldcf.crawler.interfaces.QueueTracker.calculatePriority(QueueTracker.java:518)
        at org.apache.manifoldcf.crawler.system.SeedingActivity.writeSeedDocuments(SeedingActivity.java:225)
        at org.apache.manifoldcf.crawler.system.SeedingActivity.doneSeeding(SeedingActivity.java:165)
        at org.apache.manifoldcf.crawler.system.StartupThread.run(StartupThread.java:181)

--- On Tue, 8/14/12, Karl Wright <[email protected]> wrote:

> From: Karl Wright <[email protected]>
> Subject: Re: SharePoint: Error closing connection to file
> To: [email protected]
> Date: Tuesday, August 14, 2012, 9:32 AM
> I've committed a fix to how the WorkerThread handles service
> interruptions. This should eliminate the "unexpected value"
> exception. Could you confirm that it does?
>
> After that, I believe you will have to look at your Tika setup on Solr
> to figure out how to avoid having PDFs blow up the pipeline. You
> should confirm first that Tika is indeed throwing an exception when a
> PDF is sent to it, of course, and that Solr is closing the http
> connection under those conditions.
>
> Thanks,
> Karl
>
> On Tue, Aug 14, 2012 at 1:28 AM, Karl Wright <[email protected]> wrote:
> > There are two different issues here. The first one is that you are
> > having a connection close on you; not sure the reason why, but could
> > potentially be caused by a Tika exception in Solr. The second is that
> > the refactored WorkerThread code I checked in Sunday might have a bug
> > in handling exceptions of this kind.
> >
> > I'll have a look at these and get back to you shortly.
> >
> > Karl
> >
> > On Mon, Aug 13, 2012 at 10:28 PM, Ahmet Arslan <[email protected]> wrote:
> >> If I modify my Path Rules to index only *.doc and *.docx files, I can
> >> re-index over and over without restarting anything. Everything works fine.
> >> It seems that there is a problem with non text extractable files.
> >>
> >> /Documents/*.doc file include
> >> /Documents/*.docx file include
> >>
> >> --- On Tue, 8/14/12, Ahmet Arslan <[email protected]> wrote:
> >>
> >>> From: Ahmet Arslan <[email protected]>
> >>> Subject: Re: SharePoint: Error closing connection to file
> >>> To: [email protected]
> >>> Date: Tuesday, August 14, 2012, 5:20 AM
> >>>
> >>> Also after this, when i hit "View Repository Connection
> >>> Status" i get :
> >>>
> >>> Got an unknown remote exception accessing site - axis fault
> >>> = Server.userException, detail =
> >>> java.net.UnknownHostException: null
> >>>
> >>> I restart mcf, I get "Connection status: Connection working"
> >>> at "View Repository Connection Status" page.
> >>>
> >>> --- On Tue, 8/14/12, Ahmet Arslan <[email protected]> wrote:
> >>>
> >>> > From: Ahmet Arslan <[email protected]>
> >>> > Subject: SharePoint: Error closing connection to file
> >>> > To: [email protected]
> >>> > Date: Tuesday, August 14, 2012, 5:18 AM
> >>> > Hello,
> >>> >
> >>> > Using solr output connector and SP2010 Repository connector,
> >>> > I am indexing a document library named Documents. This
> >>> > library has some scanned pdf documents. Very First crawl
> >>> > indexes all 91 docs.
> >>> > When I hit "Re-ingest all associated documents" and start
> >>> > second crawl, I get : "Error: Unexpected jobqueue status -
> >>> > record id 1344907007021, expecting active status, saw 3"
> >>> >
> >>> > Here is the stack trace:
> >>> > When i look at
> >>> > http://iknowtest/Documents/ik_docs/vize_evraklari/ticaret_sicil_gazetesi.pdf,
> >>> > it is an image (scanned) pdf.
> >>> >
> >>> > WARN 2012-08-14 05:13:22,068 (Worker thread '39') -
> >>> > SharePoint: Error closing connection to file
> >>> > 'http://iknowtest/Documents/ik_docs/vize_evraklari/ticaret_sicil_gazetesi.pdf':
> >>> > Connection reset
> >>> > java.net.SocketException: Connection reset
> >>> >     at java.net.SocketInputStream.read(SocketInputStream.java:113)
> >>> >     at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> >>> >     at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
> >>> >     at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> >>> >     at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
> >>> >     at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
> >>> >     at org.apache.commons.httpclient.ChunkedInputStream.exhaustInputStream(Unknown Source)
> >>> >     at org.apache.commons.httpclient.ContentLengthInputStream.close(Unknown Source)
> >>> >     at java.io.FilterInputStream.close(FilterInputStream.java:155)
> >>> >     at org.apache.commons.httpclient.AutoCloseInputStream.notifyWatcher(Unknown Source)
> >>> >     at org.apache.commons.httpclient.AutoCloseInputStream.close(Unknown Source)
> >>> >     at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1457)
> >>> >     at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
> >>> >     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:549)
> >>> > DEBUG 2012-08-14 05:13:22,072 (Worker thread '42') -
> >>> > SharePoint: Path attribute name is null
> >>> > WARN 2012-08-14 05:13:22,081 (Worker thread '39') -
> >>> > SharePoint: IOException thrown: Connection reset
> >>> > java.net.SocketException: Connection reset
> >>> >     at java.net.SocketInputStream.read(SocketInputStream.java:168)
> >>> >     at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> >>> >     at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> >>> >     at org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown Source)
> >>> >     at java.io.FilterInputStream.read(FilterInputStream.java:116)
> >>> >     at org.apache.commons.httpclient.AutoCloseInputStream.read(Unknown Source)
> >>> >     at java.io.FilterInputStream.read(FilterInputStream.java:90)
> >>> >     at org.apache.commons.httpclient.AutoCloseInputStream.read(Unknown Source)
> >>> >     at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1447)
> >>> >     at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
> >>> >     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:549)
> >>> > WARN 2012-08-14 05:13:22,186 (Worker thread '39') - Service
> >>> > interruption reported for job 1344906886879 connection
> >>> > 'SP2010': SharePoint is down attempting to read
> >>> > 'http://iknowtest/Documents/ik_docs/vize_evraklari/ticaret_sicil_gazetesi.pdf',
> >>> > retrying: Connection reset
> >>> > ERROR 2012-08-14 05:13:22,230 (Worker thread '39') -
> >>> > Exception tossed: Unexpected jobqueue status - record id
> >>> > 1344907007021, expecting active status, saw 3
> >>> > org.apache.manifoldcf.core.interfaces.ManifoldCFException:
> >>> > Unexpected jobqueue status - record id 1344907007021,
> >>> > expecting active status, saw 3
> >>> >     at org.apache.manifoldcf.crawler.jobs.JobQueue.updateCompletedRecord(JobQueue.java:711)
> >>> >     at org.apache.manifoldcf.crawler.jobs.JobManager.markDocumentCompletedMultiple(JobManager.java:2435)
> >>> >     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:745)
> >>> >
> >>> >
