Hi Karl,

Somehow those scanned PDF files do not throw an exception.
I tried sending them using curl:

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@ticaret_sicil_gazetesi.pdf"
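
To repeat this check over several files, the same request can be scripted. Below is a minimal Python sketch, assuming the same Solr base URL, the same literal.id/commit parameters and the same "myfile" form field as the curl call above. The function names (build_extract_url, build_multipart, post_pdf) are made up for illustration and are not part of Solr or ManifoldCF.

```python
import os
import urllib.parse
import urllib.request
import uuid


def build_extract_url(base, doc_id, commit=True):
    """Build the /update/extract URL with the same parameters as the curl call."""
    query = urllib.parse.urlencode({
        "literal.id": doc_id,
        "commit": "true" if commit else "false",
    })
    return f"{base}/update/extract?{query}"


def build_multipart(field, filename, payload, boundary):
    """Encode one file as a multipart/form-data body (roughly what curl -F sends)."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail


def post_pdf(base, doc_id, path):
    """Send the PDF to Solr and return (HTTP status, response body).

    urllib raises urllib.error.HTTPError for 4xx/5xx responses, so a Tika
    failure reported by Solr would surface as an exception here.
    """
    boundary = uuid.uuid4().hex
    with open(path, "rb") as fh:
        body = build_multipart("myfile", os.path.basename(path), fh.read(), boundary)
    req = urllib.request.Request(
        build_extract_url(base, doc_id),
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status, resp.read().decode("utf-8", "replace")
```

Calling post_pdf("http://localhost:8983/solr", "doc1", "ticaret_sicil_gazetesi.pdf") would send the same document as the curl command. A dropped connection would show up as a URLError or a socket error rather than a normal response.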

There is no exception in the Solr logs, and the file is indexed. But when I do
this, the Java coffee-cup icon appears in the Dock. I don't know what this is.
I will investigate further on the Tika/Solr side.

Thanks for your support on this.

Anyway, I still sometimes get:
"Got an unknown remote exception accessing site - axis fault = 
Server.userException, detail = java.net.UnknownHostException: null"

I see the following entries in manifoldcf.log:

 
 WARN 2012-08-14 17:39:41,099 (Thread-10418) - Cookie rejected: "$Version=0; 
http%3A%2F%2Fiknowtest%2FDiscovery=WorkspaceSiteName=SUtOb3c=&WorkspaceSiteUrl=aHR0cDovL2lrbm93dGVzdA==&WorkspaceSiteTime=MjAxMi0wOC0xNFQxNDozOTo0MQ==;
 $Path=/_vti_bin/Discovery.asmx". Illegal path attribute 
"/_vti_bin/Discovery.asmx". Path of origin: 
"/Pages/denemeIkGeneralPage0712-6740.aspx"


FATAL 2012-08-14 17:55:55,096 (Startup thread) - Error tossed: null
java.lang.NullPointerException
        at org.apache.manifoldcf.crawler.interfaces.QueueTracker$PriorityKey.hashCode(QueueTracker.java:726)
        at java.util.HashMap.get(HashMap.java:300)
        at org.apache.manifoldcf.crawler.interfaces.QueueTracker.calculatePriority(QueueTracker.java:518)
        at org.apache.manifoldcf.crawler.system.SeedingActivity.writeSeedDocuments(SeedingActivity.java:225)
        at org.apache.manifoldcf.crawler.system.SeedingActivity.doneSeeding(SeedingActivity.java:165)
        at org.apache.manifoldcf.crawler.system.StartupThread.run(StartupThread.java:181)

--- On Tue, 8/14/12, Karl Wright <[email protected]> wrote:

> From: Karl Wright <[email protected]>
> Subject: Re: SharePoint: Error closing connection to file
> To: [email protected]
> Date: Tuesday, August 14, 2012, 9:32 AM
> I've committed a fix to how the
> WorkerThread handles service
> interruptions.  This should eliminate the "unexpected
> value"
> exception.  Could you confirm that it does?
> 
> After that, I believe you will have to look at your Tika
> setup on Solr
> to figure out how to avoid having PDFs blow up the
> pipeline.  You
> should confirm first that Tika is indeed throwing an
> exception when a
> PDF is sent to it, of course, and that Solr is closing the
> http
> connection under those conditions.
> 
> Thanks,
> Karl
> 
> On Tue, Aug 14, 2012 at 1:28 AM, Karl Wright <[email protected]>
> wrote:
> > There are two different issues here.  The first
> one is that you are
> > having a connection close on you; not sure the reason
> why, but could
> > potentially be caused by a Tika exception in
> Solr.  The second is that
> > the refactored WorkerThread code I checked in Sunday
> might have a bug
> > in handling exceptions of this kind.
> >
> > I'll have a look at these and get back to you shortly.
> >
> > Karl
> >
> > On Mon, Aug 13, 2012 at 10:28 PM, Ahmet Arslan <[email protected]>
> wrote:
> >> If I modify my Path Rules to index only *.doc and
> *.docx files, I can re-index over and over without
> restarting anything. Everything works fine.
> >> It seems that there is a problem with non text
> extractable files.
> >>
> >> /Documents/*.doc       
> file    include
> >> /Documents/*.docx   
>    file    include
> >>
> >> --- On Tue, 8/14/12, Ahmet Arslan <[email protected]>
> wrote:
> >>
> >>> From: Ahmet Arslan <[email protected]>
> >>> Subject: Re: SharePoint: Error closing
> connection to file
> >>> To: [email protected]
> >>> Date: Tuesday, August 14, 2012, 5:20 AM
> >>>
> >>> Also after this, when i hit "View Repository
> Connection
> >>> Status" i get :
> >>>
> >>> Got an unknown remote exception accessing site
> - axis fault
> >>> = Server.userException, detail =
> >>> java.net.UnknownHostException: null
> >>>
> >>> I restart mcf, I get "Connection status:
> Connection working"
> >>> at "View Repository Connection Status" page.
> >>>
> >>> --- On Tue, 8/14/12, Ahmet Arslan <[email protected]>
> >>> wrote:
> >>>
> >>> > From: Ahmet Arslan <[email protected]>
> >>> > Subject: SharePoint: Error closing
> connection to file
> >>> > To: [email protected]
> >>> > Date: Tuesday, August 14, 2012, 5:18 AM
> >>> > Hello,
> >>> >
> >>> > Using solr output connector and SP2010
> Repository
> >>> connector,
> >>> > I am indexing a document library named
> Documents. This
> >>> > library has some scanned pdf documents.
> Very First
> >>> crawl
> >>> > indexes all 91 docs.
> >>> > When I hit "Re-ingest all associated
> documents" and
> >>> start
> >>> > second crawl, I get : "Error: Unexpected
> jobqueue
> >>> status -
> >>> > record id 1344907007021, expecting active
> status, saw
> >>> 3"
> >>> >
> >>> > Here is the stack trace:
> >>> > When i look at 
> >>> > http://iknowtest/Documents/ik_docs/vize_evraklari/ticaret_sicil_gazetesi.pdf,
> >>> > it is an image (scanned) pdf.
> >>> >
> >>> > WARN 2012-08-14 05:13:22,068 (Worker
> thread '39') -
> >>> > SharePoint: Error closing connection to
> file 
> 'http://iknowtest/Documents/ik_docs/vize_evraklari/ticaret_sicil_gazetesi.pdf':
> >>> > Connection reset
> >>> > java.net.SocketException: Connection
> reset
> >>> >     at
> >>> >
> >>>
> java.net.SocketInputStream.read(SocketInputStream.java:113)
> >>> >     at
> >>> >
> >>>
> java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> >>> >     at
> >>> >
> >>>
> java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
> >>> >     at
> >>> >
> >>>
> java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> >>> >     at
> >>> >
> >>>
> org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown
> >>> > Source)
> >>> >     at
> >>> >
> >>>
> org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown
> >>> > Source)
> >>> >     at
> >>> >
> >>>
> org.apache.commons.httpclient.ChunkedInputStream.exhaustInputStream(Unknown
> >>> > Source)
> >>> >     at
> >>> >
> >>>
> org.apache.commons.httpclient.ContentLengthInputStream.close(Unknown
> >>> > Source)
> >>> >     at
> >>> >
> >>>
> java.io.FilterInputStream.close(FilterInputStream.java:155)
> >>> >     at
> >>> >
> >>>
> org.apache.commons.httpclient.AutoCloseInputStream.notifyWatcher(Unknown
> >>> > Source)
> >>> >     at
> >>> >
> >>>
> org.apache.commons.httpclient.AutoCloseInputStream.close(Unknown
> >>> > Source)
> >>> >     at
> >>> >
> >>>
> org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1457)
> >>> >     at
> >>> >
> >>>
> org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
> >>> >     at
> >>> >
> >>>
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:549)
> >>> > DEBUG 2012-08-14 05:13:22,072 (Worker
> thread '42') -
> >>> > SharePoint: Path attribute name is null
> >>> >  WARN 2012-08-14 05:13:22,081 (Worker
> thread '39')
> >>> -
> >>> > SharePoint: IOException thrown: Connection
> reset
> >>> > java.net.SocketException: Connection
> reset
> >>> >     at
> >>> >
> >>>
> java.net.SocketInputStream.read(SocketInputStream.java:168)
> >>> >     at
> >>> >
> >>>
> java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> >>> >     at
> >>> >
> >>>
> java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> >>> >     at
> >>> >
> >>>
> org.apache.commons.httpclient.ContentLengthInputStream.read(Unknown
> >>> > Source)
> >>> >     at
> >>> >
> >>>
> java.io.FilterInputStream.read(FilterInputStream.java:116)
> >>> >     at
> >>> >
> >>>
> org.apache.commons.httpclient.AutoCloseInputStream.read(Unknown
> >>> > Source)
> >>> >     at
> >>> >
> >>>
> java.io.FilterInputStream.read(FilterInputStream.java:90)
> >>> >     at
> >>> >
> >>>
> org.apache.commons.httpclient.AutoCloseInputStream.read(Unknown
> >>> > Source)
> >>> >     at
> >>> >
> >>>
> org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1447)
> >>> >     at
> >>> >
> >>>
> org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
> >>> >     at
> >>> >
> >>>
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:549)
> >>> >  WARN 2012-08-14 05:13:22,186 (Worker
> thread '39')
> >>> - Service
> >>> > interruption reported for job
> 1344906886879 connection
> >>> > 'SP2010': SharePoint is down attempting to
> read 
> 'http://iknowtest/Documents/ik_docs/vize_evraklari/ticaret_sicil_gazetesi.pdf',
> >>> > retrying: Connection reset
> >>> > ERROR 2012-08-14 05:13:22,230 (Worker
> thread '39') -
> >>> > Exception tossed: Unexpected jobqueue
> status - record
> >>> id
> >>> > 1344907007021, expecting active status,
> saw 3
> >>> >
> >>>
> org.apache.manifoldcf.core.interfaces.ManifoldCFException:
> >>> > Unexpected jobqueue status - record id
> 1344907007021,
> >>> > expecting active status, saw 3
> >>> >     at
> >>> >
> >>>
> org.apache.manifoldcf.crawler.jobs.JobQueue.updateCompletedRecord(JobQueue.java:711)
> >>> >     at
> >>> >
> >>>
> org.apache.manifoldcf.crawler.jobs.JobManager.markDocumentCompletedMultiple(JobManager.java:2435)
> >>> >     at
> >>> >
> >>>
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:745)
> >>> >
> >>>
>
