I've also checked in the proposed change, if you care to try it. We're having network issues here this morning, though, so I can't update the ticket yet.
Karl

On Thu, May 19, 2011 at 8:35 AM, Karl Wright <daddy...@gmail.com> wrote:
> CONNECTORS-200 is the ticket.
> Karl
>
> On Thu, May 19, 2011 at 8:04 AM, Karl Wright <daddy...@gmail.com> wrote:
>> This should be enough.
>>
>> I'll open a ticket. The changes to the solr connector are trivial; I
>> can do them and check them in, if someone is willing to try it out for
>> real.
>>
>> Karl
>>
>> On Thu, May 19, 2011 at 6:11 AM, Erlend Garåsen <e.f.gara...@usit.uio.no>
>> wrote:
>>>
>>> Here's what I found in my simple history logs:
>>> org.apache.tika.exception.TikaException: TIKA-418: RuntimeException while
>>> getting content for thmx and xps file types
>>>
>>> So, yes, Tika exceptions are stored in the MCF logs, so I guess it should be
>>> possible to find a workaround for this.
>>>
>>> Erlend
>>>
>>> On 19.05.11 12.00, Karl Wright wrote:
>>>>
>>>> There was a Solr ticket created, I believe by Shinichiro.
>>>>
>>>> The question is whether the Solr 500 response has anything in its body
>>>> that could help ManifoldCF recognize a Tika exception. If not, there
>>>> is little the Solr connector can do to detect this case. The problem
>>>> is that you need to look in the Simple History to see what the
>>>> response actually is, and I don't think Shinichiro did that.
>>>>
>>>> Karl
>>>>
>>>> On Thu, May 19, 2011 at 4:42 AM, Erlend Garåsen <e.f.gara...@usit.uio.no>
>>>> wrote:
>>>>>
>>>>> Do we have an MCF ticket for this issue yet? Or is it rather a Solr issue?
>>>>>
>>>>> I agree with Karl. We should look for a TikaException and then tell MCF
>>>>> to skip the affected documents. But maybe this should just be a temporary
>>>>> fix until it has been fixed in Solr Cell.
>>>>>
>>>>> Exactly the same happens if Tika cannot parse a document which it does
>>>>> not support. Solr/Solr Cell returns a 500 server error, causing MCF to
>>>>> retry over and over again:
>>>>> [2011-05-18 17:39:34.104] [] webapp=/solr path=/update/extract
>>>>> params={literal.id=http://foreninger.uio.no/akademikerne/Tillitsvalgte_i_akademikerforeninger_files/themedata.thmx}
>>>>> status=500 QTime=5
>>>>> [2011-05-18 17:39:39.102] {} 0 4
>>>>> [2011-05-18 17:39:39.103] org.apache.solr.common.SolrException:
>>>>> org.apache.tika.exception.TikaException: TIKA-418: RuntimeException while
>>>>> getting content for thmx and xps file types
>>>>>
>>>>> And finally, the job just aborts:
>>>>> Exception tossed: Repeated service interruptions - failure processing
>>>>> document: Ingestion HTTP error code 500
>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated
>>>>> service interruptions - failure processing document: Ingestion HTTP
>>>>> error code 500
>>>>>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:630)
>>>>> Caused by: org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>>>>> Ingestion HTTP error code 500
>>>>>     at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:1362)
>>>>>
>>>>> I guess I can find a workaround since I have created my own
>>>>> ExtractingRequestHandler in order to support language detection etc., but
>>>>> I think MCF should act differently when the underlying cause is a
>>>>> TikaException.
>>>>>
>>>>> Erlend
>>>>>
>>>>>
>>>>> On 27.04.11 12.25, Karl Wright wrote:
>>>>>>
>>>>>> If I recall, it treats the 400 response as meaning "this document
>>>>>> should be skipped", and it treats the 500 response as meaning "this
>>>>>> document should be retried because I have absolutely no idea what
>>>>>> happened". However, we could modify the code for the 500 response to
>>>>>> look at the content of the response as well, and look for a string in
>>>>>> it that would give us a clue, such as "TikaException". If we see a
>>>>>> TikaException, we could have it conclude "this document should be
>>>>>> skipped". That was what I was thinking.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Wed, Apr 27, 2011 at 6:00 AM, Shinichiro Abe
>>>>>> <shinichiro.ab...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi. Thank you for your reply.
>>>>>>>
>>>>>>> It seems that Solr's ExtractingRequestHandler returns the same HTTP
>>>>>>> response (SERVER_ERROR (500)) whenever an error occurs.
>>>>>>> I'll try to open a ticket for Solr.
>>>>>>>
>>>>>>> Is it correct that MCF retries crawling when it receives a
>>>>>>> 500-level response, but not a 400-level response?
>>>>>>>
>>>>>>> Thank you.
>>>>>>> Shinichiro Abe
>>>>>>>
>>>>>>>
>>>>>>> On 2011/04/27, at 14:45, Karl Wright wrote:
>>>>>>>
>>>>>>>> So the 500 error is occurring because Solr is throwing an exception at
>>>>>>>> indexing time, is that correct?
>>>>>>>>
>>>>>>>> If this is correct, then here's my take. (1) A 500 error is a nasty
>>>>>>>> error that Solr should not be returning under normal conditions. (2)
>>>>>>>> A password-protected PDF is not what I would consider exceptional, so
>>>>>>>> Tika should not be throwing an exception when it sees it, merely (at
>>>>>>>> worst) logging an error and continuing. However, having said that,
>>>>>>>> output connectors in ManifoldCF can make the decision to never retry
>>>>>>>> the document, by returning a certain status, provided the connector
>>>>>>>> can figure out that the error warrants this treatment.
>>>>>>>>
>>>>>>>> My suggestion is therefore the following. First, we should open a
>>>>>>>> ticket for Solr about this. Second, if you can see the error output
>>>>>>>> from the Simple History for a TikaException being thrown in Solr, we
>>>>>>>> can look for that text in the response from Solr and perhaps modify
>>>>>>>> the Solr Connector to detect the case. If you could open a ManifoldCF
>>>>>>>> ticket and include that text I'd be very grateful.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Tue, Apr 26, 2011 at 10:53 PM, Shinichiro Abe
>>>>>>>> <shinichiro.ab...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Hello.
>>>>>>>>>
>>>>>>>>> There are PDF and Office files that are protected by a read
>>>>>>>>> password. We cannot read those files if we do not know their
>>>>>>>>> password.
>>>>>>>>>
>>>>>>>>> Now, an MCF job starts to crawl the filesystem repository and post
>>>>>>>>> to Solr. Document ingestion of non-protected files is done
>>>>>>>>> successfully, but ingestion of one of the protected files does not
>>>>>>>>> succeed, even as the job is processed beyond the Retry Limit.
>>>>>>>>> During that time, it is logging a 500 result code in the simple
>>>>>>>>> history. (Solr throws a TikaException, caused by PDFBox or Apache
>>>>>>>>> POI, because it cannot read protected documents.)
>>>>>>>>>
>>>>>>>>> When I ran that test with continuous crawling, rather than a
>>>>>>>>> simple one-time crawl, the job stopped halfway and logged the
>>>>>>>>> following:
>>>>>>>>> Error: Repeated service interruptions - failure processing document:
>>>>>>>>> Ingestion HTTP error code 500
>>>>>>>>> The job tried to crawl those files many times.
>>>>>>>>>
>>>>>>>>> It seems that a job spends a lot of time and resources on
>>>>>>>>> protected files. So I want to find a way to quickly skip reading
>>>>>>>>> those files.
>>>>>>>>>
>>>>>>>>> In my survey:
>>>>>>>>> Hop filters are not relevant. (right?)
>>>>>>>>> Tika, PDFBox, and POI have a mechanism to decrypt protected
>>>>>>>>> files, but each throws a different exception when given an
>>>>>>>>> invalid password.
>>>>>>>>> One idea that occurs to me, leaving aside whether it is feasible,
>>>>>>>>> is for Solr to return a different result code when protected
>>>>>>>>> files are posted.
>>>>>>>>>
>>>>>>>>> Do you have any ideas?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Shinichiro Abe
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Erlend Garåsen
>>>>> Center for Information Technology Services
>>>>> University of Oslo
>>>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP:
>>>>> 31050
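
For anyone following the thread, the idea discussed above (skip on 400, retry on 500 unless the response body mentions "TikaException") could be sketched roughly as below. This is only an illustration and not the change checked in for CONNECTORS-200: the class, enum, and method names are hypothetical, and treating every other non-2xx status as "retry" is an assumption on my part rather than something stated in the thread.

```java
/**
 * Hypothetical sketch of the proposed handling; NOT the actual
 * ManifoldCF HttpPoster code. All names here are made up for illustration.
 */
final class IngestResponseClassifier {

  /** What the output connector should do with the document. */
  enum Action { ACCEPT, SKIP, RETRY }

  // Marker text looked for in the body of a 500 response, as suggested in the thread.
  private static final String TIKA_MARKER = "TikaException";

  static Action classify(int httpStatus, String responseBody) {
    if (httpStatus >= 200 && httpStatus < 300) {
      return Action.ACCEPT;
    }
    if (httpStatus == 400) {
      // Per the thread: a 400 means "this document should be skipped".
      return Action.SKIP;
    }
    if (httpStatus == 500 && responseBody != null && responseBody.contains(TIKA_MARKER)) {
      // Tika could not parse the document; retrying will not help.
      return Action.SKIP;
    }
    // Assumption: anything else is treated as a possibly transient failure.
    return Action.RETRY;
  }

  public static void main(String[] args) {
    String body = "org.apache.solr.common.SolrException: "
        + "org.apache.tika.exception.TikaException: TIKA-418: RuntimeException while "
        + "getting content for thmx and xps file types";
    System.out.println(classify(500, body));                    // prints SKIP
    System.out.println(classify(500, "Internal Server Error")); // prints RETRY
  }
}
```

A plain substring match keeps the check independent of how Solr formats the rest of the error page, at the cost of also skipping any document whose 500 response merely mentions that string.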