Re: Treatment of protected files

Karl Wright Thu, 19 May 2011 05:40:20 -0700

I've also checked in the proposed change, if you care to try it.
We're having network issues here this morning so I can't seem to
update the ticket though.


Karl

On Thu, May 19, 2011 at 8:35 AM, Karl Wright <daddy...@gmail.com> wrote:
> CONNECTORS-200 is the ticket.
> Karl
>
> On Thu, May 19, 2011 at 8:04 AM, Karl Wright <daddy...@gmail.com> wrote:
>> This should be enough.
>>
>> I'll open a ticket.  The changes to the solr connector are trivial; I
>> can do them and check them in, if someone is willing to try it out for
>> real.
>>
>> Karl
>>
>> On Thu, May 19, 2011 at 6:11 AM, Erlend Garåsen <e.f.gara...@usit.uio.no> 
>> wrote:
>>>
>>> Here's what I found in my simple history logs:
>>> org.apache.tika.exception.TikaException: TIKA-418: RuntimeException while
>>> getting content for thmx and xps file types
>>>
>>> So, yes, Tika exceptions are stored in the MCF logs, so I guess it should be
>>> possible to find a workaround for this.
>>>
>>> Erlend
>>>
>>> On 19.05.11 12.00, Karl Wright wrote:
>>>>
>>>> There was a Solr ticket created I believe by Shinichiro.
>>>>
>>>> The question is whether the Solr 500 response has anything in its body
>>>> that could help ManifoldCF recognize a Tika exception.  If not there
>>>> is little the Solr connector can do to detect this case.  The problem
>>>> is that you need to look in the Simple History to see what the
>>>> response actually is, and I don't think Shinichiro did that.
>>>>
>>>> Karl
>>>>
>>>> On Thu, May 19, 2011 at 4:42 AM, Erlend Garåsen<e.f.gara...@usit.uio.no>
>>>>  wrote:
>>>>>
>>>>> Do we have an MCF ticket for this issue yet? Or is it rather a Solr issue?
>>>>>
>>>>> I agree with Karl. We should look for a TikaException and then tell MCF
>>>>> to skip the affected documents. But maybe this should just be a temporary
>>>>> fix until it has been fixed in Solr Cell.
>>>>>
>>>>> Exactly the same happens if Tika cannot parse a document which it does
>>>>> not support. Solr/Solr Cell returns a 500 server error, causing MCF to
>>>>> retry over and over again:
>>>>> [2011-05-18 17:39:34.104] [] webapp=/solr path=/update/extract
>>>>> params={literal.id=http://foreninger.uio.no/akademikerne/Tillitsvalgte_i_akademikerforeninger_files/themedata.thmx}
>>>>> status=500 QTime=5
>>>>> [2011-05-18 17:39:39.102] {} 0 4
>>>>> [2011-05-18 17:39:39.103] org.apache.solr.common.SolrException:
>>>>> org.apache.tika.exception.TikaException: TIKA-418: RuntimeException while
>>>>> getting content for thmx and xps file types
>>>>>
>>>>> And finally, the job just aborts:
>>>>> Exception tossed: Repeated service interruptions - failure processing
>>>>> document: Ingestion HTTP error code 500
>>>>> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated
>>>>> service interruptions - failure processing document: Ingestion HTTP error
>>>>> code 500
>>>>>        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:630)
>>>>> Caused by: org.apache.manifoldcf.core.interfaces.ManifoldCFException:
>>>>> Ingestion HTTP error code 500
>>>>>        at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:1362)
>>>>>
>>>>> I guess I can find a workaround since I have created my own
>>>>> ExtractingRequestHandler in order to support language detection etc.,
>>>>> but I think MCF should act differently when the underlying cause is a
>>>>> TikaException.
>>>>>
>>>>> Erlend
>>>>>
>>>>>
>>>>> On 27.04.11 12.25, Karl Wright wrote:
>>>>>>
>>>>>> If I recall, it treats the 400 response as meaning "this document
>>>>>> should be skipped", and it treats the 500 response as meaning "this
>>>>>> document should be retried because I have absolutely no idea what
>>>>>> happened".  However, we could modify the code for the 500 response to
>>>>>> look at the content of the response as well, and look for a string in
>>>>>> it that would give us a clue, such as "TikaException".  If we see a
>>>>>> TikaException, we could have it conclude "this document should be
>>>>>> skipped".  That was what I was thinking.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Wed, Apr 27, 2011 at 6:00 AM, Shinichiro Abe
>>>>>> <shinichiro.ab...@gmail.com>    wrote:
>>>>>>>
>>>>>>> Hi. Thank you for your reply.
>>>>>>>
>>>>>>> It seems that Solr.ExtractingRequestHandler returns the same HTTP
>>>>>>> response (SERVER_ERROR (500)) whenever an error occurs.
>>>>>>> I'll try to open a ticket for Solr.
>>>>>>>
>>>>>>> Is it correct that MCF retries the crawl when it receives a 500-level
>>>>>>> response, but not when it receives a 400-level response?
>>>>>>>
>>>>>>> Thank you.
>>>>>>> Shinichiro Abe
>>>>>>>
>>>>>>>
>>>>>>> On 2011/04/27, at 14:45, Karl Wright wrote:
>>>>>>>
>>>>>>>> So the 500 error is occurring because Solr is throwing an exception at
>>>>>>>> indexing time, is that correct?
>>>>>>>>
>>>>>>>> If this is correct, then here's my take.  (1) A 500 error is a nasty
>>>>>>>> error that Solr should not be returning under normal conditions.  (2)
>>>>>>>> A password-protected PDF is not what I would consider exceptional, so
>>>>>>>> Tika should not be throwing an exception when it sees it, merely (at
>>>>>>>> worst) logging an error and continuing.  However, having said that,
>>>>>>>> output connectors in ManifoldCF can make the decision to never retry
>>>>>>>> the document, by returning a certain status, provided the connector
>>>>>>>> can figure out that the error warrants this treatment.
>>>>>>>>
>>>>>>>> My suggestion is therefore the following.  First, we should open a
>>>>>>>> ticket for Solr about this.  Second, if you can see the error output
>>>>>>>> from the Simple History for a TikaException being thrown in Solr, we
>>>>>>>> can look for that text in the response from Solr and perhaps modify
>>>>>>>> the Solr Connector to detect the case.  If you could open a ManifoldCF
>>>>>>>> ticket and include that text I'd be very grateful.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Tue, Apr 26, 2011 at 10:53 PM, Shinichiro Abe
>>>>>>>> <shinichiro.ab...@gmail.com>    wrote:
>>>>>>>>>
>>>>>>>>> Hello.
>>>>>>>>>
>>>>>>>>> There are PDF and Office files that are protected by a read
>>>>>>>>> password. We do not need to read those files if we do not know
>>>>>>>>> their passwords.
>>>>>>>>>
>>>>>>>>> Now, an MCF job crawls the filesystem repository and posts to
>>>>>>>>> Solr. Ingestion of non-protected files succeeds, but ingestion of a
>>>>>>>>> protected file keeps failing until the job exceeds the Retry Limit.
>>>>>>>>> During that time, a 500 result code is logged in the Simple History.
>>>>>>>>> (Solr throws a TikaException, caused by PDFBox or Apache POI, because
>>>>>>>>> it cannot read protected documents.)
>>>>>>>>>
>>>>>>>>> When I ran that test with continuous crawling, rather than a single
>>>>>>>>> crawl, the job stopped partway and logged the following:
>>>>>>>>> Error: Repeated service interruptions - failure processing document:
>>>>>>>>> Ingestion HTTP error code 500
>>>>>>>>> The job tried to crawl those files many times.
>>>>>>>>>
>>>>>>>>> It seems that a job spends a lot of time and resources on
>>>>>>>>> protected files.
>>>>>>>>> So I want to find a way to skip those files quickly.
>>>>>>>>>
>>>>>>>>> In my survey:
>>>>>>>>> Hop filters are not relevant (right?).
>>>>>>>>> Tika, PDFBox, and POI have mechanisms to decrypt protected files,
>>>>>>>>> but each throws a different exception when given an invalid
>>>>>>>>> password.
>>>>>>>>> One idea, feasible or not, is for Solr to return a different result
>>>>>>>>> code when protected files are posted.
>>>>>>>>>
>>>>>>>>> Do you have any ideas?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Shinichiro Abe
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Erlend Garåsen
>>>>> Center for Information Technology Services
>>>>> University of Oslo
>>>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP:
>>>>> 31050
>>>>>
>>>
>>>
>>> --
>>> Erlend Garåsen
>>> Center for Information Technology Services
>>> University of Oslo
>>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>>>
>>
>
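
The decision rule discussed in this thread (a 400 response means skip, a
500 response means retry, unless the 500 body mentions "TikaException", in
which case skip) could be sketched roughly as below. This is only an
illustration, not the change checked in for CONNECTORS-200: the class name
IngestResponseClassifier, the Action enum, and the classify method are all
hypothetical, and the sketch assumes the Solr response body is available as
a String and that all 4xx codes are treated like 400.

```java
// Hypothetical sketch; not the actual HttpPoster code from ManifoldCF.
public class IngestResponseClassifier {

    /** What the output connector should do with a document after Solr responds. */
    public enum Action { ACCEPT, SKIP, RETRY }

    // Marker string suggested in the thread; assumed to appear in the 500 body.
    private static final String TIKA_MARKER = "TikaException";

    public static Action classify(int statusCode, String responseBody) {
        if (statusCode >= 200 && statusCode < 300) {
            // Solr accepted the document.
            return Action.ACCEPT;
        }
        if (statusCode >= 400 && statusCode < 500) {
            // Per the thread: a 400 response means "skip this document".
            // Extending this to every 4xx code is an assumption of this sketch.
            return Action.SKIP;
        }
        if (statusCode >= 500 && responseBody != null
                && responseBody.contains(TIKA_MARKER)) {
            // Tika could not parse the document, so retrying will not help.
            return Action.SKIP;
        }
        // Anything else is an unknown failure: keep the retry behavior.
        return Action.RETRY;
    }

    public static void main(String[] args) {
        // Body text taken from the Solr log quoted earlier in the thread.
        String tikaBody = "org.apache.solr.common.SolrException: "
                + "org.apache.tika.exception.TikaException: TIKA-418: "
                + "RuntimeException while getting content for thmx and xps file types";
        System.out.println(classify(500, tikaBody));                  // SKIP
        System.out.println(classify(500, "unexpected server error")); // RETRY
        System.out.println(classify(400, ""));                        // SKIP
    }
}
```

A plain substring match is fragile if Solr changes its error text, which is
one reason the thread treats this as a temporary measure until Solr Cell
reports such failures differently.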
