Here's what I found in my simple history logs:
org.apache.tika.exception.TikaException: TIKA-418: RuntimeException
while getting content for thmx and xps file types
So, yes, Tika exceptions are stored in the MCF logs, and I guess it
should therefore be possible to find a workaround for this.
Erlend
On 19.05.11 12.00, Karl Wright wrote:
There was a Solr ticket created, I believe, by Shinichiro.
The question is whether the Solr 500 response has anything in its body
that could help ManifoldCF recognize a Tika exception. If not, there
is little the Solr connector can do to detect this case. The problem
is that you need to look in the Simple History to see what the
response actually is, and I don't think Shinichiro did that.
Karl
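To make the question above concrete, here is a minimal sketch of how the
body of an error response might be captured and checked. It is not taken
from the Solr connector: the class ErrorBodyReader and both of its methods
are hypothetical names, and it assumes the body is UTF-8 text that
contains the exception class name.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of ManifoldCF or Solr.
final class ErrorBodyReader {
    private ErrorBodyReader() {}

    /**
     * Reads an error-response stream fully and decodes it as UTF-8
     * (the encoding is an assumption). A null stream, meaning the
     * server sent no body, yields an empty string.
     */
    static String readBody(InputStream errorStream) {
        if (errorStream == null) {
            return "";
        }
        try (InputStream in = errorStream) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns true if the body text names a Tika exception. */
    static boolean mentionsTikaException(String body) {
        return body != null && body.contains("TikaException");
    }
}
```

If the client were built on java.net.HttpURLConnection (an assumption
about the transport), the stream would come from getErrorStream(), which
can return null when the server sends no body. In that case there is
nothing to inspect, which is exactly the situation described above.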
On Thu, May 19, 2011 at 4:42 AM, Erlend Garåsen <e.f.gara...@usit.uio.no> wrote:
Do we have an MCF ticket for this issue yet? Or is it rather a Solr issue?
I agree with Karl. We should look for a TikaException and then tell MCF to
skip the affected documents. But maybe this should just be a temporary fix
until it has been fixed in Solr Cell.
Exactly the same happens if Tika cannot parse a document which it does not
support. Solr/Solr Cell returns a 500 server error, causing MCF to retry
over and over again:
[2011-05-18 17:39:34.104] [] webapp=/solr path=/update/extract params={literal.id=http://foreninger.uio.no/akademikerne/Tillitsvalgte_i_akademikerforeninger_files/themedata.thmx} status=500 QTime=5
[2011-05-18 17:39:39.102] {} 0 4
[2011-05-18 17:39:39.103] org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-418: RuntimeException while getting content for thmx and xps file types
And finally, the job just aborts:
Exception tossed: Repeated service interruptions - failure processing document: Ingestion HTTP error code 500
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Ingestion HTTP error code 500
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:630)
Caused by: org.apache.manifoldcf.core.interfaces.ManifoldCFException: Ingestion HTTP error code 500
    at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:1362)
I guess I can find a workaround since I have created my own
ExtractingRequestHandler in order to support language detection etc., but I
think MCF should act differently when the underlying cause is a
TikaException.
Erlend
On 27.04.11 12.25, Karl Wright wrote:
If I recall, it treats the 400 response as meaning "this document
should be skipped", and it treats the 500 response as meaning "this
document should be retried because I have absolutely no idea what
happened". However, we could modify the code for the 500 response to
look at the content of the response as well, and look for a string in
it that would give us a clue, such as "TikaException". If we see a
TikaException, we could have it conclude "this document should be
skipped". That was what I was thinking.
Karl
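The rule described above can be sketched as a small decision function.
This is only an illustration under stated assumptions, not the actual
Solr connector code: the class IngestResponsePolicy, the Action enum and
the decide method are hypothetical names, and treating 200 as success and
every other unlisted status as "retry" is an assumption made for the
sketch.

```java
// Hypothetical names throughout; this is not the real connector code.
final class IngestResponsePolicy {
    private IngestResponsePolicy() {}

    /** What the crawler should do with the document after a post. */
    enum Action { ACCEPT, SKIP, RETRY }

    /** Marker string suggested in the thread. */
    static final String TIKA_MARKER = "TikaException";

    static Action decide(int statusCode, String responseBody) {
        if (statusCode == 200) {
            // Assumption: 200 means the document was indexed.
            return Action.ACCEPT;
        }
        if (statusCode == 400) {
            // 400: "this document should be skipped".
            return Action.SKIP;
        }
        if (statusCode == 500
                && responseBody != null
                && responseBody.contains(TIKA_MARKER)) {
            // 500 caused by Tika: retrying cannot help, so skip.
            return Action.SKIP;
        }
        // Any other case, including a 500 with no clue in the body:
        // "retry because I have absolutely no idea what happened".
        return Action.RETRY;
    }
}
```

For example, decide(500, "org.apache.tika.exception.TikaException: TIKA-418")
gives SKIP, while decide(500, null) still gives RETRY. A plain substring
match is deliberately crude, and it only works if Solr actually puts the
exception text in the response body.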
On Wed, Apr 27, 2011 at 6:00 AM, Shinichiro Abe
<shinichiro.ab...@gmail.com> wrote:
Hi. Thank you for your reply.
It seems that Solr's ExtractingRequestHandler returns the same HTTP
response (SERVER_ERROR (500)) whenever an error occurs.
I'll try to open a ticket for Solr.
Is it correct that MCF retries crawling when it receives a 500-level
response, but not when it receives a 400-level response?
Thank you.
Shinichiro Abe
On 2011/04/27, at 14:45, Karl Wright wrote:
So the 500 error is occurring because Solr is throwing an exception at
indexing time, is that correct?
If this is correct, then here's my take. (1) A 500 error is a nasty
error that Solr should not be returning under normal conditions. (2)
A password-protected PDF is not what I would consider exceptional, so
Tika should not be throwing an exception when it sees it, merely (at
worst) logging an error and continuing. However, having said that,
output connectors in ManifoldCF can make the decision to never retry
the document, by returning a certain status, provided the connector
can figure out that the error warrants this treatment.
My suggestion is therefore the following. First, we should open a
ticket for Solr about this. Second, if you can see the error output
from the Simple History for a TikaException being thrown in Solr, we
can look for that text in the response from Solr and perhaps modify
the Solr Connector to detect the case. If you could open a ManifoldCF
ticket and include that text I'd be very grateful.
Thanks!
Karl
On Tue, Apr 26, 2011 at 10:53 PM, Shinichiro Abe
<shinichiro.ab...@gmail.com> wrote:
Hello.
There are PDF and Office files that are protected by a read password.
We do not need to read those files if we do not know their passwords.
Now, an MCF job crawls the filesystem repository and posts to Solr.
Ingestion of the non-protected files succeeds, but ingestion of a
protected file keeps failing until the job exceeds the Retry Limit.
During that time, result code 500 is logged in the Simple History.
(Solr throws a TikaException, caused by PDFBox or Apache POI, because
it cannot read protected documents.)
When I ran that test with continuous crawling, rather than a single
one-time crawl, the job stopped partway and logged the following:
Error: Repeated service interruptions - failure processing document:
Ingestion HTTP error code 500
The job tried to crawl those files many times.
It seems that a job spends a lot of time and resources on protected
files.
So I want to find a way to skip those files quickly.
From my investigation:
Hop filters are not relevant (right?).
Tika, PDFBox, and POI have mechanisms to decrypt protected files, but
each throws a different exception when given an invalid password.
One idea, setting aside whether it is feasible, is for Solr to return
a different result code when protected files are posted.
Do you have any ideas?
Regards,
Shinichiro Abe
--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050