Hello.

There are PDF and Office files that are protected by a read password.
We do not need to read those files when we do not know their passwords.

Now, an MCF job starts to crawl the filesystem repository and post documents to Solr.
Ingestion of the non-protected files succeeds, but a protected file keeps
failing until the job goes past the retry limit.
During that time, a 500 result code is logged in the simple history.
(Solr throws a TikaException, caused by PDFBox or Apache POI, because it
cannot read the protected documents.)

When I ran the same test as a continuous crawl rather than a single crawl,
the job aborted partway through and logged the following:
Error: Repeated service interruptions - failure processing document: Ingestion HTTP error code 500
The job had tried to crawl those files many times.

It seems that the job spends a lot of time and resources handling the
protected files, so I want to find a way to skip reading those files quickly.

From my survey:
Hop filters are not relevant here. (Right?)
Tika, PDFBox, and POI each have a mechanism to decrypt protected files,
but each throws a different exception when given an invalid password.
Apart from whether it is feasible, one idea that occurs to me is for Solr
to return a different result code when a protected file is posted.
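
To illustrate the kind of quick skip I have in mind, here is a rough sketch
(outside of MCF; the class name, helper name, and file path are only examples,
not anything that exists). It runs Tika locally and treats an
EncryptedDocumentException, or a TikaException that wraps the PDFBox/POI
password error in older Tika versions, as "protected, so skip it" before the
file is ever posted to Solr:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.exception.EncryptedDocumentException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ProtectedFileCheck {

    // Returns true when Tika reports the file as password protected.
    // Treating any TikaException as "protected" is a crude assumption:
    // older Tika versions only wrap the PDFBox/POI password error, so
    // other parse failures would get skipped as well.
    static boolean isPasswordProtected(Path file) throws Exception {
        try (InputStream in = Files.newInputStream(file)) {
            new AutoDetectParser().parse(
                in, new BodyContentHandler(-1), new Metadata(), new ParseContext());
            return false;
        } catch (EncryptedDocumentException e) {
            return true;
        } catch (TikaException e) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        // Example path only: any read-password-protected PDF or Office file.
        Path sample = Paths.get("/tmp/protected.pdf");
        if (isPasswordProtected(sample)) {
            System.out.println("skip, do not post to Solr: " + sample);
        } else {
            System.out.println("ok to post: " + sample);
        }
    }
}

Of course this parses every non-protected document twice (once locally and
once in Solr Cell), so it is only a crude workaround; that is why I wonder
whether there is a better way on the MCF or Solr side.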

Do you have any ideas?

Regards,
Shinichiro Abe
