Re: [Dspace-tech] Indexing of scanned PDFs

Tim Donohue Wed, 23 Apr 2008 08:04:48 -0700

A bit more info (though my answer is similar to Graham's).

It's hard to tell exactly what is going on here.  By default, the PDFBox 
software that DSpace uses to index PDFs should be able to index a PDF 
that has embedded OCR text (it has worked for us this way).  However, 
there are admittedly bugs in the underlying PDFBox software that 
folks have run into in the past (myself included).


Michael, you may want to check a few things:

(1) Make sure that you are running the 'filter-media' script each 
night.  This is what does the full-text indexing of PDF, Word and HTML 
files.
(2) If you are running 'filter-media', you may want to set up your cron 
job to write its output to a log file, so you can see what errors may be 
occurring.  Something similar to this:
[dspace]/bin/filter-media  > [dspace]/log/filter.log 2>&1

If you are running filter-media, that log file should tell you what is 
erroring out.  If you don't understand the error message, you can send 
it to dspace-tech and we can try to help you debug it.
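
A minimal sketch of the logging setup described above, in case it helps. 
The redirection is the same as in the cron line; everything else is an 
assumption for illustration: the DSPACE_DIR and LOG variables, the 
function names, and the guess that problem lines will contain the word 
"error" or "exception" (your log may be worded differently).

```shell
#!/bin/sh
# Sketch only. DSPACE_DIR is a placeholder for your [dspace] install directory.
DSPACE_DIR=${DSPACE_DIR:-/dspace}
LOG=${LOG:-$DSPACE_DIR/log/filter.log}

# Run filter-media, sending both stdout and stderr to the log file
# (the same "> file 2>&1" redirection as the cron line above).
run_filter_media() {
    "$DSPACE_DIR/bin/filter-media" > "$LOG" 2>&1
}

# Print the lines of a log file (prefixed with their line numbers) that
# mention "error" or "exception", ignoring case.
# Usage: show_filter_errors /path/to/filter.log
show_filter_errors() {
    grep -n -i -E 'error|exception' "$1"
}
```

The idea is that cron calls run_filter_media each night, and you run 
show_filter_errors "$LOG" afterwards to see whether anything failed.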

Finally, as Graham mentioned, there are a few common errors with that 
PDFBox software which we've now got workarounds for in DSpace 1.5. 
Namely these two configs:

pdffilter.largepdfs = true
(If true, it writes larger PDFs to a temp file as it indexes them...this 
is slower, but helps ensure that PDFBox software doesn't eat up all your 
memory)

pdffilter.skiponmemoryexception = true
(If true, it skips any PDFs that still result in an Out of Memory error 
from PDFBox...these PDFs simply won't be indexed until the PDFBox 
software we are using fixes some of its memory usage problems)
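
Since these two settings have to be added by hand, here is a rough 
sketch of one way to do it without duplicating a line that is already 
there. The function name ensure_setting and the CFG variable are 
hypothetical, and I'm assuming your config lives at 
[dspace]/config/dspace.cfg; back the file up before trying anything 
like this.

```shell
#!/bin/sh
# Sketch only: append "KEY = VALUE" to a config file unless an uncommented
# line already sets KEY. ensure_setting is a hypothetical helper name.
# Usage: ensure_setting FILE KEY VALUE
ensure_setting() {
    cfg=$1
    key=$2
    value=$3
    # Escape the dots in the key so grep treats them literally.
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    # A line counts as "set" if it starts with the key (optionally indented)
    # followed by "=". Lines commented out with "#" do not match.
    if grep -q -E "^[[:space:]]*${pattern}[[:space:]]*=" "$cfg"; then
        echo "$key already set in $cfg"
    else
        printf '%s = %s\n' "$key" "$value" >> "$cfg"
        echo "added $key to $cfg"
    fi
}

# Example calls (CFG is an assumed path; substitute your own):
#   CFG=/dspace/config/dspace.cfg
#   ensure_setting "$CFG" pdffilter.largepdfs true
#   ensure_setting "$CFG" pdffilter.skiponmemoryexception true
```

Note that this only checks whether the key is present; it does not 
change a value that is already set to something else.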

BTW...Graham, those two 'pdffilter' settings didn't make it into the 
DSpace 1.5 dspace.cfg file!  We need to push those into the 1.5.1 
bug-fix release!

Hope that helps!

- Tim





Graham Triggs wrote:
> Dorothea Salo wrote:
>> You didn't say what version of DSpace you're running (and honestly,
>> I'm not completely sure this was fixed in 1.5 -- anybody know?),
>> but... one thing that may be happening is that the filter-media cron
>> job is dying. Since it's written without error-recovery, it stops dead
>> at the first file it thinks it should be able to handle but can't.
>>
>> Run it from the command-line and see if it errors out. If I'm right,
>> there's no obvious workaround I'm aware of, though somebody (Tim?) may
>> have hacked one.
>>
>> Dorothea
>>
> 
> The filter-media in 1.5 is a bit more robust. If it hits an Exception 
> when dealing with one file, it will attempt to clean itself up a bit and 
> carry on with the next one.
> 
> In the cases where PDF extraction is failing due to a PDFBox bug, this 
> is usually good enough for it to finish the filtering normally 
> (excluding the file that caused the problem).
> 
> However, I can't guarantee that will be enough in this case. But then 
> judging by Mike's message, it's possible that filter-media wasn't even 
> run at all. (only index-all is mentioned)
> 
> G
> 
>  
>  
> _______________________________________________
> DSpace-tech mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dspace-tech
> 

-- 

========================================
Tim Donohue
Research Programmer, Illinois Digital Environment for
Access to Learning and Scholarship (IDEALS)
135 Grainger Engineering Library
University of Illinois at Urbana-Champaign

email: [EMAIL PROTECTED]
web:   http://www.ideals.uiuc.edu
phone: (217) 333-4648
fax:   (217) 244-7764
========================================
