Susan,

Although you never mentioned it specifically, it sounds like you are 
talking about filtering of *PDF* documents. Before I get to some of the 
reasons for your current problems, it's worth mentioning a bit of 
background (you may have already heard this before, but I want to be 
sure).

(1) The DSpace filter-media script uses the open source PDFBox
(www.pdfbox.org) tool to perform the actual filtering of PDFs and try
to extract text.

(It's worth noting PDFBox is only called for PDFs...DSpace uses other 
open source tools to parse Word & HTML documents)

(2) If you look closely at the PDFBox site, you'll notice that the
software has *not* had an updated release since Oct 2006.

(3) Unfortunately, there don't seem to be any PDF tools comparable to
PDFBox (if anyone knows of one, we'd love to hear of it).  So, DSpace
has been stuck using the (currently unmaintained) PDFBox software until
something else comes along.

(4) Failures of 'filter-media' to process a PDF are actually failures
of this underlying PDFBox software (since all filter-media does is take
the results from PDFBox and write them out to a .txt file).
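
To make point (4) concrete, here is a minimal Java sketch of that "thin wrapper" idea. It is not the actual DSpace code: `PdfTextExtractor` and `FilterSketch` are hypothetical names standing in for the PDFBox call and for filter-media, and the ".txt" naming is only an assumption for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for the PDFBox call; not a real DSpace or PDFBox type.
interface PdfTextExtractor {
    String extractText(Path pdf) throws IOException;
}

final class FilterSketch {
    // Extracts text from one PDF and writes it next to the source as "<name>.txt".
    static Path filter(Path pdf, PdfTextExtractor extractor) throws IOException {
        // All of the real parsing work happens inside the extractor.
        String text = extractor.extractText(pdf);
        Path out = pdf.resolveSibling(pdf.getFileName() + ".txt");
        Files.write(out, text.getBytes(StandardCharsets.UTF_8));
        return out;
    }
}
```

The point of the sketch is that any exception or empty result comes from the extractor, so the wrapper has no way to "fix" a PDF the extractor cannot parse.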

So, that's a quick background...now on to reasons why PDFBox (and 
filter-media) will sometimes fail:

Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
> Hi,
> 
>      I have spent a lot of time recently working on the filter-media 
> cron, on the issue of the errors that occur when it encounters a 
> document that is not filterable for some reason.  It seems that there 
> are several different reasons why filter-media fails:
> 
> 1.          The document is very large and a *“Java Heap Space”* error 
> occurs.  In the original version of filter-media in DSpace 1.4.2 (I’m 
> not sure about 1.5), this error causes the process to fail.  I know I 
> received an email recently from someone who kindly provided me with some 
> code changes for MediaFilterManager.java that would allow the process to 
> continue when this error occurs instead of failing.

There are some unfortunate known memory issues with the PDFBox software.
I submitted a report of this "Java Heap Space" problem to the creators
of the PDFBox software nearly a year ago (Oct 2007), and have still
received no response to it.  Here's my bug report from the PDFBox site:

http://sourceforge.net/tracker/index.php?func=detail&aid=1805929&group_id=78314&atid=552832

However, as you mentioned, Graham Triggs figured out a workaround for
DSpace.  The "workaround" is not perfect, as all we can really do is
skip over the documents that cause those errors (since there's no way
to get PDFBox to parse them properly).  This workaround is now in DSpace
1.5, and there is a patch for 1.4.2.
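
As a rough illustration of the "skip and continue" idea (this is a sketch, not Graham's actual patch to MediaFilterManager.java), the loop below catches the heap error per document and moves on. `SkipOnHeapError` and the `extractor` function are hypothetical names, with the function standing in for the PDFBox call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class SkipOnHeapError {
    // Runs the extractor over each document name and returns the names
    // that had to be skipped because the heap was exhausted.
    static List<String> filterAll(List<String> docs, Function<String, String> extractor) {
        List<String> skipped = new ArrayList<>();
        for (String doc : docs) {
            try {
                extractor.apply(doc);   // stand-in for the PDFBox parsing step
            } catch (OutOfMemoryError e) {
                // "Java heap space": record the document and keep going
                skipped.add(doc);
            }
        }
        return skipped;
    }
}
```

Catching OutOfMemoryError is normally discouraged in Java, and the JVM is not guaranteed to be in a healthy state afterwards, which is part of why this is only a workaround.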

> 2.          The document is very large, but the process does not fail.  
> Instead, a blank .txt file is created.

A blank .txt file could be caused by one of the following:

(a) PDFBox software failed to parse out any usable text

OR

(b) The PDF itself is *image-based*, and has no plain text (or OCR)
within it.  PDFBox cannot parse image-based PDFs, as it cannot perform
OCR.  (A solution to this problem would be to OCR all PDFs *before*
placing them in DSpace...that way PDFBox *should* be able to extract
that OCR text.)
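
If you want to find the affected documents, one simple option is to flag any extraction result that is empty or whitespace-only. The Java sketch below assumes you already have the extracted text as a string; `BlankOutputCheck` is a hypothetical helper, not part of DSpace or PDFBox.

```java
final class BlankOutputCheck {
    // True when the extracted text is missing or contains only whitespace,
    // i.e. the resulting .txt file would be effectively blank.
    static boolean needsReview(String extractedText) {
        return extractedText == null || extractedText.trim().isEmpty();
    }
}
```

Note that this check cannot tell case (a) from case (b); it only produces a list of PDFs worth inspecting by hand or sending through OCR.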

> 3.          The document contains unreadable characters and cannot be 
> filtered.  These documents are “skipped”.

This is most likely a problem within the PDFBox software.
Again, the 'filter-media' script does nothing "magical" in parsing
PDFs...all the parsing is performed by PDFBox.


>      My question is this:  Does anyone know if there a size limit/cutoff 
> point where a document is TOO large to be filtered??  If there is, then 
> I have no idea what to do about these documents.  The largest document 
> in our repository is 1,862,628,176 bytes or 1.9 GB.  I don’t suppose a 
> .zip file can be filtered???

I'm not aware of a size limit.  Most likely, for larger files PDFBox
will either work or throw a "Java Heap Space" error (in which case
DSpace will skip over that document).

At this point in time, DSpace *doesn't* filter ZIP files.  Currently,
there are only filters built for PDF, Word and HTML.  I've also built a
custom one for OpenOffice.org files (which requires OpenOffice.org
software installed on the box running DSpace...it's not yet available in
out-of-the-box DSpace):

https://services.ideals.uiuc.edu/wiki/bin/view/IDEALS/Internal/OpenOfficeConvert


So, I know this entire message sounds like I'm blaming PDFBox for all of
your PDF problems (in a way I guess I am).  I'm sorry this doesn't
really *resolve* any of the problems you are encountering (but at least
it lets you know you are not alone in these issues).

Personally, because of the lack of activity around the PDFBox software,
I feel DSpace should start to investigate *other means* to filter PDFs.
If anyone out there is aware of software that could *replace* PDFBox,
I think that would be worth closer investigation.  I'd love to help
DSpace get around these filtering issues...we just need to come up with 
some potential solutions.

Thoughts or ideas are welcome.

- Tim

-- 
Tim Donohue
Research Programmer, IDEALS
http://www.ideals.uiuc.edu/
University of Illinois
[EMAIL PROTECTED] | (217) 333-4648

_______________________________________________
DSpace-tech mailing list
https://lists.sourceforge.net/lists/listinfo/dspace-tech
