Susan,

Although you never mentioned it specifically, it sounds like you are talking about filtering of *PDF* documents. Before I get to some of the reasons for your current problems, it's worth covering a bit of background (you may have heard this before, but I want to be sure).
(1) The DSpace filter-media script uses the open source PDFBox
    (www.pdfbox.org) tool to actually filter PDFs and try to extract
    their text. (Note that PDFBox is only called for PDFs; DSpace uses
    other open source tools to parse Word & HTML documents.)
(2) If you look closely at the PDFBox site, you'll notice that the
    software has *not* had an updated release since Oct 2006.
(3) Unfortunately, there don't seem to be any PDF tools comparable to
    PDFBox (if anyone knows of one, we'd love to hear of it). So,
    DSpace has been stuck using the (currently unmaintained) PDFBox
    software until something else comes along.
(4) Failures of 'filter-media' to process a PDF are actually failures
    of the underlying PDFBox software (since all filter-media does is
    take the results from PDFBox and write them out to a .txt file).

So, that's a quick background...now on to reasons why PDFBox (and filter-media) will sometimes fail:

Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:
> Hi,
>
> I have spent a lot of time recently working on the filter-media
> cron, on the issue of the errors that occur when it encounters a
> document that is not filterable for some reason. It seems that there
> are several different reasons why filter-media fails:
>
> 1. The document is very large and a *“Java Heap Space”* error
> occurs. In the original version of filter-media in DSpace 1.4.2 (I’m
> not sure about 1.5), this error causes the process to fail. I know I
> received an email recently from someone who kindly provided me with some
> code changes for MediaFilterManager.java that would allow the process to
> continue when this error occurs instead of failing.

There are some unfortunate known memory issues with the PDFBox software. I submitted a report of this "Java Heap Space" problem to the creators of the PDFBox software nearly a year ago (Oct 2007), and have still received no response to it.
Here's my bug report off the PDFBox site:
http://sourceforge.net/tracker/index.php?func=detail&aid=1805929&group_id=78314&atid=552832

However, as you mentioned, Graham Triggs figured out a workaround for DSpace. The "workaround" is not perfect, as all we can really do is skip over the documents that cause those errors (since there's no way to get PDFBox to parse them properly). This workaround is now in DSpace 1.5, and there is a patch for 1.4.2.

> 2. The document is very large, but the process does not fail.
> Instead, a blank .txt file is created.

A blank .txt file could be caused by one of the following:
(a) The PDFBox software failed to parse out any usable text, OR
(b) The PDF itself is *image-based*, and has no plain text (or OCR)
    within it. PDFBox cannot parse image-based PDFs, as it cannot
    perform OCR. (A solution to this problem would be to OCR all PDFs
    *before* placing them in DSpace...that way PDFBox *should* be able
    to extract that OCR text.)

> 3. The document contains unreadable characters and cannot be
> filtered. These documents are “skipped”.

This sounds like it's likely a problem within the PDFBox software. Again, the 'filter-media' script does nothing "magical" in parsing PDFs...all the parsing is performed by PDFBox.

> My question is this: Does anyone know if there a size limit/cutoff
> point where a document is TOO large to be filtered?? If there is, then
> I have no idea what to do about these documents. The largest document
> in our repository is 1,862,628,176 bytes or 1.9 GB. I don’t suppose a
> .zip file can be filtered???

I'm not aware of a size limit. Most likely, for larger files PDFBox will either work or throw a "Java Heap Space" error (in which case DSpace will skip over that document).

At this point in time, DSpace *doesn't* filter ZIP files. Currently, there are only filters built for PDF, Word and HTML.
I've also built a custom one for OpenOffice.org files (which requires the OpenOffice.org software to be installed on the box running DSpace...it's not yet available in out-of-the-box DSpace):
https://services.ideals.uiuc.edu/wiki/bin/view/IDEALS/Internal/OpenOfficeConvert

So, I know this entire message sounds like I'm blaming PDFBox for all of your PDF problems (in a way I guess I am). I'm sorry this really doesn't *resolve* any of the problems you are encountering (but at least it lets you know you are not alone in these issues).

Personally, because of the lack of activity around the PDFBox software, I feel DSpace should start to investigate *other means* to filter PDFs. If anyone out there is aware of software that could *replace* PDFBox, I think that would be worth closer investigation.

I'd love to help DSpace get around these filtering issues...we just need to come up with some potential solutions. Thoughts or ideas are welcome.

- Tim

--
Tim Donohue
Research Programmer, IDEALS
http://www.ideals.uiuc.edu/
University of Illinois
[EMAIL PROTECTED] | (217) 333-4648

_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech
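
For anyone who wants to see roughly what the "skip and continue" workaround discussed above amounts to, here is a minimal Java sketch. To be clear, this is *not* Graham's patch and *not* the real MediaFilterManager code. The names FilterSketch, TextExtractor and Outcome are made up for illustration, and TextExtractor is just a stand-in for whatever call actually hands the PDF to PDFBox. It only shows the general idea: catch the heap-space error for one document, record it as skipped, and move on to the next one. It also flags blank output, which (as noted above) may mean either a parse failure or an image-only PDF.

```java
import java.util.ArrayList;
import java.util.List;

class FilterSketch {

    /** Hypothetical stand-in for the real text extraction (PDFBox, in DSpace's case). */
    interface TextExtractor {
        String extract(String documentName) throws Exception;
    }

    /** Possible results for a single document (names are illustrative only). */
    enum Outcome { EXTRACTED, BLANK, SKIPPED }

    /** Try to extract text from one document without letting a failure stop the run. */
    static Outcome filterOne(TextExtractor extractor, String documentName) {
        try {
            String text = extractor.extract(documentName);
            if (text == null || text.trim().isEmpty()) {
                // No usable text: either the parser found nothing,
                // or the PDF is image-based and was never OCR'd.
                return Outcome.BLANK;
            }
            return Outcome.EXTRACTED;
        } catch (OutOfMemoryError e) {
            // "Java heap space": nothing more can be done for this document,
            // so skip it. Best effort only; the JVM may already be under strain.
            return Outcome.SKIPPED;
        } catch (Exception e) {
            // Any other parsing failure is also treated as a skip.
            return Outcome.SKIPPED;
        }
    }

    /** Process every document and return the names that were not extracted cleanly. */
    static List<String> filterAll(TextExtractor extractor, List<String> documentNames) {
        List<String> problems = new ArrayList<>();
        for (String name : documentNames) {
            Outcome outcome = filterOne(extractor, name);
            if (outcome != Outcome.EXTRACTED) {
                problems.add(name + " (" + outcome + ")");
            }
        }
        return problems;
    }
}
```

Collecting the skipped and blank documents into a list, rather than only logging them, would make it easier to go back afterwards and decide which ones need OCR and which ones PDFBox simply cannot handle.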

