Re: [Dspace-tech] Problem with filter-media

Graham Triggs Thu, 09 Oct 2008 14:03:06 -0700

Hi Susan,

These are long-known issues with PDF text extraction. Both are due to bugs in the underlying libraries that are used, and not necessarily to the PDF content or size.

For the heap space issue, a new configuration option was added to DSpace 1.5 - if you add to your dspace.cfg:


pdffilter.skiponmemoryexception=true

then it will skip the PDF when an out of memory exception occurs, rather than failing the process.
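
To illustrate the behaviour this option describes, here is a minimal sketch of "skip on out of memory" logic. It is not the actual PDFFilter code (the diff linked further down is the authoritative change); the class name, the method name and the boolean parameter standing in for pdffilter.skiponmemoryexception are all hypothetical, and the out-of-memory condition is simulated.

```java
import java.util.Optional;
import java.util.function.Supplier;

public class SkipOnMemorySketch {

    /**
     * Runs one extraction step. If skipOnMemoryException is true, an
     * OutOfMemoryError is reported and turned into an empty result so the
     * caller can carry on with the next document; otherwise it is rethrown.
     */
    public static Optional<String> extractOrSkip(Supplier<String> extractor,
                                                 boolean skipOnMemoryException) {
        try {
            return Optional.of(extractor.get());
        } catch (OutOfMemoryError e) {
            if (skipOnMemoryException) {
                System.err.println("Skipping PDF, out of memory: " + e.getMessage());
                return Optional.empty();
            }
            throw e; // option off: the whole run fails, as before
        }
    }

    public static void main(String[] args) {
        // Stand-ins for real extraction; the second one simulates heap exhaustion.
        Supplier<String> goodPdf = () -> "extracted text";
        Supplier<String> badPdf = () -> { throw new OutOfMemoryError("simulated"); };

        System.out.println(extractOrSkip(goodPdf, true).isPresent()); // prints true
        System.out.println(extractOrSkip(badPdf, true).isPresent());  // prints false
    }
}
```

With the flag set to false the simulated error propagates, which corresponds to the run aborting on the first problem document.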

But there isn't anything that we can do to extract data from PDFs where the errors are occurring.

Note that if you aren't running DSpace 1.5, you might want to make changes to your local PDFFilter class, in line with the diff here:


http://fisheye3.atlassian.com/browse/dspace/branches/dspace-1_5_x/dspace-api/src/main/java/org/dspace/app/mediafilter/PDFFilter.java?r1=2260&r2=2581

G

On 9 Oct 2008, at 21:20, Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:

We’ve been having a problem with filter-media for as long as I can remember, with DSpace 1.3.1 and now with DSpace 1.4.2. I’ve emailed the list and discussed this problem with some of the developers before, but we’ve never had a resolution. I’ve been doing some more research on it myself for the past day or so and here are some interesting things that I’ve found:
99% of our documents are .pdf files. filter-media seems to fail with two different types of errors:
a. Java heap space – memory error
b. Possibly unreadable character(s) error or problem with the actual format and/or scanning of the document
filter-media does not actually fail with error type (b.) above, but it does fail with error type (a.). This error has resulted in hundreds, maybe thousands of our documents not being filtered and, consequently, not being full-text searchable.
I used to think that perhaps the memory error was caused by our repository being fairly large (right now we have a total of 101,633 Items and are in the process of loading thousands more) – that perhaps the memory problem resulted *after* filtering lots of documents – maybe it had eaten up all the memory in the process. Today I figured out that is absolutely not the problem. What I did in an attempt to get all the unfiltered documents filtered, is I wrote a SQL query that created a filter-media execution line (“$BINDIR/dsrun org.dspace.app.mediafilter.MediaFilterManager -n -i 2121/68481 [EMAIL PROTECTED]) for each individual Item in DSpace that did NOT have a $$$$$$$.pdf.txt document in the Bitstream table, then I copied all these lines into one script and ran it. So basically what happens is that filter-media executes over and over again, with the -i option (where you specify a handle you want filtered), once for each document that hadn’t been previously filtered. What I found is that the errors were occurring on the filtering of a *single* document and were not caused by a “memory accumulation” effect.
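
The per-item approach described above could be sketched roughly as follows: given a list of handles for unfiltered Items, emit one filter-media command line per handle. This is only an illustration. The original lines were generated by a SQL query that is not shown in the message, the class and method names are hypothetical, the second handle is made up, and the part of the command that the archive obscured after the handle is deliberately left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FilterMediaScriptSketch {

    /** Builds one filter-media invocation for a single handle. */
    public static String commandFor(String handle) {
        return "$BINDIR/dsrun org.dspace.app.mediafilter.MediaFilterManager -n -i " + handle;
    }

    /** Joins one invocation per handle into the body of a shell script. */
    public static String scriptFor(List<String> handles) {
        return handles.stream()
                .map(FilterMediaScriptSketch::commandFor)
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        // The first handle is the one quoted in the message; the second is invented.
        List<String> handles = Arrays.asList("2121/68481", "2121/68482");
        System.out.println(scriptFor(handles));
    }
}
```

Because each line starts a fresh JVM, a heap error on one Item cannot be the result of memory accumulated while filtering earlier Items, which is the point of the experiment.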
In looking at some of the documents that were causing the errors, it appears that perhaps it is the larger documents that are getting the Java heap space error, although I’m not quite sure of this. Here is one of the errors that occurred:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.nio.CharBuffer.wrap(CharBuffer.java:350)
        at java.nio.CharBuffer.wrap(CharBuffer.java:373)
        at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:138)
        at java.lang.StringCoding.decode(StringCoding.java:173)
        at java.lang.String.<init>(String.java:444)
        at java.lang.String.<init>(String.java:516)
        at org.fontbox.cmap.CMapParser.createStringFromBytes(CMapParser.java:418)
        at org.fontbox.cmap.CMapParser.parse(CMapParser.java:152)
        at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:535)
        at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
        at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
        at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
        at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
        at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
        at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
        at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
        at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
        at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:142)
        at org.dspace.app.mediafilter.MediaFilter.processBitstream(MediaFilter.java:169)
        at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:344)
        at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:313)
        at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:280)
        at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:219)
Seems a lot of the Googling I’ve been doing on this indicates either the document is too large to be filtered, or there are some strings in the document that are too large for the String or Substring it’s trying to do.
The other errors seem to be caused by, perhaps, non-readable characters (maybe a bad scan of the document..??) or something actually wrong with the scanned document. Here are some of those errors:
ERROR filtering, skipping bitstream #46251 java.io.IOException: Error expected floating point number actual='110.-21'
ERROR filtering, skipping bitstream #46372 java.io.StreamCorruptedException: Error: data is null
ERROR filtering, skipping bitstream #46675 java.io.IOException: Error expected floating point number actual='98.-46'
ERROR filtering, skipping bitstream #46823 java.io.IOException: Error: Expected operator 'ID' actual='IM'
ERROR filtering, skipping bitstream #51652 java.io.EOFException: Unexpected end of ZLIB input stream (Sue: WHAT??!!)
ERROR filtering, skipping bitstream #46894 java.io.IOException: Error getting pdf version:java.lang.NumberFormatException: For input string: "fi" (Sue: Wow! This is interesting…..??)
ERROR filtering, skipping bitstream #46938 java.io.IOException: Error: Expected operator 'ID' actual='IM'
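
When there are hundreds of lines like these, it can help to tally them by exception type and collect the affected bitstream IDs. The following is a minimal sketch that assumes the log lines follow exactly the "ERROR filtering, skipping bitstream #ID exception: message" shape shown above; the class name and the regular expression are illustrative, not part of DSpace.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FilterErrorTallySketch {

    // Group 1: bitstream ID, group 2: exception class, group 3: message.
    private static final Pattern LINE = Pattern.compile(
            "^ERROR filtering, skipping bitstream #(\\d+)\\s*([\\w.]+):\\s*(.*)$");

    /** Counts skipped bitstreams per exception class; non-matching lines are ignored. */
    public static Map<String, Integer> tally(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line.trim());
            if (m.matches()) {
                counts.merge(m.group(2), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample lines taken from the errors quoted above.
        List<String> log = Arrays.asList(
                "ERROR filtering, skipping bitstream #46251 java.io.IOException: Error expected floating point number actual='110.-21'",
                "ERROR filtering, skipping bitstream #46372 java.io.StreamCorruptedException: Error: data is null",
                "ERROR filtering, skipping bitstream #46823 java.io.IOException: Error: Expected operator 'ID' actual='IM'",
                "ERROR filtering, skipping bitstream #51652 java.io.EOFException: Unexpected end of ZLIB input stream");
        // prints {java.io.EOFException=1, java.io.IOException=2, java.io.StreamCorruptedException=1}
        System.out.println(tally(log));
    }
}
```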
I am going to have a few of these documents rescanned to see if that will correct the problem; however, I have no idea how to correct the heap space error. Here’s what our “dsrun” looks like:
java -Xmx3072m -Dfile.encoding=UTF-8 -classpath $FULLPATH "$@"
We are running PostgreSQL 8.2.5 on Sun Solaris 10 with DSpace 1.4.2 (and gearing up for 1.5).
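
One quick sanity check on the -Xmx setting in the dsrun line is to ask the JVM how much heap it will actually use. The sketch below is a hypothetical helper class, not part of DSpace; it relies only on the standard Runtime.maxMemory() call, and the figure it reports may be somewhat lower than the -Xmx value requested.

```java
public class HeapCheckSketch {

    /** Maximum heap the running JVM will attempt to use, in megabytes. */
    public static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: " + maxHeapMegabytes() + " MB");
    }
}
```

Launching it with the same java options as dsrun shows whether the 3072m request is being honoured on that machine.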
Can anyone help with this? This is a serious problem for us, since like I said, it is causing our full-text searchability to be inaccurate/incomplete.
Thanks in advance,
Sue





Sue Walker-Thornton
ConITS Contract
NASA Langley Research Center
Integrated Library Systems Application & Database Administrator
130 Research Drive
Hampton, VA  23666
Office: (757) 224-4074
Fax:    (757) 224-4001
Pager: (757) 988-2547
Email:  [EMAIL PROTECTED]

-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech






