Hi Susan,

These are long-standing known issues with PDF text extraction. Both are due to bugs in the underlying libraries, and not necessarily to the content or size of the PDFs.

For the heap space issue, a new configuration option was added in DSpace 1.5. If you add the following to your dspace.cfg:

pdffilter.skiponmemoryexception=true

then it will skip the PDF when an out of memory exception occurs, rather than failing the process.
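
To illustrate the behaviour that option describes, here is a minimal sketch. It is not the actual PDFFilter source: the class name, the Extractor interface and the file names are all made up for the example, and the out-of-memory condition is simulated rather than real.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; this is NOT the DSpace PDFFilter code.
class SkipOnMemoryErrorSketch {

    /** Stand-in for the text extractor (the real filter calls PDFBox). */
    interface Extractor {
        String extract(String pdfName);
    }

    /**
     * Runs the extractor over each file name. If skipOnMemoryException is
     * true, an OutOfMemoryError on one file is logged and that file is
     * skipped; otherwise the error propagates and ends the whole run.
     */
    static List<String> filterAll(List<String> names, Extractor extractor,
                                  boolean skipOnMemoryException) {
        List<String> filtered = new ArrayList<>();
        for (String name : names) {
            try {
                extractor.extract(name);
                filtered.add(name);
            } catch (OutOfMemoryError e) {
                if (!skipOnMemoryException) {
                    throw e;
                }
                System.err.println("Skipping " + name + ": " + e.getMessage());
            }
        }
        return filtered;
    }

    public static void main(String[] args) {
        // Simulated extractor: pretends that one file exhausts the heap.
        Extractor fake = name -> {
            if (name.equals("big.pdf")) {
                throw new OutOfMemoryError("Java heap space (simulated)");
            }
            return "text of " + name;
        };
        System.out.println(filterAll(List.of("a.pdf", "big.pdf", "c.pdf"), fake, true));
    }
}
```

With the flag set to false, the same simulated error would end the run at big.pdf, which matches what happens without the option.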

But there isn't anything that we can do to extract data from PDFs where the errors are occurring.

Note that if you aren't running DSpace 1.5, you might want to make changes to your local PDFFilter class, in line with the diff here:

http://fisheye3.atlassian.com/browse/dspace/branches/dspace-1_5_x/dspace-api/src/main/java/org/dspace/app/mediafilter/PDFFilter.java?r1=2260&r2=2581

G

On 9 Oct 2008, at 21:20, Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS] wrote:

We’ve been having a problem with filter-media for as long as I can remember, with DSpace 1.3.1 and now with DSpace 1.4.2. I’ve emailed the list and discussed this problem with some of the developers before, but we’ve never had a resolution. I’ve been doing some more research on it myself for the past day or so and here are some interesting things that I’ve found:

99% of our documents are .pdf files. filter-media seems to fail with two different types of errors:
a. Java heap space – memory error
b. Possibly unreadable character(s) error or problem with the actual format and/or scanning of the document

filter-media does not actually fail with error type (b.) above, but it does fail with error type (a.). This error has resulted in hundreds, maybe thousands of our documents not being filtered and, consequently, not being full-text searchable.

I used to think that the memory error might be caused by our repository being fairly large (right now we have a total of 101,633 Items and are in the process of loading thousands more), and that the memory problem showed up *after* filtering lots of documents, once the process had eaten up all the memory. Today I figured out that is absolutely not the problem.

In an attempt to get all the unfiltered documents filtered, I wrote a SQL query that created a filter-media execution line (“$BINDIR/dsrun org.dspace.app.mediafilter.MediaFilterManager -n -i 2121/68481 [EMAIL PROTECTED]) for each individual Item in DSpace that did NOT have a $$$$$$$.pdf.txt document in the Bitstream table, then I copied all these lines into one script and ran it. So filter-media executes over and over again with the -i option (where you specify the handle you want filtered), once for each document that hadn’t been previously filtered. What I found is that the errors were occurring on the filtering of a *single* document and were not caused by a “memory accumulation” effect.
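
The one-command-per-Item approach described above can be sketched roughly as follows. This is only an illustration: the class and method names are made up, the second handle is invented, and the part of the original command line that the archive shows as [EMAIL PROTECTED] is left out because it is not recoverable. The real list of handles would come from the SQL query.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: builds one filter-media command line per handle.
class FilterMediaScriptSketch {

    // Command prefix as quoted in the message; the redacted tail is omitted.
    static final String PREFIX =
        "$BINDIR/dsrun org.dspace.app.mediafilter.MediaFilterManager -n -i ";

    /** Returns one command line for each handle, in the order given. */
    static List<String> commandsFor(List<String> handles) {
        List<String> lines = new ArrayList<>();
        for (String handle : handles) {
            lines.add(PREFIX + handle);
        }
        return lines;
    }

    public static void main(String[] args) {
        // 2121/68481 is from the message; 2121/68482 is an invented example.
        List<String> unfiltered = List.of("2121/68481", "2121/68482");
        commandsFor(unfiltered).forEach(System.out::println);
    }
}
```

Because each line starts a fresh JVM, an out-of-memory failure on one Item cannot be blamed on memory left over from earlier Items.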

In looking at some of the documents that were causing the errors, it appears that perhaps it is the larger documents that are getting the Java heap space error, although I’m not quite sure of this. Here is one of the errors that occurred:


Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.nio.CharBuffer.wrap(CharBuffer.java:350)
        at java.nio.CharBuffer.wrap(CharBuffer.java:373)
        at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:138)
        at java.lang.StringCoding.decode(StringCoding.java:173)
        at java.lang.String.<init>(String.java:444)
        at java.lang.String.<init>(String.java:516)
        at org.fontbox.cmap.CMapParser.createStringFromBytes(CMapParser.java:418)
        at org.fontbox.cmap.CMapParser.parse(CMapParser.java:152)
        at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:535)
        at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
        at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
        at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
        at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
        at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
        at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
        at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
        at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
        at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:142)
        at org.dspace.app.mediafilter.MediaFilter.processBitstream(MediaFilter.java:169)
        at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:344)
        at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:313)
        at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:280)
        at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:219)


Seems a lot of the Googling I’ve been doing on this indicates either the document is too large to be filtered, or there are some strings in the document that are too large for the String or Substring it’s trying to do.


The other errors seem to be caused by, perhaps, non-readable characters (maybe a bad scan of the document..??) or something actually wrong with the scanned document. Here are some of those errors:

ERROR filtering, skipping bitstream #46251 java.io.IOException: Error expected floating point number actual='110.-21'

ERROR filtering, skipping bitstream #46372 java.io.StreamCorruptedException: Error: data is null

ERROR filtering, skipping bitstream #46675 java.io.IOException: Error expected floating point number actual='98.-46'

ERROR filtering, skipping bitstream #46823 java.io.IOException: Error: Expected operator 'ID' actual='IM'

ERROR filtering, skipping bitstream #51652 java.io.EOFException: Unexpected end of ZLIB input stream (Sue: WHAT??!!)

ERROR filtering, skipping bitstream #46894 java.io.IOException: Error getting pdf version:java.lang.NumberFormatException: For input string: "fi" (Sue: Wow! This is interesting…..??)

ERROR filtering, skipping bitstream #46938 java.io.IOException: Error: Expected operator 'ID' actual='IM'
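
To see which of these error types dominates, the log lines can be tallied by exception class. The following is a rough sketch that assumes every line has the "ERROR filtering, skipping bitstream #N exception: message" shape shown above; the class and method names are made up, and the sample input reuses four of the lines listed here.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts "skipping bitstream" errors per exception class.
class FilterErrorTallySketch {

    // Group 1 is the bitstream id, group 2 the first exception class name.
    static final Pattern ERROR_LINE = Pattern.compile(
        "^ERROR filtering, skipping bitstream #(\\d+) ([\\w.$]+):");

    /** Returns exception class name -> number of matching log lines. */
    static Map<String, Integer> tally(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = ERROR_LINE.matcher(line);
            if (m.find()) {
                counts.merge(m.group(2), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "ERROR filtering, skipping bitstream #46251 java.io.IOException: Error expected floating point number actual='110.-21'",
            "ERROR filtering, skipping bitstream #46372 java.io.StreamCorruptedException: Error: data is null",
            "ERROR filtering, skipping bitstream #46675 java.io.IOException: Error expected floating point number actual='98.-46'",
            "ERROR filtering, skipping bitstream #51652 java.io.EOFException: Unexpected end of ZLIB input stream");
        tally(sample).forEach((type, n) -> System.out.println(n + "  " + type));
    }
}
```

Only the outermost exception class is counted, so a line with a nested cause (such as the NumberFormatException one) is tallied under java.io.IOException.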


I am going to have a few of these documents rescanned to see if that will correct the problem, however I have no idea how to correct the heap space error. Here’s what our “dsrun” looks like:

java -Xmx3072m -Dfile.encoding=UTF-8 -classpath $FULLPATH "$@"
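
One quick check is whether the JVM started by dsrun actually gets the heap size requested with -Xmx. A small sketch (the class name is made up; it would have to be compiled, put on the classpath, and launched through the same dsrun wrapper to be meaningful):

```java
// Hypothetical check: prints the heap limit and encoding the JVM is using.
class HeapCheckSketch {

    /** Maximum heap the JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MiB): " + maxHeapMiB());
        System.out.println("file.encoding:  " + System.getProperty("file.encoding"));
    }
}
```

The reported figure may differ slightly from the -Xmx value, but a number far below 3072 would suggest the option is not reaching the JVM.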

We are running PostgreSQL 8.2.5 on Sun Solaris 10 with DSpace 1.4.2 (and gearing up for 1.5).

Can anyone help with this? This is a serious problem for us, since like I said, it is causing our full-text searchability to be inaccurate/incomplete.

Thanks in advance,
Sue





Sue Walker-Thornton
ConITS Contract
NASA Langley Research Center
Integrated Library Systems Application & Database Administrator
130 Research Drive
Hampton, VA  23666
Office: (757) 224-4074
Fax:    (757) 224-4001
Pager: (757) 988-2547
Email:  [EMAIL PROTECTED]

-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech
