Hi Susan,
These are long-known issues with PDF text extraction. Both are due
to bugs in the underlying libraries, and not necessarily to a
problem with the PDF content or size.
For the heap space issue, a new configuration option was added to
DSpace 1.5 - if you add to your dspace.cfg:
pdffilter.skiponmemoryexception=true
then it will skip the PDF when an out of memory exception occurs,
rather than failing the process.
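As a rough illustration of what that option changes, here is a minimal
sketch. It is not the actual PDFFilter code: the class name, the method
and the stand-in extractor are all made up for the example, and the only
thing taken from the above is the idea of a boolean switch that decides
whether an OutOfMemoryError stops the run or just skips one file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration only; this is not the DSpace PDFFilter source.
final class SkipOnMemorySketch {

    /**
     * Runs the extractor over each input. When skipOnMemoryException is true,
     * an OutOfMemoryError from one input is reported and that input is skipped;
     * otherwise the error propagates and ends the whole run.
     */
    static List<String> extractAll(List<String> inputs,
                                   Function<String, String> extractor,
                                   boolean skipOnMemoryException) {
        List<String> results = new ArrayList<>();
        for (String input : inputs) {
            try {
                results.add(extractor.apply(input));
            } catch (OutOfMemoryError e) {
                if (!skipOnMemoryException) {
                    throw e; // old behaviour: the whole process fails
                }
                System.err.println("Skipping " + input + ": " + e.getMessage());
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Stand-in extractor that simulates a heap failure for one "document".
        Function<String, String> extractor = name -> {
            if (name.equals("huge.pdf")) {
                throw new OutOfMemoryError("Java heap space (simulated)");
            }
            return "text of " + name;
        };
        List<String> docs = List.of("a.pdf", "huge.pdf", "b.pdf");
        System.out.println(extractAll(docs, extractor, true));
    }
}
```

With the flag set to true the run carries on past the failing file; with
it set to false the first OutOfMemoryError ends the run, which matches
the behaviour you are seeing now.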
But there isn't anything that we can do to extract data from PDFs
where the errors are occurring.
Note that if you aren't running DSpace 1.5, you might want to make
changes to your local PDFFilter class, in line with the diff here:
http://fisheye3.atlassian.com/browse/dspace/branches/dspace-1_5_x/dspace-api/src/main/java/org/dspace/app/mediafilter/PDFFilter.java?r1=2260&r2=2581
G
On 9 Oct 2008, at 21:20, Thornton, Susan M. (LARC-B702)[NCI
INFORMATION SYSTEMS] wrote:
We’ve been having a problem with filter-media for as long as I
can remember, with DSpace 1.3.1 and now with DSpace 1.4.2. I’ve
emailed the list and discussed this problem with some of the
developers before, but we’ve never had a resolution. I’ve been
doing some more research on it myself for the past day or so and
here are some interesting things that I’ve found:
99% of our documents are .pdf files. filter-media seems to fail
with two different types of errors:
a. Java heap space – memory error
b. Possibly unreadable character(s) error, or a problem with the actual
format and/or scanning of the document
filter-media does not actually fail with error type (b.) above, but
it does fail with error type (a.). This error has resulted in
hundreds, maybe thousands of our documents not being filtered and,
consequently, not being full-text searchable.
I used to think that perhaps the memory error was caused by our
repository being fairly large (right now we have a total of 101,633
Items and are in the process of loading thousands more) – that
perhaps the memory problem resulted *after* filtering lots of
documents – maybe it had eaten up all the memory in the process.
Today I figured out that is absolutely not the problem. In an
attempt to get all the unfiltered documents filtered, I wrote a SQL
query that generated a filter-media execution line
(“$BINDIR/dsrun org.dspace.app.mediafilter.MediaFilterManager -n -i
2121/68481 [EMAIL PROTECTED]) for each individual Item in DSpace that did NOT have
a $$$$$$$.pdf.txt document in the Bitstream table, then I copied all
these lines into one script and ran it. So basically filter-media
executes over and over again with the -i option (where you specify
the handle you want filtered), once for each document that hadn’t
been previously filtered. What I found is that the errors were
occurring on the filtering of a *single* document and were not
caused by a “memory accumulation” effect.
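To make the approach concrete, here is a small sketch of the generation
step. It is not the SQL query I actually ran, just an illustration of
the same idea: one command line per handle, so that every Item is
filtered in its own JVM. The class name and the second handle are made
up, and the part of the real command line that was lost after the
handle is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; it only builds the command text and runs nothing.
final class FilterCommandSketch {

    // Command prefix as quoted in the message; the handle is appended per Item.
    static final String PREFIX =
        "$BINDIR/dsrun org.dspace.app.mediafilter.MediaFilterManager -n -i ";

    /** Returns one filter-media command line per handle, in input order. */
    static List<String> commandsFor(List<String> handles) {
        List<String> lines = new ArrayList<>();
        for (String handle : handles) {
            lines.add(PREFIX + handle);
        }
        return lines;
    }

    public static void main(String[] args) {
        // 2121/68481 comes from the message; 2121/68482 is invented for the example.
        for (String line : commandsFor(List.of("2121/68481", "2121/68482"))) {
            System.out.println(line);
        }
    }
}
```

Because each line starts a fresh process, memory used while filtering
one Item cannot carry over to the next.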
In looking at some of the documents that were causing the errors, it
appears that perhaps it is the larger documents that are getting the
Java heap space error, although I’m not quite sure of this. Here is
one of the errors that occurred:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.nio.CharBuffer.wrap(CharBuffer.java:350)
    at java.nio.CharBuffer.wrap(CharBuffer.java:373)
    at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:138)
    at java.lang.StringCoding.decode(StringCoding.java:173)
    at java.lang.String.<init>(String.java:444)
    at java.lang.String.<init>(String.java:516)
    at org.fontbox.cmap.CMapParser.createStringFromBytes(CMapParser.java:418)
    at org.fontbox.cmap.CMapParser.parse(CMapParser.java:152)
    at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:535)
    at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
    at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
    at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
    at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
    at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
    at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
    at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
    at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
    at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:142)
    at org.dspace.app.mediafilter.MediaFilter.processBitstream(MediaFilter.java:169)
    at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:344)
    at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:313)
    at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:280)
    at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:219)
Much of what I’ve found by Googling suggests that either the
document is too large to be filtered, or it contains strings that
are too large for the String or substring operation being attempted.
The other errors seem to be caused by, perhaps, non-readable
characters (maybe a bad scan of the document..??) or something
actually wrong with the scanned document. Here are some of those
errors:
ERROR filtering, skipping bitstream #46251 java.io.IOException: Error expected floating point number actual='110.-21'
ERROR filtering, skipping bitstream #46372 java.io.StreamCorruptedException: Error: data is null
ERROR filtering, skipping bitstream #46675 java.io.IOException: Error expected floating point number actual='98.-46'
ERROR filtering, skipping bitstream #46823 java.io.IOException: Error: Expected operator 'ID' actual='IM'
ERROR filtering, skipping bitstream #51652 java.io.EOFException: Unexpected end of ZLIB input stream (Sue: WHAT??!!)
ERROR filtering, skipping bitstream #46894 java.io.IOException: Error getting pdf version:java.lang.NumberFormatException: For input string: "fi" (Sue: Wow! This is interesting…..??)
ERROR filtering, skipping bitstream #46938 java.io.IOException: Error: Expected operator 'ID' actual='IM'
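With hundreds of these it may help to count them by exception type
before deciding which documents to rescan. Below is a minimal sketch
that assumes every entry has the "ERROR filtering, skipping bitstream
#<id> <exception class>:" shape shown above. The class name is made up,
and the pattern has only been checked against the lines in this
message, not against every message filter-media can print.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical triage helper; not part of DSpace.
final class FilterErrorTally {

    // Group 1 is the bitstream id, group 2 the exception class name.
    // \s+ also matches a line break, so wrapped log entries still match.
    private static final Pattern ERROR_LINE = Pattern.compile(
        "ERROR filtering, skipping bitstream #(\\d+)\\s+([\\w.$]+):");

    /** Counts skipped bitstreams per exception class, in first-seen order. */
    static Map<String, Integer> tally(String log) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = ERROR_LINE.matcher(log);
        while (m.find()) {
            counts.merge(m.group(2), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String log =
            "ERROR filtering, skipping bitstream #46251 java.io.IOException:\n"
          + "Error expected floating point number actual='110.-21'\n"
          + "ERROR filtering, skipping bitstream #46372\n"
          + "java.io.StreamCorruptedException: Error: data is null\n"
          + "ERROR filtering, skipping bitstream #46823 java.io.IOException:\n"
          + "Error: Expected operator 'ID' actual='IM'\n";
        tally(log).forEach((type, n) -> System.out.println(type + " " + n));
    }
}
```

Group 1 of the pattern holds the bitstream id, so the same loop could
collect the ids for each exception type to build a rescan list.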
I am going to have a few of these documents rescanned to see if that
will correct the problem, however I have no idea how to correct the
heap space error. Here’s what our “dsrun” looks like:
java -Xmx3072m -Dfile.encoding=UTF-8 -classpath $FULLPATH "$@"
We are running PostgreSQL 8.2.5 on Sun Solaris 10 with DSpace 1.4.2
(and gearing up for 1.5).
Can anyone help with this? This is a serious problem for us, since
like I said, it is causing our full-text searchability to be
inaccurate/incomplete.
Thanks in advance,
Sue
Sue Walker-Thornton
ConITS Contract
NASA Langley Research Center
Integrated Library Systems Application & Database Administrator
130 Research Drive
Hampton, VA 23666
Office: (757) 224-4074
Fax: (757) 224-4001
Pager: (757) 988-2547
Email: [EMAIL PROTECTED]
-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's
challenge
Build the coolest Linux based applications with Moblin SDK & win
great prizes
Grand prize is a trip for two to an Open Source event anywhere in
the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech
This e-mail is confidential and should not be used by anyone who is not the
original intended recipient. BioMed Central Limited does not accept liability
for any statements made which are clearly the sender's own and not expressly
made on behalf of BioMed Central Limited. No contracts may be concluded on
behalf of BioMed Central Limited by means of e-mail communication. BioMed
Central Limited Registered in England and Wales with registered number 3680030
Registered Office Middlesex House, 34-42 Cleveland Street, London W1T 4LB