Aww, darn George.  I know how you feel.  I was absolutely thrilled after years 
of problems with PDFBox when I finally installed XPDF and *all* of my truly 
non-corrupt documents were filtered in just a couple of hours.  Previously, 
filter-media was taking days and days to complete and had a lot of documents it 
couldn’t filter for some unknown reason.  Then I tried to get pdftoppm to work 
so I could create thumbnails for a particular collection, and I had the same 
problem as you.  I spent like a week trying to get it to work, and finally just 
gave up.  It seems I was getting an error very similar to yours: it worked on 
one machine and not on another, and no matter what I tried it wouldn’t work.  
It was very disappointing.  While it’s been a while and I don’t remember the 
exact details, I do remember thinking it was a problem with the package 
installation on that particular machine, because I was able to trace it down to 
a specific error that said something like “…reinstall your software…”, but 
we’ve never gotten back to it.  And while I can’t be sure in your case, my 
guess would be that there is some, perhaps subtle, difference between the 
environments on the two machines, in the operating system or in some component 
that xpdf needs to execute.

Sorry you had to go back to PDFBox.  If I ever get a chance to work on the 
problem with pdftoppm again and I get it to work, I’ll let you know!

Best regards,
Sue



Sue Walker-Thornton
Software Developer/Database Administrator
NASA Langley Research Center|LITES Contract
(757) 224-4074


From: George Stanley Kozak [mailto:[email protected]]
Sent: Tuesday, February 15, 2011 9:42 AM
To: [email protected]
Subject: Re: [Dspace-tech] Strange problem with xpdf

Hi…

Last week I wrote to the list about a strange problem that I was having using 
xpdf-3.02 with filter-media on my DSpace 1.6.2 instance.  Here is the 
final update for anyone who might be interested.

The problem was that it worked fine on my test server, but text extraction on 
production was failing with the message:
java.io.IOException: pdftotext failed, maybe corrupt PDF? status=9

My test and production machines are virtually mirrors of each other when it 
comes to setup.

I tried reinstalling xpdf on my production machine, but I still couldn’t get 
pdftotext to function properly.  In desperation (because I had a lot of 
recent PDFs that needed to be indexed), I went back to using PDFBox in my 
filter-media, and everything is working fine now.

In the end, I have no idea why xpdf would not work on my production machine, 
but for now my problem is fixed.

George Kozak
Digital Library Specialist
Cornell University Library Information Technologies (CUL-IT)
501 Olin Library
Cornell University
Ithaca, NY 14853
607-255-8924

From: George Stanley Kozak
Sent: Friday, February 11, 2011 10:22 AM
To: [email protected]
Subject: Strange problem with xpdf

Hi…

I am using xpdf-3.02 with filter-media on my DSpace 1.6.2 instance.

On my test server, running filter-media works fine.  On my production server, I 
have discovered that pdftotext is failing with:
java.io.IOException: pdftotext failed, maybe corrupt PDF? status=9
java.io.IOException: pdftotext failed, maybe corrupt PDF? status=9
        at org.dspace.app.mediafilter.XPDF2Text.getDestinationStream(XPDF2Text.java:159)

The same PDFs that can be filtered on the test server do not filter on the 
production server.

I have checked the xpdf binaries and they are correct (I even recompiled them 
on Production).  The libraries seem to be correct.
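
One way to narrow this down would be to run pdftotext outside of DSpace, as the 
same user that runs filter-media, and look at the exit status and anything 
written to stderr.  Below is a minimal Java sketch of that idea.  The class 
name PdftotextProbe and its run helper are hypothetical (they are not part of 
DSpace or xpdf), and the sketch assumes Java 9 or later and that pdftotext 
accepts the usual "pdftotext PDF-file text-file" arguments.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical diagnostic helper; not part of DSpace or xpdf.
class PdftotextProbe {

    /** Exit status plus everything the process wrote to stdout and stderr. */
    static final class Result {
        final int exitCode;
        final String output;

        Result(int exitCode, String output) {
            this.exitCode = exitCode;
            this.output = output;
        }
    }

    /** Runs a command, merging stderr into stdout, and waits for it to exit. */
    static Result run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true);   // capture stderr together with stdout
        Process process = builder.start();
        process.getOutputStream().close();   // the child sees EOF on its stdin
        String output = new String(process.getInputStream().readAllBytes(),
                                   StandardCharsets.UTF_8);
        return new Result(process.waitFor(), output);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java PdftotextProbe <path-to-pdftotext> <pdf-file>");
            System.exit(2);
        }
        // Write the extracted text to a temporary file so only diagnostics are printed.
        Path textFile = Files.createTempFile("pdftotext-probe", ".txt");
        try {
            Result result = run(List.of(args[0], args[1], textFile.toString()));
            System.out.println("exit status: " + result.exitCode);
            System.out.println("pdftotext output:");
            System.out.println(result.output);
            System.out.println("extracted bytes: " + Files.size(textFile));
        } finally {
            Files.deleteIfExists(textFile);
        }
    }
}
```

Running this with the same PDF on both the test and production machines should 
show whether the exit status or the stderr text differs between them.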

Does anyone have any ideas as to why this would work on my test instance and 
not on my production instance?

By the way, I built my instance using “mvn -Pxpdf-mediafilter-support -U clean 
package”

George Kozak
Digital Library Specialist
Cornell University Library Information Technologies (CUL-IT)
501 Olin Library
Cornell University
Ithaca, NY 14853
607-255-8924

_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech
