Alice,

I did some digging around and it turns out that DSpace is using PDFBox to do 
the text extraction. Back in 2007, this bug was reported in PDFBox:

https://issues.apache.org/jira/browse/PDFBOX-234

And it looks to have been fixed in PDFBox version 0.8.x. Our installed version 
of DSpace (1.6.1) is using PDFBox version 0.7.3.

Digging through the DSpace Jira site, I found this issue, which indicates that 
the problem is fixed in DSpace 1.7.1:

https://jira.duraspace.org/browse/DS-704

DSpace now includes a much later version of PDFBox (1.2.1).

I guess it's time to upgrade!
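
In the meantime, here is a rough sketch for spotting which extracted *.txt 
files have the run-together text Alvin describes. It is plain Python and not 
anything that DSpace or PDFBox ships. The function names and the 5% threshold 
are my own assumptions, based only on the fact that normal English prose 
contains far more whitespace than the broken output does.

```python
from pathlib import Path


def whitespace_ratio(text: str) -> float:
    """Return the fraction of characters that are whitespace (0.0 if empty)."""
    if not text:
        return 0.0
    return sum(ch.isspace() for ch in text) / len(text)


def looks_run_together(text: str, min_ratio: float = 0.05) -> bool:
    """Heuristic: flag non-empty text whose whitespace share is below min_ratio.

    The 0.05 default is an assumed cutoff, not a measured one; tune it
    against a few known-good and known-bad files.
    """
    return bool(text.strip()) and whitespace_ratio(text) < min_ratio


def find_suspect_files(directory: str) -> list[Path]:
    """List *.txt files under directory whose contents look run together."""
    return [
        path
        for path in sorted(Path(directory).rglob("*.txt"))
        if looks_run_together(path.read_text(encoding="utf-8", errors="replace"))
    ]
```

Pointing find_suspect_files() at a folder of exported text files should give 
a list of candidates to re-run through filter-media after the upgrade. It is 
only a heuristic, so check a few of the flagged files by eye.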

--Joel


Joel Richard
IT Specialist, Web Services Department
Smithsonian Institution Libraries | http://www.sil.si.edu/
(202) 633-1706 | [email protected]




On Apr 12, 2011, at 3:53 PM, Platt, Alice wrote:

I have also run across this problem – it seems like even though my PDFs have 
readable text, DSpace chooses to OCR the text on its own, resulting in a lot of 
errors.

Alice Platt
Digital Initiatives Librarian
Shapiro Library
Southern New Hampshire University
2500 North River Rd
Manchester, NH 03106
USA

From: Hutchinson, Alvin [mailto:[email protected]]
Sent: Tuesday, April 12, 2011 2:30 PM
To: [email protected]
Cc: Richard, Joel M
Subject: [Dspace-general] Filter Media Text Error

In recent weeks we have uploaded content (PDF) that produces some strange text 
when filter-media is run.

The text in the PDF is selectable and readable but the corresponding *.txt file 
created by filter-media has removed all spaces between words.

So we are unable to search for certain words (e.g. scientific plant or animal 
names) because the terms are all run together in one string.

I have attached both files, but in case the listserv software strips them, an 
example is below.


My question: Has anyone else run across this or can anyone tell me what the 
problem is?

I once thought it was the manner in which these files were scanned, but I am 
able to select, copy and paste the text from the PDF and it maintains word and 
character spacing.



The PDF reads, for example:

larval stages of the Xanthidae are better known than those
of any other family of the Brachyura. This doubtless is due to the
fact that the adults habitually are found in shallow water near
the shore and usually are very abundant. Ovigerous females may
be taken without trouble, and thus the early zoeal stages may be
known with certainty.

But the lines from the corresponding *.txt file show:

larvalstagesoftheXanthidaearebetterknownthanthoseofanyotherfamilyoftheBrachyura.Thisdoubtlessisduetothefactthattheadultshabituallyarefoundinshallowwaterneartheshoreandusuallyareveryabundant.Ovigerousfemalesmay
betakenwithouttrouble,andthustheearlyzoealstagesmaybeknownwithcertainty


Thanks in advance for any help.

Alvin Hutchinson
Smithsonian Institution Libraries
(202) 633-1031





_______________________________________________
Dspace-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-general
