I have also run across this problem - it seems like even though my PDFs have 
readable text, DSpace chooses to OCR the text on its own, resulting in a lot of 
errors.

Alice Platt
Digital Initiatives Librarian
Shapiro Library
Southern New Hampshire University
2500 North River Rd
Manchester, NH 03106
USA

From: Hutchinson, Alvin [mailto:[email protected]]
Sent: Tuesday, April 12, 2011 2:30 PM
To: '[email protected]'
Cc: Richard, Joel M
Subject: [Dspace-general] Filter Media Text Error


In recent weeks we have uploaded content (PDF) that produces some strange text 
when filter-media is run.

The text in the PDF is selectable and readable, but the corresponding *.txt file 
created by filter-media has all the spaces between words removed.

So we are unable to search for certain words (e.g. scientific plant or animal 
names) because the terms are all run together in one string.

I have attached both files, but in case the listserv software strips them, 
an example is below.

My question: Has anyone else run across this, or can anyone tell me what the 
problem is?

I once thought it was the manner in which these files were scanned, but I am 
able to select, copy and paste the text from the PDF and it maintains word and 
character spacing.

The PDF reads, for example:

larval stages of the Xanthidae are better known than those
of any other family of the Brachyura. This doubtless is due to the
fact that the adults habitually are found in shallow water near
the shore and usually are very abundant. Ovigerous females may
be taken without trouble, and thus the early zoeal stages may be
known with certainty.

But the corresponding lines from the *.txt file show:

larvalstagesoftheXanthidaearebetterknownthanthoseofanyotherfamilyoftheBrachyura.Thisdoubtlessisduetothefactthattheadultshabituallyarefoundinshallowwaterneartheshoreandusuallyareveryabundant.Ovigerousfemalesmay
betakenwithouttrouble,andthustheearlyzoealstagesmaybeknownwithcertainty
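
To find out how many extracted *.txt files show this symptom, one rough 
approach is to check the average length of whitespace-separated tokens, since 
run-together text produces implausibly long "words". The following is only a 
minimal sketch and is not part of DSpace: the function name 
looks_run_together and the threshold of 12 characters are assumptions chosen 
for illustration, and the threshold may need tuning for other languages or 
content.

```python
def looks_run_together(text, max_mean_len=12.0):
    """Heuristic check: return True if the average token length in `text`
    is larger than `max_mean_len` (an assumed, illustrative threshold)."""
    tokens = text.split()  # split on any whitespace
    if not tokens:
        return False  # empty text is not flagged
    mean_len = sum(len(t) for t in tokens) / len(tokens)
    return mean_len > max_mean_len


# Samples taken from the example in this message.
pdf_line = "larval stages of the Xanthidae are better known than those"
txt_line = "betakenwithouttrouble,andthustheearlyzoealstagesmaybeknownwithcertainty"

print(looks_run_together(pdf_line))
print(looks_run_together(txt_line))
```

Running the same function over the contents of each extracted text file would 
give a quick list of candidates to inspect by hand.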





Thanks in advance for any help.

Alvin Hutchinson
Smithsonian Institution Libraries
(202) 633-1031





