I have also run across this problem - it seems like even though my PDFs have
readable text, DSpace chooses to OCR the text on its own, resulting in a lot of
errors.
Alice Platt
Digital Initiatives Librarian
Shapiro Library
Southern New Hampshire University
2500 North River Rd
Manchester, NH 03106
USA
From: Hutchinson, Alvin [mailto:[email protected]]
Sent: Tuesday, April 12, 2011 2:30 PM
To: '[email protected]'
Cc: Richard, Joel M
Subject: [Dspace-general] Filter Media Text Error
In recent weeks we have uploaded content (PDF) that produces some strange text
when filter-media is run.
The text in the PDF is selectable and readable but the corresponding *.txt file
created by filter-media has removed all spaces between words.
So we are unable to search for certain words (e.g. scientific plant or animal
names) because the terms are all run together in one string.
I have attached both files, but in case the listserv software strips them, an
example is below.
My question: Has anyone else run across this or can anyone tell me what the
problem is?
I once thought it was the manner in which these files were scanned, but I am
able to select, copy and paste the text from the PDF and it maintains word and
character spacing.
The PDF reads, for example:
larval stages of the Xanthidae are better known than those
of any other family of the Brachyura. This doubtless is due to the
fact that the adults habitually are found in shallow water near
the shore and usually are very abundant. Ovigerous females may
be taken without trouble, and thus the early zoeal stages may be
known with certainty.
But the corresponding lines in the *.txt file read:
larvalstagesoftheXanthidaearebetterknownthanthoseofanyotherfamilyoftheBrachyura.Thisdoubtlessisduetothefactthattheadultshabituallyarefoundinshallowwaterneartheshoreandusuallyareveryabundant.Ovigerousfemalesmay
betakenwithouttrouble,andthustheearlyzoealstagesmaybeknownwithcertainty
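For anyone who wants to check whether their own extracted text files show the same symptom, here is a rough sketch in Python. It is only a heuristic and is not part of DSpace: the function names and the 40-character threshold are assumptions made for illustration. It flags text in which any whitespace-delimited token is implausibly long, as happens when the spaces between words are lost.

```python
def longest_token(line: str) -> int:
    """Return the length of the longest whitespace-delimited token in a line."""
    return max((len(tok) for tok in line.split()), default=0)


def looks_run_together(text: str, max_token_len: int = 40) -> bool:
    """Heuristic check: True if any line contains a token longer than max_token_len.

    The threshold is an arbitrary assumption; long URLs or identifiers
    in otherwise normal text can trigger false positives.
    """
    return any(longest_token(line) > max_token_len for line in text.splitlines())


# Samples taken from the example in this message.
pdf_sample = "larval stages of the Xanthidae are better known than those"
txt_sample = "larvalstagesoftheXanthidaearebetterknownthanthose"

print(looks_run_together(pdf_sample))
print(looks_run_together(txt_sample))
```

Running the same check over the text of each extracted *.txt file would give a list of items worth inspecting by hand.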
Thanks in advance for any help
Alvin Hutchinson
Smithsonian Institution Libraries
(202) 633-1031
_______________________________________________
Dspace-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-general