Hi,
Our graphics unit is experimenting with scanning old theses for
inclusion in our repository. They have just uploaded the first scanned
thesis (http://hdl.handle.net/1893/340), but DSpace doesn't appear to be
indexing the thesis text. [The thesis was uploaded yesterday and
index-all runs nightly as a cron job; I've also just run index-all from
the command line to make sure.]
I'm not involved in the digitisation side, so I'm not 100% sure what
they are doing (and the person who did it is off on holiday now, so I
can't ask them), but the PDF file appears to contain the content both as
scanned images (for accurate reproduction) and as embedded OCR'd text (for
searching, accessibility etc.). Even though the displayed page is
obviously an image, it is possible to select text, copy and paste it
(although I can see obvious OCR errors in the pasted text), and
search the PDF file directly from Acrobat.
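For anyone wanting to check the "image plus hidden OCR text" guess outside
Acrobat, here is a rough Python sketch. It is only a heuristic and says
nothing about how DSpace itself extracts text: it assumes that an image-only
scan carries no /Font resources, while a PDF with an OCR text layer does. The
function names (iter_pdf_chunks, probably_has_text_layer) are made up for
illustration, and encrypted or unusually encoded PDFs may fool it.

```python
import re
import zlib

# Matches the body of each PDF stream object (page content, images, etc.).
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def iter_pdf_chunks(data: bytes):
    """Yield the raw file bytes, then every stream body that inflates with zlib."""
    yield data
    for match in STREAM_RE.finditer(data):
        try:
            # decompressobj tolerates the newline that precedes "endstream".
            yield zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            # Not Flate-encoded (for example a JPEG page image); skip it.
            continue


def probably_has_text_layer(data: bytes) -> bool:
    """Heuristic (assumption): a /Font entry suggests the PDF carries text."""
    return any(b"/Font" in chunk for chunk in iter_pdf_chunks(data))
```

Usage would be along the lines of
probably_has_text_layer(open("thesis.pdf", "rb").read()), where thesis.pdf
stands in for a local copy of the scanned file. A True result only suggests
that a text layer exists; it does not show that the text is usable.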
Has anyone come across this type of PDF file before (or is there
something more subtle going on here that I've missed)? If the PDF file
does indeed also contain the OCR'd text, any idea how to get DSpace to
index it? If not, is there any advice I should be giving to the folk
doing the digitisation to help them produce more DSpace-friendly
PDFs?
Thanks as ever,
Mike
Michael White
eLearning Developer
Centre for eLearning Development (CeLD)
S7, The Library
University of Stirling
Stirling SCOTLAND
FK9 4LA
Email: [EMAIL PROTECTED]
Tel: +44 (0) 1786 466877
Fax: +44 (0) 1786 466880
http://www.is.stir.ac.uk/celd/