Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata and/or a Database

Fitchett, Deborah Tue, 23 Jun 2015 13:53:48 -0700

For turning a bibliography into RIS format, I wrote a tool based on a whole 
pile of regex commands bundled into sed files wrapped in an AppleScript app:

Webpage: http://deborahfitchett.com/toys/ref2ris/ 
Code4Lib article: http://journal.code4lib.org/articles/6286

Let me know if you've got questions about using/adapting it. Both of those 
links also list other tools I found trying to do similar things.
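
For anyone who wants a feel for the regex approach before opening the sed files, here is a minimal Python sketch. It is not the ref2ris code. It assumes one citation per line in a simple "Authors (Year). Title. Journal, Volume(Issue), start-end." shape, and the pattern and function name are made up for illustration. Real OCR output will need many more patterns than this.

```python
import re

# Hypothetical pattern for ONE simple journal-citation layout, e.g.
#   Smith, J., & Jones, K. (2015). Some title. Some Journal, 12(3), 45-67.
# Anything that does not fit this layout is left for manual review.
CITATION = re.compile(
    r"^(?P<authors>.+?)\s+\((?P<year>\d{4})\)\.\s+"
    r"(?P<title>.+?)\.\s+"
    r"(?P<journal>[^,]+),\s+"
    r"(?P<volume>\d+)(?:\((?P<issue>\d+)\))?,\s+"
    r"(?P<start>\d+)\s*-\s*(?P<end>\d+)\.?$"
)


def citation_to_ris(line):
    """Return an RIS record for one citation line, or None if it does not match."""
    m = CITATION.match(line.strip())
    if m is None:
        return None
    ris = ["TY  - JOUR"]
    # Assumes authors are separated by ";" or "&".
    for author in re.split(r"[;&]", m["authors"]):
        author = author.strip(" ,")
        if author:
            ris.append(f"AU  - {author}")
    ris.append(f"PY  - {m['year']}")
    ris.append(f"TI  - {m['title']}")
    ris.append(f"JO  - {m['journal']}")
    ris.append(f"VL  - {m['volume']}")
    if m["issue"]:
        ris.append(f"IS  - {m['issue']}")
    ris.append(f"SP  - {m['start']}")
    ris.append(f"EP  - {m['end']}")
    ris.append("ER  - ")
    return "\n".join(ris)
```

Returning None for lines that do not match makes it easy to collect the leftovers into a separate file and fix them by hand or with another pattern.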

Deborah

-----Original Message-----
From: Code for Libraries [mailto:[email protected]] On Behalf Of Eric 
Lease Morgan
Sent: Friday, 19 June 2015 5:04 a.m.
To: [email protected]
Subject: Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata 
and/or a Database

On Jun 18, 2015, at 12:02 PM, Matt Sherman <[email protected]> wrote:

> I am working with a colleague on a side project which involves some 
> scanned bibliographies and making them more web 
> searchable/sortable/browse-able.
> While I am quite familiar with the metadata and organization aspects 
> we need, I am at a bit of a loss on how to automate the process of 
> putting the bibliography in a more structured format so that we can 
> avoid going through hundreds of pages by hand.  I am pretty sure 
> regular expressions are needed, but I have not had an instance where I 
> need to automate extracting data from one file type (PDF OCR or text 
> extracted to Word doc) and place it into another (either a database or 
> an XML file) with some enrichment.  I would appreciate any suggestions 
> for approaches or tools to look into.  Thanks for any help/thoughts people 
> can give.

If I understand your question correctly, then you have two problems to address: 
1) converting PDF, Word, etc. files into plain text, and 2) marking up the 
result (which is a bibliography) into structured data. Correct?

If so, then if your PDF documents have already been OCRed, or if you have other 
files, then you can probably feed them to TIKA to quickly and easily extract 
the underlying plain text. [1] I wrote a brain-dead shell script to run TIKA in 
server mode and then convert Word (.docx) files. [2]
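
As a rough illustration of the server-mode idea (this is not the script in [2]), the Python sketch below sends a file to a Tika server and asks for plain text back. It assumes a Tika server is already running locally on its default port, 9998, and the function names are hypothetical.

```python
import urllib.request

# Assumption: a Tika server is already running on this machine
# and listening on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_request(path, url=TIKA_URL):
    """Build a PUT request that uploads the file and asks for plain text."""
    with open(path, "rb") as handle:
        data = handle.read()
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path, url=TIKA_URL):
    """Send the file to Tika and return the extracted text."""
    with urllib.request.urlopen(build_request(path, url)) as response:
        return response.read().decode("utf-8")
```

Calling extract_text("bibliography.docx") with the server running should return the document's text as one string, which can then be split into lines for whatever markup step comes next. The file name here is only an example.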

When it comes to marking up the result into structured data, well, good luck. I 
think such an application is something Library Land has sought for a long time. 
"Can you say Holy Grail?"

[1] Tika - https://tika.apache.org
[2] brain-dead script - 
https://gist.github.com/ericleasemorgan/c4e34ffad96c0221f1ff

—
Eric

