See also http://wiki.tei-c.org/index.php/Heuristics , which discusses
this problem more broadly conceived. I've just added a link to the
archives of this very discussion. --Kevin
On 6/18/15 12:52 PM, Matt Sherman wrote:
The hope is to take these bibliographies and put them into a more web-searchable/sortable format that researchers can make use of. My
colleague was taking some inspiration from the Marlowe Bibliography (
https://marlowebibliography.org/), though we are hoping to get a
bit more robust with the bibliography we are working on. The important
first step is to be able to parse the existing OCRed bibliography scans we
have into a database, possibly a custom XML format, but a database will
probably be easier to append to and expand down the road.
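As a rough sketch of that parsing step, the snippet below matches a simple "Author. Title. Place: Publisher, Year." entry shape with a regular expression and loads the fields into SQLite. The entry pattern, field names, and table schema are illustrative assumptions, not anything specified in the thread; a real OCRed bibliography would need a pattern tuned to its actual citation style.

```python
# Sketch: parse OCRed bibliography lines with a regex and load them into
# SQLite. The ENTRY pattern and the schema below are assumptions for
# illustration only.
import re
import sqlite3

ENTRY = re.compile(
    r"^(?P<author>[^.]+)\.\s+"    # author, up to the first period
    r"(?P<title>[^.]+)\.\s+"      # title, up to the next period
    r".*?(?P<year>\d{4})\.?\s*$"  # a four-digit year near the end
)

def parse_entries(lines):
    """Return (author, title, year) tuples for lines matching the pattern."""
    rows = []
    for line in lines:
        m = ENTRY.match(line.strip())
        if m:
            rows.append((m.group("author"), m.group("title"),
                         int(m.group("year"))))
    return rows

def load(rows, db=":memory:"):
    """Insert parsed rows into a (hypothetical) entry table."""
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS entry "
                "(author TEXT, title TEXT, year INTEGER)")
    con.executemany("INSERT INTO entry VALUES (?, ?, ?)", rows)
    con.commit()
    return con
```

Lines that fail to match are simply skipped here; in practice you would log them for hand-correction, since OCR noise guarantees some entries won't parse.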
On Thu, Jun 18, 2015 at 1:11 PM, Kyle Banerjee <[email protected]>
wrote:
How you want to preprocess and structure the data depends on what you hope
to achieve. Can you say more about what you want the end product to look
like?
kyle
On Thu, Jun 18, 2015 at 10:08 AM, Matt Sherman <[email protected]>
wrote:
That is a pretty good summation of it, yes. I appreciate the suggestions;
this is a bit of a new realm for me, and while I know what I want it to do
and the structure I want to put it in, the conversion process has been
eluding me, so thanks for giving me some tools to look into.
On Thu, Jun 18, 2015 at 1:04 PM, Eric Lease Morgan <[email protected]>
wrote:
On Jun 18, 2015, at 12:02 PM, Matt Sherman <[email protected]>
wrote:
I am working with a colleague on a side project which involves some scanned
bibliographies and making them more web searchable/sortable/browse-able.
While I am quite familiar with the metadata and organization aspects we
need, I am at a bit of a loss on how to automate the process of putting
the bibliography into a more structured format so that we can avoid going
through hundreds of pages by hand. I am pretty sure regular expressions
are needed, but I have not had an instance where I needed to automate
extracting data from one file type (PDF OCR or text extracted to a Word
doc) and place it into another (either a database or an XML file) with
some enrichment. I would appreciate any suggestions for approaches or
tools to look into. Thanks for any help/thoughts people can give.
If I understand your question correctly, then you have two problems to
address: 1) converting PDF, Word, etc. files into plain text, and 2)
marking up the result (which is a bibliography) into structured data.
Correct?
If so, and if your PDF documents have already been OCRed (or you have
other files), then you can probably feed them to Tika to quickly and
easily extract the underlying plain text. [1] I wrote a brain-dead shell
script to run Tika in server mode and then convert Word (.docx) files. [2]
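For what a call to a running Tika server might look like outside the shell, here is a small Python sketch: it PUTs a file to the server's /tika endpoint with an Accept: text/plain header and lightly cleans the result. The localhost URL is tika-server's default; the dehyphenation rule is an assumed OCR cleanup, not something from Eric's script.

```python
# Sketch: send a file to a running Tika server and tidy the plain text.
# Assumes tika-server is listening on its default port, 9998.
import re
import urllib.request

TIKA_URL = "http://localhost:9998/tika"

def extract_text(path, url=TIKA_URL):
    """PUT one file to Tika and return the extracted plain text."""
    with open(path, "rb") as f:
        req = urllib.request.Request(
            url, data=f.read(), method="PUT",
            headers={"Accept": "text/plain"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

def dehyphenate(text):
    """Rejoin words the OCR split across line breaks ("biblio-\\ngraphy")."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

if __name__ == "__main__":
    # Hypothetical input file name, for illustration only.
    print(dehyphenate(extract_text("bibliography.docx")))
```

The cleaned text would then be the input to whatever entry-parsing step comes next.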
When it comes to marking up the result into structured data, well, good
luck. I think such an application is something Library Land has sought for
a long time. “Can you say Holy Grail?”
[1] Tika - https://tika.apache.org
[2] brain-dead script -
https://gist.github.com/ericleasemorgan/c4e34ffad96c0221f1ff
—
Eric