See also http://wiki.tei-c.org/index.php/Heuristics , which discusses
this problem more broadly conceived. I've just added a link to the
archives of this very discussion. --Kevin
On 6/18/15 12:52 PM, Matt Sherman wrote:
The hope is to take these bibliographies and put them into a more web-searchable/sortable format that researchers can make use of. My
colleague was taking some inspiration from the Marlowe Bibliography (
https://marlowebibliography.org/), though we are hoping to get a
bit more robust with the bibliography we are working on. The important
first step is to be able to parse the existing OCRed bibliography scans we
have into a database, possibly a custom XML format, but a database will
probably be easier to append to and expand down the road.
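As a rough sketch of that parsing step, the snippet below matches a simple "Author. Title. Place: Publisher, Year." entry shape with a regular expression and loads the fields into SQLite. The entry pattern, field names, and table schema are illustrative assumptions, not anything specified in the thread; a real OCRed bibliography would need a pattern tuned to its actual citation style.

```python
# Sketch: parse OCRed bibliography lines with a regex and load them into
# SQLite. The ENTRY pattern and the schema below are assumptions for
# illustration only.
import re
import sqlite3

ENTRY = re.compile(
    r"^(?P<author>[^.]+)\.\s+"    # author, up to the first period
    r"(?P<title>[^.]+)\.\s+"      # title, up to the next period
    r".*?(?P<year>\d{4})\.?\s*$"  # a four-digit year near the end
)

def parse_entries(lines):
    """Return (author, title, year) tuples for lines matching the pattern."""
    rows = []
    for line in lines:
        m = ENTRY.match(line.strip())
        if m:
            rows.append((m.group("author"), m.group("title"),
                         int(m.group("year"))))
    return rows

def load(rows, db=":memory:"):
    """Insert parsed rows into a (hypothetical) entry table."""
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS entry "
                "(author TEXT, title TEXT, year INTEGER)")
    con.executemany("INSERT INTO entry VALUES (?, ?, ?)", rows)
    con.commit()
    return con
```

Lines that fail to match are simply skipped here; in practice you would log them for hand-correction, since OCR noise guarantees some entries won't parse.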
On Thu, Jun 18, 2015 at 1:11 PM, Kyle Banerjee <[email protected]>
wrote:
How you want to preprocess and structure the data depends on what you hope
to achieve. Can you say more about what you want the end product to look
like?
kyle
On Thu, Jun 18, 2015 at 10:08 AM, Matt Sherman <[email protected]>
wrote:
That is a pretty good summation of it, yes. I appreciate the suggestions;
this is a bit of a new realm for me, and while I know what I want it to do
and the structure I want to put it in, the conversion process has been
eluding me, so thanks for giving me some tools to look into.
On Thu, Jun 18, 2015 at 1:04 PM, Eric Lease Morgan <[email protected]>
wrote:
On Jun 18, 2015, at 12:02 PM, Matt Sherman <[email protected]>
wrote:
I am working with a colleague on a side project which involves some scanned
bibliographies and making them more web searchable/sortable/browse-able.
While I am quite familiar with the metadata and organization aspects we
need, I am at a bit of a loss on how to automate the process of putting
the bibliography into a more structured format so that we can avoid going
through hundreds of pages by hand. I am pretty sure regular expressions
are needed, but I have not had an instance where I needed to automate
extracting data from one file type (PDF OCR or text extracted to a Word
doc) and place it into another (either a database or an XML file) with
some enrichment. I would appreciate any suggestions for approaches or
tools to look into. Thanks for any help/thoughts people can give.
If I understand your question correctly, then you have two problems to
address: 1) converting PDF, Word, etc. files into plain text, and 2)
marking up the result (which is a bibliography) into structured data.
Correct?
If so, and if your PDF documents have already been OCRed (or you have
other files), then you can probably feed them to Tika to quickly and
easily extract the underlying plain text. [1] I wrote a brain-dead shell
script to run Tika in server mode and then convert Word (.docx) files. [2]
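For what a call to a running Tika server might look like outside the shell, here is a small Python sketch: it PUTs a file to the server's /tika endpoint with an Accept: text/plain header and lightly cleans the result. The localhost URL is tika-server's default; the dehyphenation rule is an assumed OCR cleanup, not something from Eric's script.

```python
# Sketch: send a file to a running Tika server and tidy the plain text.
# Assumes tika-server is listening on its default port, 9998.
import re
import urllib.request

TIKA_URL = "http://localhost:9998/tika"

def extract_text(path, url=TIKA_URL):
    """PUT one file to Tika and return the extracted plain text."""
    with open(path, "rb") as f:
        req = urllib.request.Request(
            url, data=f.read(), method="PUT",
            headers={"Accept": "text/plain"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

def dehyphenate(text):
    """Rejoin words the OCR split across line breaks ("biblio-\\ngraphy")."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

if __name__ == "__main__":
    # Hypothetical input file name, for illustration only.
    print(dehyphenate(extract_text("bibliography.docx")))
```

The cleaned text would then be the input to whatever entry-parsing step comes next.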
When it comes to marking up the result into structured data, well, good
luck. I think such an application is something Library Land has sought for
a long time. “Can you say Holy Grail?”
[1] Tika - https://tika.apache.org
[2] brain-dead script -
https://gist.github.com/ericleasemorgan/c4e34ffad96c0221f1ff
—
Eric