We're actually also working on getting a bibliography from a Word Doc to a
more structured format. We're using regular expressions in LibreOffice
Writer to mark up the citations, then insert tabs between the elements,
and then copy into a spreadsheet (similar to what's described in
http://programminghistorian.org/lessons/understanding-regular-expressions).
However, our bibliography was originally a Word Doc, not OCRed text. This
method is pretty reliant on consistent formatting, though, so messy OCR
could complicate things. Another thing to note is that it's easiest when
you know what format the citation is for (e.g., a book or article), since
that impacts how the citation is structured. I'd be happy to provide a
sample citation in each step of the process.
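
To give a rough idea of the mark-up-then-tab step, here is a minimal
Python sketch. It is not our actual LibreOffice workflow, just the same
idea in script form. The pattern assumes one hypothetical book style
(Author. Title. Place: Publisher, Year.), and the sample citation is
invented purely for illustration; a real bibliography would need one
pattern per citation type.

```python
import re

# Hypothetical pattern for a single, assumed book-citation style:
#   Author. Title. Place: Publisher, Year.
# Other styles (articles, chapters, authors with initials) need their own patterns.
BOOK = re.compile(
    r"^(?P<author>[^.]+)\.\s+"      # everything up to the first period
    r"(?P<title>.+?)\.\s+"          # shortest run that ends in a period
    r"(?P<place>[^:.]+):\s+"        # place of publication, before the colon
    r"(?P<publisher>[^,]+),\s+"     # publisher, before the comma
    r"(?P<year>\d{4})\.$"           # four-digit year and closing period
)


def to_tsv(citation):
    """Return the citation as tab-separated fields, or None if it does not match."""
    m = BOOK.match(citation.strip())
    if m is None:
        return None
    return "\t".join(m.group("author", "title", "place", "publisher", "year"))


# Invented placeholder citation, not taken from any real bibliography.
sample = "Doe, Jane. An Example Title. Springfield: Example Press, 1999."
print(to_tsv(sample))
```

Lines that come back as None are the ones to check by hand, which is
where inconsistent formatting or messy OCR tends to show up.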

All the best,
Bonnie



On 6/18/15, 1:52 PM, "Matt Sherman" <[email protected]> wrote:

>The hope is to take these bibliographies and put them into more of a web
>searchable/sortable format for researchers to make use out of them.  My
>colleague was taking some inspiration from the Marlowe Bibliography (
>https://marlowebibliography.org/), though we are hoping to possibly get a
>bit more robust with the bibliography we are working on.  The important
>first step is to be able to parse the existing OCRed bibliography scans we
>have into a database, possibly a custom XML format but a database will
>probably be easier to append and expand down the road.
>
>On Thu, Jun 18, 2015 at 1:11 PM, Kyle Banerjee <[email protected]>
>wrote:
>
>> How you want to preprocess and structure the data depends on what you
>>hope
>> to achieve. Can you say more about what you want the end product to look
>> like?
>>
>> kyle
>>
>> On Thu, Jun 18, 2015 at 10:08 AM, Matt Sherman
>><[email protected]>
>> wrote:
>>
>> > That is a pretty good summation of it yes.  I appreciate the
>>suggestions,
>> > this is a bit of a new realm for me and while I know what I want it
>>to do
>> > and the structure I want to put it in, the conversion process has been
>> > eluding me so thanks for giving me some tools to look into.
>> >
>> > On Thu, Jun 18, 2015 at 1:04 PM, Eric Lease Morgan <[email protected]>
>> wrote:
>> >
>> > > On Jun 18, 2015, at 12:02 PM, Matt Sherman
>><[email protected]>
>> > > wrote:
>> > >
>> > > > I am working with a colleague on a side project which involves some
>> > scanned
>> > > > bibliographies and making them more web
>> > searchable/sortable/browse-able.
>> > > > While I am quite familiar with the metadata and organization
>>aspects
>> we
>> > > > need, but I am at a bit of a loss on how to automate the process
>>of
>> > > putting
>> > > > the bibliography in a more structured format so that we can avoid
>> going
>> > > > through hundreds of pages by hand.  I am pretty sure regular
>> > expressions
>> > > > are needed, but I have not had an instance where I need to
>>automate
>> > > > extracting data from one file type (PDF OCR or text extracted to
>>Word
>> > > doc)
>> > > > and place it into another (either a database or an XML file) with
>> some
>> > > > enrichment.  I would appreciate any suggestions for approaches or
>> tools
>> > > to
>> > > > look into.  Thanks for any help/thoughts people can give.
>> > >
>> > >
>> > > If I understand your question correctly, then you have two problems
>>to
>> > > address: 1) converting PDF, Word, etc. files into plain text, and 2)
>> > > marking up the result (which is a bibliography) into structured data.
>> > > Correct?
>> > >
>> > > If so, then if your PDF documents have already been OCRed, or if you
>> have
>> > > other files, then you can probably feed them to TIKA to quickly and
>> > easily
>> > > extract the underlying plain text. [1] I wrote a brain-dead shell
>> script
>> > to
>> > > run TIKA in server mode and then convert Word (.docx) files. [2]
>> > >
>> > > When it comes to marking up the result into structured data, well,
>>good
>> > > luck. I think such an application is something Library Land sought
>>for
>> a
>> > > long time. "Can you say Holy Grail?"
>> > >
>> > > [1] Tika - https://tika.apache.org
>> > > [2] brain-dead script -
>> > > https://gist.github.com/ericleasemorgan/c4e34ffad96c0221f1ff
>> > >
>> > > --
>> > > Eric
>> > >
>> >
>>
