We're actually also working on getting a bibliography from a Word Doc to a more structured format. We're using regular expressions in LibreOffice Writer to mark up the citations, then insert tabs between the elements, and then copy into a spreadsheet (similar to what's described in http://programminghistorian.org/lessons/understanding-regular-expressions). However, our bibliography was originally a Word Doc, not OCRed text. This method is pretty reliant on consistent formatting, though, so messy OCR could complicate things. Another thing to note is that it's easiest when you know what format the citation is for (e.g., a book or article), since that impacts how the citation is structured. I'd be happy to provide a sample citation in each step of the process.
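A minimal sketch of the tab-insertion idea, written in Python rather than LibreOffice Writer. It assumes a hypothetical book citation shaped like "Author. Title. Place: Publisher, Year." with no periods inside the author or title; the pattern, the function name, and the sample citation are illustrative assumptions, not the actual expressions or data from this project.

```python
import re
from typing import Optional

# Hypothetical pattern for ONE citation type (a book), assumed to look like:
#   Author. Title. Place: Publisher, Year.
# Real bibliographies will need a separate pattern per citation type.
BOOK_PATTERN = re.compile(
    r"^(?P<author>[^.]+)\.\s+"      # everything up to the first period
    r"(?P<title>[^.]+)\.\s+"        # everything up to the next period
    r"(?P<place>[^:]+):\s+"         # place of publication, ends at the colon
    r"(?P<publisher>[^,]+),\s+"     # publisher, ends at the comma
    r"(?P<year>\d{4})\.$"           # four-digit year and closing period
)


def book_citation_to_tsv(citation: str) -> Optional[str]:
    """Return the citation's elements joined by tabs, or None if it does not fit."""
    match = BOOK_PATTERN.match(citation.strip())
    if match is None:
        # Inconsistent formatting (or messy OCR) ends up here for manual review.
        return None
    return "\t".join(match.group("author", "title", "place", "publisher", "year"))


if __name__ == "__main__":
    sample = "Smith, Jane. A History of Printing. London: Example Press, 1999."
    print(book_citation_to_tsv(sample))
```

Lines that come back as None are the ones that do not follow the assumed format, which is where inconsistent formatting or OCR noise would need hand correction before pasting the tab-separated output into a spreadsheet.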

All the best,
Bonnie

On 6/18/15, 1:52 PM, "Matt Sherman" <[email protected]> wrote:

> The hope is to take these bibliographies put it into more of a web
> searchable/sortable format for researchers to make use out of them. My
> colleague was taking some inspiration from the Marlowe Bibliography
> (https://marlowebibliography.org/), though we are hoping to possibly get a
> bit more robust with the bibliography we are working on. The important
> first step it to be able to parse the existing OCRed bibliography scans we
> have into a database, possibly a custom XML format but a database will
> probably be easier to append and expand down the road.
>
> On Thu, Jun 18, 2015 at 1:11 PM, Kyle Banerjee <[email protected]> wrote:
>
>> How you want to preprocess and structure the data depends on what you hope
>> to achieve. Can you say more about what you want the end product to look
>> like?
>>
>> kyle
>>
>> On Thu, Jun 18, 2015 at 10:08 AM, Matt Sherman <[email protected]> wrote:
>>
>> > That is a pretty good summation of it yes. I appreciate the suggestions,
>> > this is a bit of a new realm for me and while I know what I want it to do
>> > and the structure I want to put it in, the conversion process has been
>> > eluding me so thanks for giving me some tools to look into.
>> >
>> > On Thu, Jun 18, 2015 at 1:04 PM, Eric Lease Morgan <[email protected]> wrote:
>> >
>> > > On Jun 18, 2015, at 12:02 PM, Matt Sherman <[email protected]> wrote:
>> > >
>> > > > I am working with colleague on a side project which involves some
>> > > > scanned bibliographies and making them more web
>> > > > searchable/sortable/browse-able. While I am quite familiar with the
>> > > > metadata and organization aspects we need, but I am at a bit of a
>> > > > loss on how to automate the process of putting the bibliography in a
>> > > > more structured format so that we can avoid going through hundreds
>> > > > of pages by hand. I am pretty sure regular expressions are needed,
>> > > > but I have not had an instance where I need to automate extracting
>> > > > data from one file type (PDF OCR or text extracted to Word doc) and
>> > > > place it into another (either a database or an XML file) with some
>> > > > enrichment. I would appreciate any suggestions for approaches or
>> > > > tools to look into. Thanks for any help/thoughts people can give.
>> > >
>> > > If I understand your question correctly, then you have two problems to
>> > > address: 1) converting PDF, Word, etc. files into plain text, and 2)
>> > > marking up the result (which is a bibliography) into structure data.
>> > > Correct?
>> > >
>> > > If so, then if your PDF documents have already been OCRed, or if you
>> > > have other files, then you can probably feed them to TIKA to quickly
>> > > and easily extract the underlying plain text. [1] I wrote a brain-dead
>> > > shell script to run TIKA in server mode and then convert Word (.docx)
>> > > files. [2]
>> > >
>> > > When it comes to marking up the result into structured data, well,
>> > > good luck. I think such an application is something Library Land
>> > > sought for a long time. "Can you say Holy Grail?"
>> > >
>> > > [1] Tika - https://tika.apache.org
>> > > [2] brain-dead script -
>> > > https://gist.github.com/ericleasemorgan/c4e34ffad96c0221f1ff
>> > >
>> > > --
>> > > Eric
