Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata and/or a Database

2015-06-23 Thread Fitchett, Deborah
For turning a bibliography into RIS format, I wrote a tool based on a whole pile of regex commands bundled into sed files wrapped in an AppleScript app: Webpage: http://deborahfitchett.com/toys/ref2ris/ Code4Lib article: http://journal.code4lib.org/articles/6286 Let me know if you've got
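A minimal sketch of the regex-to-RIS idea, in Python rather than sed and AppleScript. This is not the ref2ris code. The pattern, the function name `citation_to_ris`, and the assumed citation layout (`Author, A. (Year). Title. Journal, vol(issue), start-end.`) are all hypothetical and would need adapting to each bibliography's style.

```python
import re

# Hypothetical pattern for one simple journal citation of the form:
#   Author, A. (2010). Title of article. Journal Name, 12(3), 45-67.
CITATION = re.compile(
    r"^(?P<authors>.+?)\s+\((?P<year>\d{4})\)\.\s+"
    r"(?P<title>.+?)\.\s+"
    r"(?P<journal>[^,]+),\s+"
    r"(?P<volume>\d+)(?:\((?P<issue>\d+)\))?,\s+"
    r"(?P<start>\d+)\s*[-\u2013]\s*(?P<end>\d+)\.?$"
)


def citation_to_ris(line):
    """Return an RIS record for one citation, or None if it does not match."""
    m = CITATION.match(line.strip())
    if m is None:
        return None  # leave unmatched citations for manual review
    fields = [
        ("TY", "JOUR"),
        # Simplification: the whole author string goes into one AU field;
        # splitting multiple authors is left out of this sketch.
        ("AU", m["authors"]),
        ("PY", m["year"]),
        ("TI", m["title"]),
        ("JO", m["journal"]),
        ("VL", m["volume"]),
    ]
    if m["issue"]:
        fields.append(("IS", m["issue"]))
    fields += [("SP", m["start"]), ("EP", m["end"]), ("ER", "")]
    # RIS lines are: two-letter tag, two spaces, hyphen, space, value.
    return "\n".join(f"{tag}  - {value}" for tag, value in fields)
```

Citations that do not fit the pattern come back as None, so they can be collected and fixed by hand rather than silently mangled.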

Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata and/or a Database

2015-06-19 Thread Kevin Hawkins
See also http://wiki.tei-c.org/index.php/Heuristics , which discusses this problem more broadly conceived. I've just added a link to the archives of this very discussion. --Kevin On 6/18/15 12:52 PM, Matt Sherman wrote: The hope is to take these bibliographies and put them into more of a web

Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata and/or a Database

2015-06-19 Thread Sylvain Machefert
Hi all, As Matt's problem is related to parsing citations, I would definitely have a look at the tools cited by Cindy, because going with regexp will quickly become a nightmare. Even if the citations were created following a common reference style, there will necessarily be inconsistencies,

Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata and/or a Database

2015-06-18 Thread Matt Sherman
That is a pretty good summation of it, yes. I appreciate the suggestions; this is a bit of a new realm for me, and while I know what I want it to do and the structure I want to put it in, the conversion process has been eluding me, so thanks for giving me some tools to look into. On Thu, Jun 18,

Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata and/or a Database

2015-06-18 Thread Eric Lease Morgan
On Jun 18, 2015, at 12:02 PM, Matt Sherman matt.r.sher...@gmail.com wrote: I am working with a colleague on a side project which involves some scanned bibliographies and making them more web searchable/sortable/browse-able. While I am quite familiar with the metadata and organization aspects we

Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata and/or a Database

2015-06-18 Thread Kyle Banerjee
How you want to preprocess and structure the data depends on what you hope to achieve. Can you say more about what you want the end product to look like? kyle On Thu, Jun 18, 2015 at 10:08 AM, Matt Sherman matt.r.sher...@gmail.com wrote: That is a pretty good summation of it yes. I appreciate

[CODE4LIB] Desiring Advice for Converting OCR Text into Metadata and/or a Database

2015-06-18 Thread Matt Sherman
Hi Code4Libbers, I am working with a colleague on a side project which involves some scanned bibliographies and making them more web searchable/sortable/browse-able. While I am quite familiar with the metadata and organization aspects we need, I am at a bit of a loss on how to automate the

Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata and/or a Database

2015-06-18 Thread Matt Sherman
The hope is to take these bibliographies and put them into more of a web searchable/sortable format for researchers to make use of. My colleague was taking some inspiration from the Marlowe Bibliography ( https://marlowebibliography.org/), though we are hoping to possibly get a bit more robust

Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata and/or a Database

2015-06-18 Thread Gordon, Bonnie
We're actually also working on getting a bibliography from a Word Doc to a more structured format. We're using regular expressions in LibreOffice Writer to mark up the citations, then insert tabs between the elements, and then copy into a spreadsheet (similar to what's described in
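The tab-insertion step can be sketched in Python as well. This swaps LibreOffice Writer for a script, so it is only an analogue of the workflow described, not the poster's actual expressions. The pattern and the function name `tabify` are hypothetical and assume citations shaped like `Author. Title. Publisher, Year.`

```python
import re

# Hypothetical pattern: "Author. Title. Publisher, Year."
ELEMENTS = re.compile(r"^(.+?)\.\s+(.+?)\.\s+(.+?),\s+(\d{4})\.?$")


def tabify(citations):
    """Replace the separators between citation elements with tabs.

    Lines that do not match the pattern are returned unchanged, so they
    stand out in the spreadsheet as single-cell rows needing manual work.
    """
    return [ELEMENTS.sub(r"\1\t\2\t\3\t\4", c.strip()) for c in citations]


if __name__ == "__main__":
    sample = ["Austen, Jane. Pride and Prejudice. T. Egerton, 1813."]
    # Tab-separated output can be pasted straight into a spreadsheet.
    print("\n".join(tabify(sample)))
```

Keeping unmatched lines intact, rather than dropping them, makes it easy to see how much of a messy OCR export the pattern actually covers.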

Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata and/or a Database

2015-06-18 Thread Owen Stephens
It may depend on the format of the PDF, but I’ve used the Scraperwiki Python Module ‘pdf2xml’ function to extract text data from PDFs in the past. There is a write up (not by me) at http://schoolofdata.org/2013/08/16/scraping-pdfs-with-python-and-the-scraperwiki-module/
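Once a PDF has been converted to XML, the positioned text fragments still have to be reassembled into lines. A rough sketch of that step follows, using only the Python standard library. It assumes the XML resembles `pdftohtml -xml` output, with `<page>` elements containing `<text>` elements that carry integer `top` and `left` attributes; that layout and the function name `xml_to_lines` are assumptions, not details taken from the post or the linked write-up.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def xml_to_lines(xml_text):
    """Rebuild text lines from positioned <text> fragments.

    Assumes each <page> holds <text top="..." left="..."> elements;
    fragments sharing the same 'top' value are treated as one line.
    """
    root = ET.fromstring(xml_text)
    lines = []
    for page in root.iter("page"):
        rows = defaultdict(list)
        for el in page.iter("text"):
            # itertext() also picks up text nested in <b> or <i> children.
            text = "".join(el.itertext()).strip()
            if text:
                rows[int(el.get("top"))].append((int(el.get("left")), text))
        for top in sorted(rows):  # top-to-bottom
            # Within a line, order fragments left-to-right.
            lines.append(" ".join(t for _, t in sorted(rows[top])))
    return lines
```

Real OCR output rarely aligns fragments on exactly the same `top` value, so a tolerance of a few units when grouping would likely be needed in practice.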

Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata and/or a Database

2015-06-18 Thread Matt Sherman
Thanks, that is interesting since we can export from the PDFs, and while the OCR text is a little messy it is in decent shape. I'll certainly look into that. On Thu, Jun 18, 2015 at 3:13 PM, Gordon, Bonnie bgor...@rockarch.org wrote: We're actually also working on getting a bibliography from a