It may depend on the format of the PDF, but I've used the ScraperWiki Python
module's pdftoxml function to extract text data from PDFs in the past. There
is a write-up (not by me) at
http://schoolofdata.org/2013/08/16/scraping-pdfs-with-python-and-the-scraperwiki-module/
and an example of how I've used it at
https://github.com/ostephens/british_library_directory_of_library_codes/blob/master/scraper.py
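
For what it's worth, the basic pattern looks something like the sketch
below. This is a minimal illustration only: the filename is a placeholder,
and the final loop just prints each text fragment with its position so you
can see how the page layout maps onto the structure you want to extract.

import scraperwiki
import lxml.etree

# Read the raw bytes of a local PDF (placeholder filename)
with open("bibliography.pdf", "rb") as f:
    pdfdata = f.read()

# pdftoxml() converts the PDF into an XML string in which each fragment
# of text becomes a <text> element carrying top/left/width/height
# attributes describing where it sits on the page
xml = scraperwiki.pdftoxml(pdfdata)
root = lxml.etree.fromstring(xml)

# Walk the pages and dump each fragment with its coordinates
for page in root.findall("page"):
    for el in page.findall("text"):
        print("%s,%s: %s" % (el.get("top"), el.get("left"),
                             el.xpath("string()")))

Once you know which positions correspond to which parts of an entry, you
can filter on those attributes and build structured records (rows for a
database, or elements for an XML file) instead of printing.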

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: [email protected]
Telephone: 0121 288 6936

> On 18 Jun 2015, at 17:02, Matt Sherman <[email protected]> wrote:
> 
> Hi Code4Libbers,
> 
> I am working with colleague on a side project which involves some scanned
> bibliographies and making them more web searchable/sortable/browse-able.
> While I am quite familiar with the metadata and organization aspects we
> need, I am at a bit of a loss on how to automate the process of putting
> the bibliography in a more structured format so that we can avoid going
> through hundreds of pages by hand.  I am pretty sure regular expressions
> are needed, but I have not had an instance where I needed to automate
> extracting data from one file type (an OCRed PDF or text extracted to a
> Word doc) and placing it into another (either a database or an XML file)
> enrichment.  I would appreciate any suggestions for approaches or tools to
> look into.  Thanks for any help/thoughts people can give.
> 
> Matt Sherman
