[CODE4LIB] indexing pdf files

Eric Lease Morgan Tue, 15 Sep 2009 06:33:03 -0700

I have been having fun recently indexing PDF files.

For the past six months or so I have been keeping the articles I've read in a pile, and I was rather amazed at the size of the pile. It was about a foot tall. When I read these articles I "actively" read them -- meaning, I write, scribble, highlight, and annotate the text with my own special notation denoting names, keywords, definitions, citations, quotations, list items, examples, etc. This active reading process: 1) makes for better comprehension on my part, and 2) makes the articles easier to review and pick out the ideas I thought were salient. Being the librarian I am, I thought it might be cool ("kewl") to make the articles into a collection. Thus, the beginnings of Highlights & Annotations: A Value-Added Reading List.

The techno-weenie process for creating and maintaining the content is something this community might find interesting:


 1. Print article and read it actively.

 2. Convert the printed article into a PDF
    file -- complete with embedded OCR --
    with my handy-dandy ScanSnap scanner. [1]

 3. Use MyLibrary to create metadata (author,
    title, date published, date read, note,
    keywords, facet/term combinations, local
    and remote URLs, etc.) describing the
    article. [2]

 4. Save the PDF to my file system.

 5. Use pdftotext to extract the OCRed text
    from the PDF and index it along with
    the MyLibrary metadata using Solr. [3, 4]

 6. Provide a searchable/browsable user
    interface to the collection through a
    mod_perl module. [5, 6]
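
For the curious, step 5 might be sketched roughly as follows. This is only a minimal illustration in Python, not the Perl code behind the actual system. The field names (id, title, keyword, fulltext), the file name, and the Solr update URL are all assumptions made for the example and would need to match the local schema and installation. The sketch calls pdftotext with "-" as the output file so the text arrives on standard output, replaces the form feeds pdftotext puts between pages (they are not legal in XML), and wraps everything in a Solr <add> message.

```python
import re
import subprocess
import urllib.request
import xml.etree.ElementTree as ET

# Control characters that are not allowed in XML 1.0, including the
# form feed (\x0c) that pdftotext emits between pages.
_XML_INVALID = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def pdftotext_command(pdf_path):
    # "-" as the output file asks pdftotext to write the text to stdout.
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path):
    # Requires the pdftotext binary to be installed and on the PATH.
    result = subprocess.run(pdftotext_command(pdf_path),
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def clean_text(text):
    # Replace characters an XML parser would reject with spaces.
    return _XML_INVALID.sub(" ", text)


def build_add_xml(metadata, fulltext):
    # Build a Solr <add><doc>...</doc></add> update message.
    # Field names are hypothetical; they must match the Solr schema in use.
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    fields = dict(metadata)
    fields["fulltext"] = clean_text(fulltext)
    for name, value in fields.items():
        # A list or tuple becomes one <field> per value (multi-valued field).
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(item)
    return ET.tostring(add, encoding="unicode")


def post_to_solr(xml_body, update_url="http://localhost:8983/solr/update"):
    # update_url is an assumption; adjust it to the local Solr installation.
    request = urllib.request.Request(
        update_url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

With pdftotext installed and Solr running, something like post_to_solr(build_add_xml({"id": "article-1", "title": "An example"}, extract_text("article-1.pdf"))) would send one article to the index. Posting "<commit/>" to the same URL afterwards makes the new document searchable.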

Software is never done, and if it were then it would be called hardware. Accordingly, I know there are some things I need to do before I can truly deem the system version 1.0. At the same time my excitement is overflowing and I thought I'd share some geekdom with my fellow hackers. Fun with PDF files and open source software.



[1] ScanSnap - http://tinyurl.com/oafgwe
[2] MyLibrary screen dump - http://infomotions.com/tmp/mylibrary.png
[3] pdftotext - http://www.foolabs.com/xpdf/
[4] Solr - http://lucene.apache.org/solr/
[5] module source code - http://infomotions.com/highlights/Highlights.pl
[6] user interface - http://infomotions.com/highlights/highlights.cgi

--
Eric Lease Morgan
University of Notre Dame




--
Eric Lease Morgan
Head, Digital Access and Information Architecture Department
Hesburgh Libraries, University of Notre Dame

(574) 631-8604
