Hi Erica,

We are working on a similar project for our School of Music, converting concert programs from the past 20 years. Though we use simple OCR for the PDFs (to support full-text searching), we are selectively cleaning up the OCR output for metadata purposes: that is, taking the first page of each PDF, extracting the text, and converting that text into titles and dates. We use simple regular expressions to remove line breaks and extra white space.
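In case it's useful, the cleanup itself only takes a few lines. Here is a rough Python sketch of the idea (the file name and the date pattern are placeholders for illustration, not our production code):

import re

def clean_ocr_text(raw):
    """Collapse OCR line breaks and runs of white space into single spaces."""
    text = raw.replace("\r\n", "\n")
    text = re.sub(r"-\n(?=\w)", "", text)   # rejoin words hyphenated across lines
    text = re.sub(r"\s*\n\s*", " ", text)   # drop line breaks
    text = re.sub(r"[ \t]{2,}", " ", text)  # squeeze extra spaces and tabs
    return text.strip()

# Pull a date out of the cleaned first-page text; the pattern is illustrative only.
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b")

with open("first_page.txt", encoding="utf-8") as fh:   # placeholder file name
    cleaned = clean_ocr_text(fh.read())
match = DATE_RE.search(cleaned)
print("date:", match.group(0) if match else "not found")

The same regular expressions can be run over a whole directory of text files with a short loop, so no one has to open the files by hand.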

Here are our working guidelines: http://bit.ly/1v0c7w2. Perhaps something in there could be of help to you.

Best of luck with your project!

Kind regards,
Monica

Quoting Kevin Hawkins <kevin.s.hawk...@ultraslavonic.info>:

It sounds like there are two sorts of things you need to clean up:

a) OCR errors

b) Formatting (like unnecessary line breaks)

For the former, I understand that Adobe Acrobat and ABBYY FineReader have built-in spellchecking tools. PrimeOCR, an expensive OCR package, has a related package called PrimeVerify that does this.

If you don't have any of these, you could simply open the OCR output in a text editor with spellchecking and look for things to fix. You could even copy and paste into Microsoft Word and use its spellchecker; you'd probably need to correct the source file in parallel as you scan it in Word.
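If you'd rather get a list of suspect words in bulk instead of eyeballing each file, a short script can do the flagging for you. Here's a rough sketch in Python against a system word list (the word-list path and the file name are just examples; the path varies by system, and this isn't something I've run against OCR at scale):

import re

# Load a system word list; /usr/share/dict/words is common on Linux and macOS.
with open("/usr/share/dict/words", encoding="utf-8") as fh:
    DICTIONARY = {line.strip().lower() for line in fh}

def suspect_words(path):
    """Return the words in an OCR text file that aren't in the dictionary."""
    with open(path, encoding="utf-8") as fh:
        words = re.findall(r"[A-Za-z']+", fh.read())
    return sorted({w for w in words if w.lower() not in DICTIONARY})

for word in suspect_words("program_ocr.txt"):  # placeholder file name
    print(word)

Expect performer and composer names to show up constantly, so I'd treat the output as a review queue rather than auto-correcting anything.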

As for formatting, that one is harder. But before trying to solve it, I'd ask whether it's worth doing at all. If you're only using the OCR to drive search of the scanned page images, why does it matter if there are some unnecessary line breaks in your OCR text?

Kevin

On 11/22/14 12:44 PM, scott bacon wrote:
Erica,

You may find what you need from OpenRefine: http://openrefine.org/



On Fri, Nov 21, 2014 at 5:15 PM, Erica FINDLEY <eri...@multco.us> wrote:

Greetings,

I am working on a project to digitize concert programs. These are the type
of programs you get when attending a musical concert that list performers
and details about the concert.

Since these items are text heavy we have decided to use OCR software to
output a text file that will enable full text searching in our platform.

These text files are for the most part accurate, but often have unnecessary
line breaks and pockets of extra characters and/or incorrect
capitalization. I would like to pretty them up a little bit if possible.

I am wondering if there is a script I can use on multiple files to clean
these types of things up. I don't want to have the digitization staff
manually edit each text file or have to open each one to run a macro in a
text editor.

I have been searching online and so far haven't found anything that will
work for my situation.

Thanks in advance,

*Erica Findley*
Cataloging/Metadata Librarian
Multnomah County Library
Phone: 503.988.5466
eri...@multco.us
www.multcolib.org



Digital Curation Coordinator
Digital Scholarship Services
Fondren Library, Rice University
