Hi Erica,

We are working on a similar project for our School of Music, converting concert programs from the past 20 years. Though we use simple OCR for the PDFs (to support full-text searching), we are selectively cleaning up the OCR output for metadata purposes: that is, taking the first page of each PDF, extracting the text, and converting that text into titles and dates. We use simple regular expressions to remove line breaks and extra white space.
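In case it's useful, the cleanup itself only takes a few lines. Here is a rough Python sketch of the idea (the file name and the date pattern are placeholders for illustration, not our production code):

import re

def clean_ocr_text(raw):
    """Collapse OCR line breaks and runs of white space into single spaces."""
    text = raw.replace("\r\n", "\n")
    text = re.sub(r"-\n(?=\w)", "", text)   # rejoin words hyphenated across lines
    text = re.sub(r"\s*\n\s*", " ", text)   # drop line breaks
    text = re.sub(r"[ \t]{2,}", " ", text)  # squeeze extra spaces and tabs
    return text.strip()

# Pull a date out of the cleaned first-page text; the pattern is illustrative only.
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b")

with open("first_page.txt", encoding="utf-8") as fh:   # placeholder file name
    cleaned = clean_ocr_text(fh.read())
match = DATE_RE.search(cleaned)
print("date:", match.group(0) if match else "not found")

The same regular expressions can be run over a whole directory of text files with a short loop, so no one has to open the files by hand.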

Here are our working guidelines: http://bit.ly/1v0c7w2. Perhaps something in there could be of help to you.

Best of luck with your project!

Kind regards,
Monica

Quoting Kevin Hawkins <kevin.s.hawk...@ultraslavonic.info>:

It sounds like there are two sorts of things you need to clean up:

a) OCR errors

b) Formatting (like unnecessary line breaks)

For the former, I understand that Adobe Acrobat and ABBYY FineReader have built-in spellchecking tools. PrimeOCR, an expensive OCR package, has a related package called PrimeVerify that does this.

If you don't have any of these, you could simply open the OCR output in a text editor with spellchecking and look for things to fix. You could even copy and paste into Microsoft Word and use its spellchecker; you'd probably need to correct the source file in parallel as you scan it in Word.
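If you'd rather get a list of suspect words in bulk instead of eyeballing each file, a short script can do the flagging for you. Here's a rough sketch in Python against a system word list (the word-list path and the file name are just examples; the path varies by system, and this isn't something I've run against OCR at scale):

import re

# Load a system word list; /usr/share/dict/words is common on Linux and macOS.
with open("/usr/share/dict/words", encoding="utf-8") as fh:
    DICTIONARY = {line.strip().lower() for line in fh}

def suspect_words(path):
    """Return the words in an OCR text file that aren't in the dictionary."""
    with open(path, encoding="utf-8") as fh:
        words = re.findall(r"[A-Za-z']+", fh.read())
    return sorted({w for w in words if w.lower() not in DICTIONARY})

for word in suspect_words("program_ocr.txt"):  # placeholder file name
    print(word)

Expect performer and composer names to show up constantly, so I'd treat the output as a review queue rather than auto-correcting anything.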

As for formatting, that one is harder. But before trying to solve it, I'd ask whether it's worth doing at all. If you're only using the OCR to drive search of the scanned page images, why does it matter if there are some unnecessary line breaks in your OCR text?

Kevin

On 11/22/14 12:44 PM, scott bacon wrote:
Erica,

You may find what you need from OpenRefine: http://openrefine.org/



On Fri, Nov 21, 2014 at 5:15 PM, Erica FINDLEY <eri...@multco.us> wrote:

Greetings,

I am working on a project to digitize concert programs. These are the type
of programs you get when attending a musical concert that list performers
and details about the concert.

Since these items are text heavy we have decided to use OCR software to
output a text file that will enable full text searching in our platform.

These text files are for the most part accurate, but often have unnecessary
line breaks and pockets of extra characters and/or incorrect
capitalization. I would like to pretty them up a little bit if possible.

I am wondering if there is a script I can use on multiple files to clean
these types of things up. I don't want to have the digitization staff
manually edit each text file or have to open each one to run a macro in a
text editor.

I have been searching online and so far haven't found anything that will
work for my situation.

Thanks in advance,

*Erica Findley*
Cataloging/Metadata Librarian
Multnomah County Library
Phone: 503.988.5466
eri...@multco.us
www.multcolib.org



Digital Curation Coordinator
Digital Scholarship Services
Fondren Library, Rice University
