Matt,

A Word document does funny things to the text since it is not plain text (try 
opening a .doc in a plain text editor and you will see the extra formatting 
data). I would try and get the plain ASCII text instead, and then install 
Cygwin, which contains sed and a bunch of other useful Unix/Linux commands.
See http://stackoverflow.com/a/127567/2896744 for more info.
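
If installing Cygwin is not an option, the same cleanup can be sketched in 
Python, which runs natively on Windows. This is only a minimal sketch, not 
something tested against your file: it assumes the bibliography entries are 
separated by blank lines in the plain text, and the function name 
unwrap_ocr_text is made up for illustration.

```python
import re


def unwrap_ocr_text(text: str) -> str:
    """Join hard-wrapped OCR lines so each block ends up on one line.

    Assumption (not from the thread): blocks such as bibliography
    entries are separated by one or more blank lines.
    """
    # Normalize Windows (CRLF) and bare CR line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Split on runs of blank lines; each piece is one block.
    blocks = re.split(r"\n\s*\n", text.strip())
    # Inside a block, collapse every run of whitespace (including the
    # OCR line breaks) into a single space.
    return "\n".join(" ".join(block.split()) for block in blocks)
```

To use it, read the plain text file into a string, pass it to the function, 
and write the result back out. Words hyphenated across a line break are not 
rejoined here; that would need an extra step.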
________________________________________
From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Matt Sherman 
[matt.r.sher...@gmail.com]
Sent: Tuesday, August 04, 2015 9:09 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text

I am on Windows machines, so I don't have quite the easy access to
that useful command.  Someone had earlier put the OCR in a doc file so
I've been playing with that more than with the raw PDF OCR.

On Tue, Aug 4, 2015 at 8:19 AM, Scancella, John <j...@loc.gov> wrote:
> Matt,
>
> There are probably a dozen ways to do this, but it would be really helpful to 
> know what operating system you are on. For example, if you are using Linux, 
> you can run it through sed using
>   cat <OCR_FILE> | sed ':a;N;$!ba;s/\n/ /g' > <STRIPPED_OCR_FILE>
> see http://stackoverflow.com/a/800644/2896744 for more info
> ________________________________________
> From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Matt Sherman 
> [matt.r.sher...@gmail.com]
> Sent: Monday, August 03, 2015 10:29 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text
>
> Hi Code4Lib folks,
>
> I was wondering if anyone had some experience cleaning up OCR text.
> Particularly I am trying to figure out how I can deal with the random
> line breaks that come from OCR.  I am trying to parse out a
> bibliography with regex.  I think I've figured out which queries I
> need to run to break it up so I can make it into a tab delimited text
> file but I noticed that the text does the classic thing of OCR
> inserting line breaks where they physically are on the page.  This
> will obviously be a bit of an issue since it would break the
> annotation into a bunch of lines rather than leaving it one block so I
> can manipulate it into a database.  So I am wondering if anyone who
> has worked with OCR text before has a suggested way to clean up those
> line breaks without doing 300+ pages by hand?  Any thoughts would be
> welcome.
>
> Matt Sherman
