Re: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text

Matt Sherman Tue, 04 Aug 2015 06:40:29 -0700

Hm, after a little looking on someone's suggestion, it turns out I was
wrong: they are not line breaks, they are paragraph marks.
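
For anyone hitting the same problem, here is a minimal Python sketch of one way to undo hard line wraps once the text has been saved as plain text. It is not something proposed in this thread. It assumes that each bibliography entry is separated from the next by at least one blank line, and that a line ending in a hyphen is a word split across lines. Neither assumption may hold for a given OCR export, and the function name `unwrap_ocr_text` is made up for illustration.

```python
import re


def unwrap_ocr_text(text: str) -> str:
    """Join hard-wrapped lines so each blank-line-separated block becomes one line."""
    # Normalize Windows (CRLF) and old Mac (CR) line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Assumption: one or more blank lines mark the boundary between entries.
    blocks = re.split(r"\n\s*\n", text)

    unwrapped = []
    for block in blocks:
        lines = [line.strip() for line in block.split("\n") if line.strip()]
        if not lines:
            continue
        joined = lines[0]
        for line in lines[1:]:
            if joined.endswith("-"):
                # Assumption: a trailing hyphen is a word split across lines.
                # This also merges genuine compounds such as "well-" / "known".
                joined = joined[:-1] + line
            else:
                joined += " " + line
        unwrapped.append(joined)

    # One entry per output line, ready for further regex processing.
    return "\n".join(unwrapped)


if __name__ == "__main__":
    sample = (
        "Smith, J. A study of cata-\n"
        "loging practice in\n"
        "libraries.\n"
        "\n"
        "Jones, K. Another\r\n"
        "entry."
    )
    print(unwrap_ocr_text(sample))
```

If the entries are not separated by blank lines, the split pattern would have to key on something else, for example a line that starts with an author name or an entry number.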


On Tue, Aug 4, 2015 at 9:21 AM, Scancella, John <j...@loc.gov> wrote:
> Matt,
>
> A word document does funny things to the text since it is actually html (try 
> opening a .doc in a plain text editor and you will see it is html). I would 
> try and get the plain ASCII text instead, and then install Cygwin which 
> contains sed and a bunch of other useful Unix/Linux commands.
> see http://stackoverflow.com/a/127567/2896744 for more info.
> ________________________________________
> From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Matt Sherman 
> [matt.r.sher...@gmail.com]
> Sent: Tuesday, August 04, 2015 9:09 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text
>
> I am on Windows machines, so I don't have quite the easy access to
> that useful command.  Someone had earlier put the OCR in a doc file so
> I've been playing with that more than with the raw PDF OCR.
>
> On Tue, Aug 4, 2015 at 8:19 AM, Scancella, John <j...@loc.gov> wrote:
>> Matt,
>>
>> There are probably a dozen ways to do this, but it would be really helpful 
>> to know what operating system you are on. For example, if you are using
>> Linux, you can run it through sed using
>>   cat <OCR_FILE> | sed ':a;N;$!ba;s/\n/ /g' >> <STRIPPED_OCR_FILE>
>> see http://stackoverflow.com/a/800644/2896744 for more info
>> ________________________________________
>> From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Matt 
>> Sherman [matt.r.sher...@gmail.com]
>> Sent: Monday, August 03, 2015 10:29 PM
>> To: CODE4LIB@LISTSERV.ND.EDU
>> Subject: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text
>>
>> Hi Code4Lib folks,
>>
>> I was wondering if anyone had some experience cleaning up OCR text.
>> Particularly I am trying to figure out how I can deal with the random
>> line breaks that come from OCR.  I am trying to parse out a
>> bibliography with regex.  I think I've figured out which queries I
>> need to run to break it up so I can make it into a tab delimited text
>> file but I noticed that the text does the classic thing of OCR
>> inserting line breaks where they physically are on the page.  This
>> will obviously be a bit of an issue since it would break the
>> annotation into a bunch of lines rather than leaving it one block so I
>> can manipulate it into a database.  So I am wondering if anyone who
>> has worked with OCR text before has a suggested way to clean up those
>> line breaks without doing 300 + pages by hand?  Any thoughts would be
>> welcome.
>>
>> Matt Sherman
