Hm, after doing a little looking on someone's suggestion, it turns out I was wrong: they are not line breaks, they are paragraph marks.
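
For anyone following along, here is a minimal sketch of one way to undo the breaks in Python once the text has been saved out of Word as plain text, where each paragraph mark should come through as an ordinary newline. It has not been run against the actual file. It assumes that each bibliography entry is separated from the next by at least one blank line, which may not hold for this OCR, and the function name unwrap_paragraphs is only illustrative.

```python
import re


def unwrap_paragraphs(text):
    """Return a list with one string per entry, hard line breaks removed.

    Assumption (not confirmed for this OCR): entries are separated
    from each other by at least one blank line.
    """
    # Normalize Windows (\r\n) and old Mac (\r) line endings to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Split wherever a newline is followed by one or more blank
    # (or whitespace-only) lines.
    blocks = re.split(r"\n(?:[ \t]*\n)+", text.strip())
    # Inside each block, collapse every run of whitespace, including
    # the leftover newlines, into a single space.
    entries = [" ".join(block.split()) for block in blocks]
    # Drop anything that ended up empty.
    return [entry for entry in entries if entry]
```

To use it, read the whole file into a string, pass it to the function, and write the entries back out one per line. If the entries are not separated by blank lines, the split would need some other signal instead, such as a line that starts with an author name or a year pattern.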

On Tue, Aug 4, 2015 at 9:21 AM, Scancella, John <j...@loc.gov> wrote:
> Matt,
>
> A word document does funny things to the text since it is actually html (try
> opening a .doc in a plain text editor and you will see it is html). I would
> try and get the plain ASCII text instead, and then install Cygwin which
> contains Sed and a bunch of other useful Unix/Linux commands.
> see http://stackoverflow.com/a/127567/2896744 for more info.
> ________________________________________
> From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Matt Sherman
> [matt.r.sher...@gmail.com]
> Sent: Tuesday, August 04, 2015 9:09 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text
>
> I am on Windows machines, so I don't have quite the easy access to
> that useful command. Someone had earlier put the OCR in a doc file so
> I've been playing with that more than with the raw PDF OCR.
>
> On Tue, Aug 4, 2015 at 8:19 AM, Scancella, John <j...@loc.gov> wrote:
>> Matt,
>>
>> There are probably a dozen ways to do this, but it would be really helpful
>> to know what operating system you are on? For example, if you are using
>> Linux, you can run it through sed using
>> cat <OCR_FILE> | sed 's/\n//' >> <STRIPPED_OCR_FILE>
>> see http://stackoverflow.com/a/800644/2896744 for more info
>> ________________________________________
>> From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Matt
>> Sherman [matt.r.sher...@gmail.com]
>> Sent: Monday, August 03, 2015 10:29 PM
>> To: CODE4LIB@LISTSERV.ND.EDU
>> Subject: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text
>>
>> Hi Code4Lib folks,
>>
>> I was wondering if anyone had some experience cleaning up OCR text.
>> Particularly I am trying to figure out how I can deal with the random
>> line breaks that come from OCR. I am trying to parse out a
>> bibliography with regex. I think I've figured out which queries I
>> need to run to break it up so I can make it into a tab delimited text
>> file but I noticed that the text does the classic thing of OCR
>> inserting line breaks where they physically are on the page. This
>> will obviously be a bit of an issue since it would break the
>> annotation into a bunch of lines rather than leaving it one block so I
>> can manipulate it into a database. So I am wondering if anyone who
>> has worked with OCR text before has a suggested way to clean up those
>> line breaks without doing 300 + pages by hand? Any thoughts would be
>> welcome.
>>
>> Matt Sherman