Matt,

A Word document does funny things to the text because it is not plain text (try opening a .doc in a plain text editor and you will see formatting data mixed in with the words). I would try to get the plain ASCII text instead, and then install Cygwin, which includes sed and a bunch of other useful Unix/Linux commands. See http://stackoverflow.com/a/127567/2896744 for more info.
________________________________________
From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Matt Sherman [matt.r.sher...@gmail.com]
Sent: Tuesday, August 04, 2015 9:09 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text

I am on Windows machines, so I don't have easy access to that useful command. Someone had earlier put the OCR in a doc file, so I've been playing with that more than with the raw PDF OCR.

On Tue, Aug 4, 2015 at 8:19 AM, Scancella, John <j...@loc.gov> wrote:
> Matt,
>
> There are probably a dozen ways to do this, but it would be really helpful to
> know what operating system you are on. For example, if you are using Linux,
> you can run it through sed using
>     cat <OCR_FILE> | sed ':a;N;$!ba;s/\n/ /g' >> <STRIPPED_OCR_FILE>
> See http://stackoverflow.com/a/800644/2896744 for more info.
> ________________________________________
> From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Matt Sherman
> [matt.r.sher...@gmail.com]
> Sent: Monday, August 03, 2015 10:29 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text
>
> Hi Code4Lib folks,
>
> I was wondering if anyone had some experience cleaning up OCR text.
> In particular, I am trying to figure out how I can deal with the random
> line breaks that come from OCR. I am trying to parse out a
> bibliography with regex. I think I've figured out which queries I
> need to run to break it up so I can make it into a tab-delimited text
> file, but I noticed that the text does the classic thing of OCR
> inserting line breaks where they physically are on the page. This
> will obviously be a bit of an issue, since it would break the
> annotation into a bunch of lines rather than leaving it as one block so I
> can manipulate it into a database. So I am wondering if anyone who
> has worked with OCR text before has a suggested way to clean up those
> line breaks without doing 300+ pages by hand? Any thoughts would be
> welcome.
>
> Matt Sherman
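
For anyone without sed on Windows, the same line-break cleanup can be sketched in Python. This is only a minimal sketch, not something proposed in the thread. It assumes that entries in the OCR output are separated by at least one blank line, and that a hyphen at the end of a line marks a word split across lines. Neither assumption is confirmed above. The function name `unwrap_ocr_lines`, the `dehyphenate` flag, and the sample text are all hypothetical.

```python
import re


def unwrap_ocr_lines(text: str, dehyphenate: bool = True) -> str:
    """Join hard-wrapped OCR lines so each blank-line-separated block becomes one line.

    Assumption (not from the thread): blank lines separate bibliography entries.
    """
    # Normalize Windows (\r\n) and old Mac (\r) line endings to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Split into blocks wherever one or more blank or whitespace-only lines occur.
    blocks = re.split(r"\n\s*\n", text)

    joined = []
    for block in blocks:
        if dehyphenate:
            # Heuristic: rejoin words broken by a hyphen at a line end,
            # e.g. "biblio-\ngraphy" -> "bibliography". This will also fuse
            # genuinely hyphenated words that happen to wrap at the hyphen.
            block = re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", block)
        # Collapse the remaining line breaks and whitespace runs to single spaces.
        block = " ".join(block.split())
        if block:
            joined.append(block)

    # One cleaned entry per output line.
    return "\n".join(joined)


if __name__ == "__main__":
    # Hypothetical sample standing in for a page of OCR output.
    sample = (
        "Smith, J. A study of biblio-\n"
        "graphy and\r\n"
        "records.\n"
        "\n"
        "Jones, K. Another\n"
        "entry.\n"
    )
    print(unwrap_ocr_lines(sample))
```

With one entry per line, the regex passes that split each entry into tab-delimited fields can then run line by line. If the OCR text has no blank lines between entries, the split pattern would need to be replaced with whatever marks the start of an entry in that bibliography.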