There is no general way of dealing with such transformations, since the output depends very much on the types of documents you're recognizing and how you want to use the results of OCR.
OCRopus gives you the full bounding box information for characters and lines, so you can write a little script that discovers and uses the indentation. I'd recommend implementing something like that as a post-processing step on the HTML/hOCR output. You can read the HTML/hOCR with an HTML parser (e.g., Python's BeautifulSoup), get the bounding box information, and modify the DOM tree any way you like based on that.

Tom Bailey wrote:
> I would like to do the following.
>
> I have a png with a text column as follows.
>
> consectetuer adipiscing elit, sed diem
>     nonummy nibh euismod tincidunt ut
>     lacreet dolore magna aliguam erat
>     volutpat. Utwisis enim ad minim
> veniam,quis nostrud exerci tution
>     nibh euismod tincidunt ut lacreet
> dolore magna eniam,quis nostrud
>     exerci tution nibh euismod
>     tincidunt ut lacreet dolore
>
> As you can see 6 lines are indented and 3 lines are flush left
> justified. I would like to distinguish between the flush words/lines
> and indented words/lines, by annotating the first word on flush lines
> with <title></title>. For Example.
>
> <title>consectetuer</title>adipiscing elit, sed diem
>     nonummy nibh euismod tincidunt ut
>     lacreet dolore magna aliguam erat
>     volutpat. Utwisis enim ad minim
> <title>veniam</title>,quis nostrud exerci tution
>     nibh euismod tincidunt ut lacreet
> <title>dolore</title> magna eniam,quis nostrud
>     exerci tution nibh euismod
>     tincidunt ut lacreet dolore
>
> Thanks for any help,
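
As a rough illustration of the post-processing step recommended in the reply above, here is a minimal sketch. It assumes the hOCR output marks each text line as a span with class "ocr_line" and a title attribute of the form "bbox x0 y0 x1 y1", and that a line counts as flush if its left edge is within a few pixels of the leftmost line. The names LineCollector, mark_flush_lines and the tolerance value are made up for this example and are not part of OCRopus. It uses Python's standard html.parser so that it runs without extra packages; the same logic should carry over to BeautifulSoup if you prefer to edit the DOM tree directly.

```python
import re
from html.parser import HTMLParser

# Assumed hOCR convention: title="bbox x0 y0 x1 y1" on each line element.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class LineCollector(HTMLParser):
    """Collect [x0, text] for every <span class="ocr_line"> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.lines = []   # one [x0, text] entry per recognized line
        self._depth = 0   # > 0 while inside an ocr_line span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        if self._depth:
            # Nested span (e.g. a word or character) inside the current line.
            self._depth += 1
        elif "ocr_line" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self.lines.append([int(match.group(1)), ""])
                self._depth = 1

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.lines[-1][1] += data


def mark_flush_lines(hocr, tolerance=10):
    """Return the line texts, wrapping the first word of flush-left lines in <title>.

    `tolerance` (in pixels) is an assumed threshold; tune it for your scans.
    """
    parser = LineCollector()
    parser.feed(hocr)
    if not parser.lines:
        return []
    left_margin = min(x0 for x0, _ in parser.lines)
    result = []
    for x0, text in parser.lines:
        text = text.strip()
        if x0 - left_margin <= tolerance:
            # Wrap only the first run of word characters, leaving punctuation outside.
            text = re.sub(r"\w+", lambda m: "<title>%s</title>" % m.group(0),
                          text, count=1)
        result.append(text)
    return result


if __name__ == "__main__":
    sample = (
        "<span class='ocr_line' title='bbox 10 10 400 30'>veniam,quis nostrud</span>\n"
        "<span class='ocr_line' title='bbox 40 35 400 55'>nibh euismod tincidunt</span>\n"
    )
    for line in mark_flush_lines(sample):
        print(line)
```

Comparing each line's left edge with the smallest left edge on the page is only one possible heuristic. For multi-column or skewed pages you would probably need to group lines by column first, or estimate the margin per text block.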
