There is no general way of dealing with such transformations, since the 
output depends very much on the types of documents you're recognizing 
and how you want to use the results of OCR.

OCRopus gives you the full bounding box information for characters and 
lines, so you can write a little script that discovers and uses the 
indentation.

I'd recommend implementing something like that as a post-processing step 
on the HTML/hOCR output.  You can read the HTML/hOCR with an HTML parser 
(e.g., Python's BeautifulSoup), get the bounding box information, and 
modify the DOM tree any way you like based on that.
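
A minimal sketch of such a post-processing step, under a few assumptions: the hOCR marks each text line as an element with class "ocr_line" and a title attribute of the form "bbox x0 y0 x1 y1", the leftmost line on the page defines the margin, and anything within a small pixel tolerance of it counts as flush. To stay dependency-free it uses Python's standard html.parser rather than BeautifulSoup, and it emits plain annotated text instead of editing a DOM tree. The names LineCollector, tag_flush_lines, and tolerance are made up for this illustration and are not part of OCRopus.

```python
import re
from html.parser import HTMLParser

# hOCR stores the box in the title attribute as "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class LineCollector(HTMLParser):
    """Collect [left_x, text] for every element with class 'ocr_line'."""

    def __init__(self):
        super().__init__()
        self.lines = []   # one [left_x, text] entry per recognized line
        self._depth = 0   # > 0 while we are inside an ocr_line element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._depth:
            # Nested element (e.g. a word span); assumes it is closed later.
            self._depth += 1
        elif "ocr_line" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self.lines.append([int(match.group(1)), ""])
                self._depth = 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.lines[-1][1] += data


def tag_flush_lines(hocr, tolerance=10):
    """Return the text lines, wrapping the first word of flush lines in <title>."""
    parser = LineCollector()
    parser.feed(hocr)
    parser.close()
    if not parser.lines:
        return []
    # Assumption: the leftmost line start on the page is the flush margin.
    margin = min(x for x, _ in parser.lines)
    result = []
    for x, text in parser.lines:
        text = text.strip()
        if x - margin <= tolerance:
            first = re.match(r"\w+", text)
            if first:
                word = first.group(0)
                text = "<title>" + word + "</title>" + text[first.end():]
        else:
            text = "  " + text   # keep a visible indent for indented lines
        result.append(text)
    return result


# Hand-written sample input, not real OCRopus output.
SAMPLE = """
<div class='ocr_page'>
<span class='ocr_line' title='bbox 100 50 900 90'>consectetuer adipiscing elit, sed diem</span>
<span class='ocr_line' title='bbox 140 100 880 140'>nonummy nibh euismod tincidunt ut</span>
<span class='ocr_line' title='bbox 103 150 890 190'>veniam,quis nostrud exerci tution</span>
</div>
"""

for line in tag_flush_lines(SAMPLE):
    print(line)
```

The tolerance absorbs small skew or jitter in the detected line starts; on real scans it probably needs tuning, and a page with several columns would need the margin computed per column.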

Tom

bailey wrote:
> I would like to do the following.
>
> I have a png with a text column as follows.
>
> consectetuer adipiscing elit, sed diem
>   nonummy nibh euismod tincidunt ut
>   lacreet dolore magna aliguam erat
>   volutpat. Utwisis enim ad minim
> veniam,quis nostrud exerci tution
>   nibh euismod tincidunt ut lacreet
> dolore magna eniam,quis nostrud
>   exerci tution nibh euismod
>   tincidunt ut lacreet dolore
>
> As you can see, 6 lines are indented and 3 lines are flush left.
> I would like to distinguish the flush lines from the indented lines
> by wrapping the first word of each flush line in <title></title>.
> For example:
>
> <title>consectetuer</title> adipiscing elit, sed diem
>   nonummy nibh euismod tincidunt ut
>   lacreet dolore magna aliguam erat
>   volutpat. Utwisis enim ad minim
> <title>veniam</title>,quis nostrud exerci tution
>   nibh euismod tincidunt ut lacreet
> <title>dolore</title> magna eniam,quis nostrud
>   exerci tution nibh euismod
>   tincidunt ut lacreet dolore
>
> Thanks for any help,

