You could also use the paragraph finder in UpLib, in
python/uplib/paragraphs.py.  It incorporates a number of rules,
including the ones that Tom mentions below.

http://uplib.parc.com/
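
For anyone who wants to try the two rules Tom mentions below before
looking at UpLib, here is a minimal Python sketch.  It is not UpLib's
code and not part of OCRopus; the function names (parse_bbox,
paragraph_starts) and the thresholds (indent_frac, gap_factor) are
assumptions chosen for illustration.  It assumes each ocr_line carries
a title attribute containing "bbox x0 y0 x1 y1", that lines arrive in
reading order, and that y grows downward.

```python
import re
from statistics import mean

# Matches the "bbox x0 y0 x1 y1" property in an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from a title string such as 'bbox 10 20 300 40'."""
    m = BBOX_RE.search(title)
    if m is None:
        raise ValueError("no bbox in title: %r" % title)
    return tuple(int(v) for v in m.groups())


def paragraph_starts(bboxes, indent_frac=0.03, gap_factor=1.5):
    """Return the indices of lines that appear to start a new paragraph.

    indent_frac and gap_factor are illustrative guesses, not tuned values.
    """
    if not bboxes:
        return []
    # "Substantially indented" is taken as a fraction of the text width.
    width = max(b[2] for b in bboxes) - min(b[0] for b in bboxes)
    min_indent = indent_frac * width
    # Vertical gap between each line's top and the previous line's bottom.
    gaps = [max(0, cur[1] - prev[3]) for prev, cur in zip(bboxes, bboxes[1:])]
    avg_gap = mean(gaps) if gaps else 0

    starts = [0]  # the first line always opens a paragraph
    for i in range(1, len(bboxes)):
        x0 = bboxes[i][0]
        # Rule 1: indented relative to the previous and the following line.
        indented = x0 - bboxes[i - 1][0] >= min_indent
        if i + 1 < len(bboxes):
            indented = indented and x0 - bboxes[i + 1][0] >= min_indent
        # Rule 2: clearly more space above this line than on average.
        big_gap = gaps[i - 1] > gap_factor * avg_gap
        if indented or big_gap:
            starts.append(i)
    return starts


lines = [
    (140, 0, 900, 30),     # indented first line
    (100, 40, 900, 70),
    (100, 80, 600, 110),
    (140, 120, 900, 150),  # indented again
    (100, 160, 900, 190),
    (100, 230, 900, 260),  # extra vertical space above
    (100, 270, 500, 300),
]
print(paragraph_starts(lines))  # prints [0, 3, 5]
```

An hOCR to hOCR transformer would then wrap each run of lines between
two start indices in its own paragraph element.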

Bill

On May 29, 8:38 am, Thomas Breuel <[email protected]> wrote:
> Not yet.  Paragraphs are actually tricky because there are many
> different ways of indicating them.  We may not put paragraph detection
> directly into OCRopus but leave that for an hOCR to hOCR transformer.
>
> You could write a simple such transformer and contribute it.
> Basically, you need to look at all the ocr_line elements and their
> corresponding bounding boxes.  A new paragraph starts either if the x0
> for a line is substantially indented relative to the previous and
> following lines, or if there is substantially more space between a
> line and the previous line than there is on average.
>
> Tom
>
> On Fri, May 29, 2009 at 17:07, [email protected] <[email protected]> wrote:
>
> > Hello. I need paragraphs instead of lines, too.
>
> > Did you make progress yet?
>
> > Michael
>
> > On 1 Apr., 02:56, Michael Moore <[email protected]> wrote:
> >> On Tue, Mar 31, 2009 at 3:03 AM, Duncan McGregor <[email protected]> wrote:
>
> >> > Do you want to group text into paragraph units, or just lay it out
> >> > more visually convincingly?
>
> >> I'd like to group it into paragraph units. The output will be
> >> displayed in a browser for the user and will then be copied and pasted
> >> into a word processor. The word processor will paste the content into
> >> paragraphs if the HTML is marked up as paragraphs, otherwise it's just
> >> one huge block of text.
>
> >> Being visually convincing is nice too, but not as important for
> >> this workflow.
>
> >> Thank you,
> >> Michael Moore
>
> >> > I don't know about the former, but for the latter I had some success
> >> > in adding css to the divs to move their position to where the bbox
> >> > said they should be.
>
> >> > Duncan McGregor
> >> > www.VelOCRaptor.com
>
> >> > On Mon, Mar 30, 2009 at 11:08 PM, Michael Moore <[email protected]> wrote:
>
> >> >> Are there any tools or options I can use to get my hocr output with
> >> >> paragraph tags?
>
> >> >> Many thanks,
> >> >> --
> >> >> Michael Moore
>
> >> --
> >> Michael Moore
> >> -------------------------
> >> Share your families' genealogy and family history books. It's easy and
> >> free: http://bookscanned.com