You could also use the paragraph finder in UpLib, in python/uplib/paragraphs.py. It incorporates a number of rules, including the ones that Tom mentions below.
http://uplib.parc.com/

Bill

On May 29, 8:38 am, Thomas Breuel <[email protected]> wrote:
> Not yet. Paragraphs are actually tricky because there are many
> different ways of indicating them. We may not put paragraph detection
> directly into OCRopus but leave that for an hOCR to hOCR transformer.
>
> You could write a simple such transformer and contribute it.
> Basically, you need to look at all the ocr_line elements and their
> corresponding bounding boxes. A new paragraph starts either if the x0
> for a line is substantially indented relative to the previous and
> following line, or if there is substantially more space between a line
> and its previous line than there is on average.
>
> Tom
>
> On Fri, May 29, 2009 at 17:07, [email protected]
> <[email protected]> wrote:
>
> > Hello. I need paragraphs instead of lines, too.
> > Did you make progress yet?
> > Michael
>
> > On 1 Apr., 02:56, Michael Moore <[email protected]> wrote:
> >> On Tue, Mar 31, 2009 at 3:03 AM, Duncan McGregor
> >> <[email protected]> wrote:
>
> >> > Do you want to group text into paragraph units, or just lay it out
> >> > more visually convincingly?
>
> >> I'd like to group it into paragraph units. The output will be
> >> displayed in a browser for the user and will then be copied and pasted
> >> into a word processor. The word processor will paste the content into
> >> paragraphs if the HTML is marked up as paragraphs, otherwise it's just
> >> one huge block of text.
>
> >> The being visually convincing is nice too, but not as important for
> >> this work flow.
>
> >> Thank you,
> >> Michael Moore
>
> >> > I don't know about the former, but for the latter I had some success
> >> > in adding css to the divs to move their position to where the bbox
> >> > said they should be.
>
> >> > Duncan McGregor
> >> > www.VelOCRaptor.com
>
> >> > On Mon, Mar 30, 2009 at 11:08 PM, Michael Moore <[email protected]>
> >> > wrote:
>
> >> >> Are there any tools or options I can use to get my hocr output with
> >> >> paragraph tags?
>
> >> >> Many thanks,
> >> >> --
> >> >> Michael Moore
>
> >> --
> >> Michael Moore
> >> -------------------------
> >> Share your families' genealogy and family history books. It's easy and
> >> free :http://bookscanned.com
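
For anyone who wants to try the two rules Tom describes in the quoted message (a line whose x0 is indented relative to both neighbours, or a vertical gap noticeably larger than the average), here is a rough Python sketch. It is not the UpLib or OCRopus code. The function name group_paragraphs and the thresholds indent_tol and gap_factor are made up for illustration and would need tuning per document. It assumes the ocr_line bounding boxes have already been pulled out of the hOCR as (x0, y0, x1, y1) tuples in reading order, with y growing downward.

```python
from statistics import mean


def group_paragraphs(boxes, indent_tol=15, gap_factor=1.5):
    """Group line boxes into paragraphs; returns lists of line indices.

    boxes: (x0, y0, x1, y1) per text line, top to bottom, y growing downward.
    indent_tol and gap_factor are illustrative thresholds, not tuned values.
    """
    if not boxes:
        return []

    # Vertical space between each line's top and the previous line's bottom.
    gaps = [max(0, boxes[i][1] - boxes[i - 1][3]) for i in range(1, len(boxes))]
    avg_gap = mean(gaps) if gaps else 0

    paragraphs = [[0]]  # the first line always opens a paragraph
    for i in range(1, len(boxes)):
        x0 = boxes[i][0]
        prev_x0 = boxes[i - 1][0]
        # The last line has no following line; compare with the previous only.
        next_x0 = boxes[i + 1][0] if i + 1 < len(boxes) else prev_x0

        # Rule 1: indented relative to both the previous and following line.
        indented = (x0 - prev_x0 > indent_tol) and (x0 - next_x0 > indent_tol)
        # Rule 2: substantially more space above this line than on average.
        big_gap = gaps[i - 1] > gap_factor * avg_gap

        if indented or big_gap:
            paragraphs.append([i])
        else:
            paragraphs[-1].append(i)
    return paragraphs


example = [
    (120, 0, 500, 20),     # indented first line
    (100, 25, 500, 45),
    (100, 50, 500, 70),
    (120, 75, 500, 95),    # indented: new paragraph
    (100, 100, 500, 120),
    (100, 160, 500, 180),  # large gap above: new paragraph
    (100, 185, 500, 205),
]
print(group_paragraphs(example))  # [[0, 1, 2], [3, 4], [5, 6]]
```

Each returned group could then be wrapped in a paragraph element when writing the hOCR back out. Other layouts (hanging indents, centred text, multiple columns) would need more rules than these two.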
