Re: missing lines after pages2lines conversion

Dave Wed, 19 Aug 2009 11:25:34 -0700

On Aug 17, 7:29 pm, tmbdev <[email protected]> wrote:
> >   I'm interested in using OCRopus for a big OCR project for some older
> > (18th century) texts.  It looks like OCRopus would work well for this,
> > especially with the learning capabilities.
>
> It probably will be, but keep in mind that OCRopus is still in alpha
> release and that 18th century texts can be very challenging;
> historical OCR is still a research topic.

Yes - we expect OCR on the texts to be a challenge, but I'm told we
may have full text for a number of the works to help with training.

>
> >   As a test, I randomly picked a page image to work with.
> > Unfortunately, when I ran pages2lines on it, one line was skipped, and
> > another line that I expected to be split wasn't.
>
> Layout analysis errors are probably the most frequent source of errors
> in OCR systems.
>
> OCRopus currently has a number of different page layout methods
> implemented:
> SegmentWords                     segwords
>     segment words by smearing
> SegmentPageByVORONOI             segvoronoi
>     segment page by Voronoi algorithm
> SegmentPageByXYCUTS              segxy
>     segment page by XY-Cut algorithm
> SegmentPageBy1CP                 seg1cp
>     segment characters by horizontal projection (assumes single
> SegmentPageByMorphTrivial        segmorphtriv
>     segment characters by horizontal projection (assumes single
> SegmentPageByRAST                segrast
>     Segment page by RAST
>
> The default is SegmentPageByRAST.
>
I may be missing something in the documentation - I don't see how to
change the page segmentation method to try one of these.  Is there an
option to do this?

> > In the first case, I do notice that when I look at the 0001.pseg.png
> > image, the missing line is shown in yellow - does this mean anything?
>
> Please see the Documentation section for the format of the pseg.png
> files.
>
> > For the two lines that weren't split, the second line is an
> > attribution for a quote, and in the horizontal direction, only
> > overlaps a small section of the preceding line.
>
> >   Any suggestions on how to recover my "missing" line and get my other
> > line to split the way I expected?
>
> Please submit a bug report with sample images attached.  However,
> layout analysis is a hard problem.  It becomes even harder for noisy
> and historical documents, so you should expect errors (with any OCR
> system).
>
> We've implemented other layout analysis methods and will be trying to
> incorporate those into OCRopus over time.  If you want to know more,
> have a look at pubs.iupr.org
>
> Tom