On Aug 17, 7:29 pm, tmbdev <[email protected]> wrote:
> > I'm interested in using OCRopus for a big OCR project for some older
> > (18th century) texts. It looks like OCRopus would work well for this,
> > especially with the learning capabilities.
>
> It probably will be, but keep in mind that OCRopus is still in alpha
> release and that 18th century texts can be very challenging;
> historical OCR is still a research topic.
Yes - we expect OCR on the texts to be a challenge, but I'm told we
may have full text for a number of the works to help with training.
>
> > As a test, I randomly picked a page image to work with.
> > Unfortunately, when I ran page2lines on it, one line was skipped, and
> > I was surprised to see that another line wasn't split.
>
> Layout analysis errors are probably the most frequent source of errors
> in OCR systems.
>
> OCRopus currently has a number of different page layout methods
> implemented:
>   SegmentWords               segwords
>       segment words by smearing
>   SegmentPageByVORONOI       segvoronoi
>       segment page by Voronoi algorithm
>   SegmentPageByXYCUTS        segxy
>       segment page by XY-Cut algorithm
>   SegmentPageBy1CP           seg1cp
>       segment characters by horizontal projection (assumes single
>   SegmentPageByMorphTrivial  segmorphtriv
>       segment characters by horizontal projection (assumes single
>   SegmentPageByRAST          segrast
>       segment page by RAST
>
> The default is SegmentPageByRAST.
>
I may be missing something in the documentation, but I don't see how to
change the page segmentation method to try one of these. Is there an
option to do this?
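
For anyone following along who hasn't met the term, here is roughly what
"horizontal projection" in the list above refers to. This is a generic
sketch in Python, not OCRopus code: the function name
split_lines_by_projection and the min_ink threshold are made up for
illustration, and the image is assumed to be already binarized (1 = ink,
0 = background).

```python
def split_lines_by_projection(image, min_ink=1):
    """Return (top, bottom) row ranges of text lines in a binary image.

    image:   list of rows, each a list of 0 (background) / 1 (ink).
    min_ink: hypothetical threshold; rows with fewer ink pixels than
             this are treated as gaps between lines.
    """
    # Horizontal projection profile: number of ink pixels in each row.
    profile = [sum(row) for row in image]

    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                    # entering a text line
        elif ink < min_ink and start is not None:
            lines.append((start, y))     # leaving a text line (end exclusive)
            start = None
    if start is not None:                # line runs to the bottom edge
        lines.append((start, len(profile)))
    return lines


# Tiny made-up "page": two text lines separated by one blank row.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
print(split_lines_by_projection(page))   # [(1, 3), (4, 5)]
```

A method like this only separates lines where there is a row gap across
the full width of the image, so it is only suitable for very simple
layouts.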
> > In the first case, I do notice that when I look at the 0001.pseg.png
> > image, the missing line is shown in yellow - does this mean anything?
>
> Please see the Documentation section for the format of the pseg.png
> files.
>
> > For the two lines that weren't split, the second line is an
> > attribution for a quote, and in the horizontal direction, only
> > overlaps a small section of the preceding line.
>
> > Any suggestions on how to recover my "missing" line and get my other
> > line to split the way I expected?
>
> Please submit a bug report with sample images attached. However,
> layout analysis is a hard problem. It becomes even harder for noisy
> and historical documents, so you should expect errors (with any OCR
> system).
>
> We've implemented other layout analysis methods and will be trying to
> incorporate those into OCRopus over time. If you want to know more,
> have a look at pubs.iupr.org
>
> Tom