> I'm interested in using OCRopus for a big OCR project for some older
> (18th century) texts. It looks like OCRopus would work well for this,
> especially with the learning capabilities.
It probably will, but keep in mind that OCRopus is still an alpha
release and that 18th century texts can be very challenging;
historical OCR is still a research topic.
> As a test, I randomly picked a page image to work with,
> unfortunately, when I did page2lines on it, one line was skipped, and
> there was another line that I was surprised to see wasn't split.
Layout analysis errors are probably the most frequent source of errors
in OCR systems.
OCRopus currently has a number of different page layout methods
implemented:
  SegmentWords               segwords
      segment words by smearing
  SegmentPageByVORONOI       segvoronoi
      segment page by Voronoi algorithm
  SegmentPageByXYCUTS        segxy
      segment page by XY-Cut algorithm
  SegmentPageBy1CP           seg1cp
      segment characters by horizontal projection (assumes single
  SegmentPageByMorphTrivial  segmorphtriv
      segment characters by horizontal projection (assumes single
  SegmentPageByRAST          segrast
      segment page by RAST
The default is SegmentPageByRAST.
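
To give a rough idea of what a projection-based method does, here is a
minimal Python sketch. It is not OCRopus code: the function name
split_lines_by_projection and the min_ink parameter are made up for
illustration, and the input is assumed to be an already binarized page
given as a list of rows of 0 (background) and 1 (ink). It sums the ink
in each pixel row and reports every run of non-empty rows as one line.

```python
def split_lines_by_projection(image, min_ink=1):
    """Return (top, bottom) row ranges of text lines in a binary image.

    image   -- list of rows, each a list of 0 (background) / 1 (ink)
    min_ink -- hypothetical threshold: rows with fewer ink pixels than
               this count as blank
    The bottom index is exclusive.
    """
    # Horizontal projection profile: amount of ink in each pixel row.
    profile = [sum(row) for row in image]

    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                      # a text line begins here
        elif ink < min_ink and start is not None:
            lines.append((start, y))       # a blank row ends the line
            start = None
    if start is not None:                  # line runs to the page bottom
        lines.append((start, len(profile)))
    return lines


# Toy page: a wide line, a short line directly below it with no blank
# row in between, then a blank row, then a third line.
page = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
]
print(split_lines_by_projection(page))
```

In this toy page the short second line is reported together with the
line above it, because no blank pixel row separates them. That is a
typical weakness of plain projection methods in general; it says
nothing about what any particular OCRopus segmenter did on your page.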
> In the first case, I do notice that when I look at the 0001.pseg.png
> image, the missing line is shown in yellow - does this mean anything?
Please see the Documentation section for the format of the pseg.png
files.
> For the two lines that weren't split, the second line is an
> attribution for a quote, and in the horizontal direction, only
> overlaps a small section of the preceding line.
>
> Any suggestions on how to recover my "missing" line and get my other
> line to split the way I expected?
Please submit a bug report with sample images attached. However,
layout analysis is a hard problem. It becomes even harder for noisy
and historical documents, so you should expect errors (with any OCR
system).
We've implemented other layout analysis methods and will be trying to
incorporate those into OCRopus over time. If you want to know more,
have a look at pubs.iupr.org.
Tom