Re: Word extraction

Thomas Breuel Mon, 02 Mar 2009 18:25:26 -0800

On Mon, Mar 2, 2009 at 03:59, Leo <[email protected]> wrote:

> I am looking for an algorithm in ocropus that allows word extraction
> from an image of a paragraph or line of text. At the moment I am using
> the make_StandardGrouper() function with CurvedCut segmentation to
> extract the character positions; however, it doesn't seem to work
> very well.



StandardGrouper + CurvedCut does not give you characters; it gives you a
large collection of character hypotheses, most of which aren't characters.

If you want characters, you need to store the character hypotheses in a
lattice and then select the best path with a language model.
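
As a rough illustration of the lattice idea (a generic sketch, not the
OCRopus code), the hypotheses can be stored as edges between cut positions
and the cheapest path found by dynamic programming.  The names best_path
and bigram_cost, the edge format, and the toy costs are all assumptions
made up for this example.

```python
import math
from collections import defaultdict

def best_path(edges, n_nodes, bigram_cost):
    """Return (total_cost, text) of the cheapest path from node 0 to the last node.

    edges: list of (start, end, char, recognition_cost) with start < end,
           where nodes are cut positions along the text line.
    bigram_cost(prev_char, char): hypothetical language-model cost;
           prev_char is "" at the start of the line.
    """
    # best[node][last_char] = (cost, text) of the cheapest path ending there
    best = [dict() for _ in range(n_nodes)]
    best[0][""] = (0.0, "")
    by_start = defaultdict(list)
    for start, end, char, cost in edges:
        by_start[start].append((end, char, cost))
    # Nodes are already in topological order because start < end.
    for node in range(n_nodes):
        for prev, (cost_so_far, text) in list(best[node].items()):
            for end, char, cost in by_start[node]:
                total = cost_so_far + cost + bigram_cost(prev, char)
                if char not in best[end] or total < best[end][char][0]:
                    best[end][char] = (total, text + char)
    if not best[n_nodes - 1]:
        return math.inf, None  # no complete path through the lattice
    return min(best[n_nodes - 1].values(), key=lambda item: item[0])

# Toy lattice: the first two segments are either "r" + "n" or a single "m".
edges = [(0, 1, "r", 1.0), (1, 2, "n", 1.0), (0, 2, "m", 1.5), (2, 3, "e", 0.5)]
toy_lm = lambda prev, ch: 2.0 if (prev, ch) == ("r", "n") else 0.5
print(best_path(edges, 4, toy_lm))
```

The language model is what rejects the "rn" reading here; the recognition
costs alone would not separate the two paths as clearly.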


> Is there any word segmentation algorithm currently
> implemented in Ocropus that allows me to extract or find out the
> position of each word within an image?


There are two kinds of word segmentation: image-based and
OCR-output-based.  Image-based word segmentation doesn't require OCR, but
it is also not very accurate and only works for Latin scripts.
Output-based segmentation works for all languages.  In principle, you can
do both with OCRopus.
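
To make the image-based variant concrete, here is a minimal sketch of one
common approach: treat runs of blank columns wider than a threshold as word
gaps.  This is a generic technique, not necessarily what any OCRopus
component does; split_words_by_gaps and min_gap are names invented for
this example, and the input is assumed to be a binarized image of a single
text line.

```python
def split_words_by_gaps(line, min_gap):
    """Split a binarized text line into words at wide horizontal gaps.

    line: list of pixel rows, each a list of 0/1 values (1 = ink).
    min_gap: number of consecutive blank columns that counts as a word gap.
    Returns a list of (x_start, x_end) column ranges, x_end exclusive.
    """
    width = len(line[0]) if line else 0
    # A column has ink if any row has a foreground pixel in it.
    ink = [any(row[x] for row in line) for x in range(width)]
    words = []
    start = None   # first column of the word being built
    gap = 0        # blank columns seen since the last ink column
    for x, has_ink in enumerate(ink):
        if has_ink:
            if start is None:
                start = x
            elif gap >= min_gap:
                words.append((start, x - gap))  # close the previous word
                start = x
            gap = 0
        elif start is not None:
            gap += 1
    if start is not None:
        words.append((start, width - gap))      # close the last word
    return words

row = [1, 1, 0, 1, 0, 0, 0, 1, 1, 0]
print(split_words_by_gaps([row], min_gap=3))
```

A fixed gap threshold is the weak point: it depends on font size and
spacing, which is one reason this kind of segmentation is not very
accurate.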

The way this is done is changing.  In the next release of OCRopus, you
should be able to get the word bounding boxes from character bounding boxes
in the hOCR output.  Right now, it's a little more complicated.  My
suggestion would be to wait a couple of weeks.
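
Deriving word boxes from character boxes amounts to taking the union of
the boxes between spaces.  A minimal sketch, assuming you already have the
recognized characters with (x0, y0, x1, y1) boxes in reading order; the
function name word_boxes and the input format are assumptions for this
example, not the OCRopus or hOCR API.

```python
def word_boxes(chars):
    """Group recognized characters into words and union their boxes.

    chars: list of (character, bbox) in reading order, where bbox is
           (x0, y0, x1, y1); whitespace entries separate words and their
           bbox is ignored (it may be None).
    Returns a list of (word, (x0, y0, x1, y1)).
    """
    words = []
    text, box = "", None
    for ch, bbox in chars:
        if ch.isspace():
            if text:
                words.append((text, box))
            text, box = "", None
            continue
        x0, y0, x1, y1 = bbox
        if box is None:
            box = (x0, y0, x1, y1)
        else:
            # Grow the word box to cover the new character.
            box = (min(box[0], x0), min(box[1], y0),
                   max(box[2], x1), max(box[3], y1))
        text += ch
    if text:
        words.append((text, box))
    return words

chars = [("H", (10, 5, 20, 30)), ("i", (22, 8, 26, 30)), (" ", None),
         ("y", (40, 12, 50, 38)), ("o", (52, 12, 62, 30))]
print(word_boxes(chars))
```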

I believe Faisal wrote an image-based word segmenter; maybe he can answer
about how/whether you can use that.

Tom
