Hello, after reading some documentation about FSTs in general and OpenFST in particular, I have tried to build a simple language model. After many attempts I still have no result. Following the www.openfst.org documentation and other tutorials such as http://www.stringology.org/event/CIAA2007/pres/Tue2/Riley.pdf, I tried to build an FST for my recognition task. I still have many doubts about how OCRopus uses the FST.
1) If I have to recognize A, should the input label be just A?
2) On what principle are the weights defined?
Also, I noticed that all the final weights in the default.fst file are
very large, like 1.99999994e+38.
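For reference on the weight question: in OpenFst's default (tropical) semiring, a weight is a cost, commonly the negative log of a probability. Costs are added along a path, and an infinite cost means "not allowed", so a very large final weight makes a state behave almost like a non-final one. The sketch below only illustrates that arithmetic. Whether OCRopus interprets the weights in exactly this way is an assumption here, and the helper names `cost` and `path_cost` are made up for the illustration.

```python
import math

def cost(probability):
    """Tropical-semiring cost of a probability: its negative natural log."""
    return math.inf if probability == 0 else -math.log(probability)

def path_cost(arc_costs, final_cost):
    """Total cost of a path: arc costs plus the final weight of the last state."""
    return sum(arc_costs) + final_cost

arcs = [0.6, 0.6]                     # the two A arcs from the test FST
cheap = path_cost(arcs, 0.0)          # final weight 0: the path costs only its arcs
huge = path_cost(arcs, 1.99999994e+38)  # the arc costs vanish next to this value

print(cheap)
print(huge)
print(cost(0.5))  # cost of a 50% probability, about 0.693
```

Under this reading, a lower total means a more likely path, and a path ending in a state with a final weight near 2e38 would lose against almost any alternative.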
I ran many tests, but I report only the last one here.
The other tests produced either an empty string or a strange character
(a little square with 4 little numbers inside, arranged 2x2).
The last test:
I tried to recognize a very simple text: AA
(a single letter that is normally recognized reliably).
For that I built a very simple FST.
For the input symbols (test.isyms file):
> A 1

For the output symbols (test.osyms file):
> A 1

In the textual representation of the transducer:
> 0 1 A A 0.6
> 1 2 A A 0.6
> 2 1.99999994e+38

This gives the FST shown in the attached figure.
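The files above can also be generated programmatically. The following is a minimal sketch, not a confirmed fix: it emits the symbol table and the transducer in OpenFst's text format, with two changes that are assumptions about what the tools expect. Label 0 is reserved for epsilon (the usual OpenFst convention), and the final state gets a weight of 0 instead of 1.99999994e+38. The function names `symbol_table` and `chain_fst` are hypothetical.

```python
def symbol_table(symbols):
    """Symbol-table text; label 0 is kept for epsilon, real symbols start at 1."""
    lines = ["<eps> 0"]
    lines += [f"{sym} {idx}" for idx, sym in enumerate(symbols, start=1)]
    return "\n".join(lines) + "\n"

def chain_fst(text, arc_cost=0.6, final_cost=0.0):
    """Text-format transducer accepting exactly `text`, one arc per character."""
    # Arc lines are: source-state destination-state input output weight
    lines = [f"{i} {i + 1} {ch} {ch} {arc_cost}" for i, ch in enumerate(text)]
    # Final-state line is: state weight
    lines.append(f"{len(text)} {final_cost}")
    return "\n".join(lines) + "\n"

SYMS = symbol_table(["A"])
FST = chain_fst("AA")
print(SYMS, FST, sep="")

# After saving SYMS as test.isyms and test.osyms and FST as test.txt,
# OpenFst can compile the text file with:
#   fstcompile --isymbols=test.isyms --osymbols=test.osyms test.txt test.fst
```

Whether OCRopus accepts the resulting file directly is not something this sketch checks.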
When I run ocropus with my FST I get:
overco...@overcomer-laptop:~/Scrivania/prova$ ocropus page AA.png
[beam search failed]
If I change the weight of the final state (2), setting it to 1 for
example, I get an empty string:
overco...@overcomer-laptop:~/Scrivania/prova$ ocropus page AA.png
overco...@overcomer-laptop:~/Scrivania/prova$ _
Of course, running ocropus with the default.fst file works perfectly.
What is wrong with my FST?
I really hope that someone can help me. I am running a lot of tests,
proceeding by trial and error, and it is driving me crazy.
I hope I was clear, and I am sorry for the length of this message.
Thanks.
2009/5/15 Thomas Breuel <[email protected]>
> On Fri, May 15, 2009 at 10:18, Pierpaolo Monaco <
> [email protected]> wrote:
>
>> Using tesseract I can limit the output with a shell command.
>>
>> I just need to create a file in tesseract-ocr/tessdata/configs/ that,
>> for example, I call myletters.
>> In that file I define the whitelist by writing:
>>
>> tessedit_char_whitelist QWERTYUIOPASDFGHJKLZXCVBNM
>>
>> After that I can process an image by running:
>>
>> $ tesseract prova.tif out nobatch myletters
>>
>> The result will contain only upper-case letters (the letters from my
>> whitelist).
>>
>> Can I do something like that in ocropus, or do I need to do it with a
>> language model?
>>
>
> You need a language model for that, but a pretty simple one. The language
> model you need is the equivalent of "[A-Z]*". You can create something as
> simple as that by hand even; you just need one or two states, plus a
> transition for each permitted letter. See the OpenFST documentation (you do
> not need to use OpenFST, but OCRopus uses the same representation).
>
> If you want good recognition performance, you should also retrain the
> classifier on just your target character set.
>
> I've written an overview paper describing how all the bits and pieces of
> OCRopus fit together; I'll try and put that up publicly in a couple of
> weeks.
>
> After that, I'll revise the tutorial to conform to OCRopus 0.4.
>
> Tom
>
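
The "[A-Z]*" language model described in the quoted reply can be sketched as a single state that is both initial and final, with one self-loop per permitted letter. This is only an illustration in OpenFst's text format. The zero arc costs, the epsilon entry at label 0, and the function names `whitelist_symbols` and `whitelist_fst` are assumptions, not something taken from the OCRopus sources.

```python
import string

def whitelist_symbols(letters):
    """Symbol-table text; label 0 is kept for epsilon, letters start at 1."""
    lines = ["<eps> 0"]
    lines += [f"{ch} {idx}" for idx, ch in enumerate(letters, start=1)]
    return "\n".join(lines) + "\n"

def whitelist_fst(letters, arc_cost=0.0):
    """One-state transducer for letters*: a self-loop on state 0 per letter."""
    lines = [f"0 0 {ch} {ch} {arc_cost}" for ch in letters]
    lines.append("0")  # state 0 is final; no weight given means zero cost
    return "\n".join(lines) + "\n"

LETTERS = string.ascii_uppercase
WHITELIST_SYMS = whitelist_symbols(LETTERS)
WHITELIST_FST = whitelist_fst(LETTERS)
print(WHITELIST_FST)
```

With equal costs on every loop, the model only restricts the alphabet and expresses no preference between letters.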
--
-----------------------------
Pierpaolo Monaco
----------------------------
<<attachment: test.png>>
