Re: [ocropus] page level ground truth alignment in 0.6, old char model and collection of receipts?

Nathan K Mon, 25 Mar 2013 00:26:36 -0700

Thanks Tom. I like the ocropus codebase much more - there is some really
interesting stuff in there. My language of choice is Python, so there is
that too. Having said that, one must use the right tool for the job. It does
seem that tesseract is giving much better results than when I last tried
it. However, this is probably because I've implemented awesome
preprocessing now :)


Over the last day I've dug through the code and thought I'd report my
findings, as documentation is pretty light at the moment. IMHO this is a
major hurdle for the project, as it makes it very difficult for potential
contributors to get to the point where they can submit pull
requests/patches. I'd be happy to add some documentation on my workflow
when I figure out exactly what it is :)

Documentation that was helpful:
-------
- Training examples
- Source code comments for all commands in the ocropy folder
- Notebook folder - which makes use of IPython's notebook tool, which was
new to me. But trust me - much better than reading the json files. Check
which branch/tag you are looking at. I think Tom added some more notebooks
back in Dec 12.

How I started to build a character model
------
I gave up on creating ground truth at the line level in the absence of a
tool that would help me. I was hardly going to create a text file for each
line and manually populate it with data from my page level ground truth.
I'm sure I'm missing something here, but I think most people on the list
must be enjoying the weekend.
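
For anyone in the same spot, the kind of 'close enough' line matching I have
in mind could be sketched roughly as below. This is only a minimal sketch
using Python's standard difflib module, not what ocropus does or did. The
function name align_lines, the 0.6 threshold and the sample strings are all
made up for illustration.

```python
import difflib

def align_lines(ocr_lines, gt_lines, min_ratio=0.6):
    """Pair each rough OCR line with the most similar ground-truth line.

    Returns a list of (ocr_index, gt_index or None, ratio) tuples.
    A pair is only accepted when the similarity ratio reaches min_ratio.
    """
    pairs = []
    for i, ocr in enumerate(ocr_lines):
        best_j, best_ratio = None, 0.0
        for j, gt in enumerate(gt_lines):
            ratio = difflib.SequenceMatcher(None, ocr, gt).ratio()
            if ratio > best_ratio:
                best_j, best_ratio = j, ratio
        if best_ratio < min_ratio:
            best_j = None  # nothing close enough, leave this line unmatched
        pairs.append((i, best_j, best_ratio))
    return pairs

# Hypothetical data: noisy OCR output and the lines of a page-level transcript.
ocr_lines = ["T0TAL 12.5O", "Thank y0u", "#####"]
gt_lines = ["Thank you", "TOTAL 12.50"]
for ocr_idx, gt_idx, ratio in align_lines(ocr_lines, gt_lines):
    print(ocr_idx, gt_idx, round(ratio, 2))
```

Unmatched lines come back with None, so they can be reviewed by hand. Note
that this compares every OCR line with every transcript line, which is fine
for a receipt but would be slow on long pages.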

Instead I took Tom's advice and turned to tesseract to generate box files. I
didn't bother editing these, as you can do that in the veeerrry nice
ocropus-cedit tool. All that was required was passing the 'tess2h5' argument
to the ocropus-db command. (Note: this does not show up in the help, so dig
into the source. It also required specifying an -o file, which was not
documented in the examples.) Then, running ocropus-cedit, I could correct the
errors tesseract made.
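
As an aside, the box files are plain text, so they are easy to sanity-check
before importing. Here is a minimal sketch, assuming the six-field
'char left bottom right top page' layout that Tesseract 3.x writes, with
coordinates measured from the bottom-left corner of the image. The name
parse_box_lines and the sample values are made up for illustration and are
not part of ocropus or tesseract.

```python
from collections import namedtuple

Box = namedtuple("Box", "char left bottom right top page")

def parse_box_lines(lines):
    """Parse lines laid out as 'char left bottom right top page'."""
    boxes = []
    for line in lines:
        # Split from the right so the symbol field is kept intact.
        parts = line.rstrip("\n").rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        try:
            left, bottom, right, top, page = (int(p) for p in parts[1:])
        except ValueError:
            continue  # skip lines whose coordinates are not integers
        boxes.append(Box(parts[0], left, bottom, right, top, page))
    return boxes

# Hypothetical sample lines; a real file would be read with open(...).
sample = ["T 36 92 53 116 0", "o 55 92 70 108 0", "", "bad line"]
for box in parse_box_lines(sample):
    print(box.char, box.right - box.left, box.top - box.bottom)
```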

And that's pretty much where I'm up to.


Other thoughts
------
- I'd love to get my head around generating the page level gt. I believe
this relied on OpenFST, which I tried to get working today, but it doesn't
seem to be used any more by ocropus.

- What is the recommended way to submit changes/fixes? I've got several
images that cause various components in the pipeline to fail. I've gone in
and added some try/excepts to make it fail gracefully. I'm
more familiar with github.
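
The general shape of a fail-gracefully wrapper might look like the sketch
below. It is an illustration of the pattern only, not the actual patches:
run_step_safely and fake_binarize are hypothetical names and do not exist in
ocropus.

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("pipeline")

def run_step_safely(step, items):
    """Apply step to every item, collecting failures instead of aborting."""
    results, failures = [], []
    for item in items:
        try:
            results.append((item, step(item)))
        except Exception as exc:
            # Record the problem and carry on with the next item.
            log.warning("skipping %s: %s", item, exc)
            failures.append((item, exc))
    return results, failures

def fake_binarize(name):
    # Hypothetical stand-in for a real pipeline stage.
    if name.endswith(".bad"):
        raise ValueError("could not read image")
    return name + ".bin"

results, failures = run_step_safely(fake_binarize, ["a.png", "b.bad", "c.png"])
print(len(results), "ok,", len(failures), "failed")
```

Keeping the list of failures around makes it easy to attach the offending
images to a bug report later.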

Okay, time for some rest.

Thanks for all your efforts, developers! It's great to see how the project is
coming along 2 years on.

Cheers,

Nathan


On 24 March 2013 07:11, Tom Morris <[email protected]> wrote:

> I can't help with your ground truth question, but unless you're absolutely
> committed to Ocropus, I'd suggest checking out Tesseract.  My impression is
> that it's not only more mature, but it's got a much more active community
> supporting it.
>
> Tom
>
>
> On Sat, Mar 23, 2013 at 9:50 PM, Nathan K <[email protected]> wrote:
>
>> Just to clarify - looking over the examples
>>
>> fraktur-boxes says:
>>
>> "The next training step consists of retraining the model by aligning text
>> lines with ground truth (see the example in uw3-500)"
>>
>> And in the uw3-500 example, data is downloaded with ground truth already
>> placed at the line level. Thus it is not clear what one should do to
>> automatically generate line level ground truth from page level ground truth
>> text files. I remember there was some tool that would enable this in the
>> past. It worked on the principle of finding a line match that was 'close
>> enough' based on a cost function. This enabled bootstrapping of a character
>> model.
>>
>> Is this approach still valid? I could generate a character model using
>> clustering and then manually review the results and then iterate. This
>> however would still not yield ground truth for determining the error, or
>> generating a language model.
>>
>> Thanks for your assistance if you're in the know! Been pulling my hair
>> out all day!
>>
>> Cheers,
>>
>> Nathan
>>
>>
>> On 23 March 2013 14:34, Nathan K <[email protected]> wrote:
>>
>>> Hey OCRopus Group,
>>> It's been a while in here, but I've just begun to update some old hacky
>>> scripts from 0.4.4 to 0.6. I'm very pleased to see the work that's been
>>> going on. Nice to see things more pythonic! I can't figure out how to
>>> align the page level ground truth to a page. My memory may be failing me,
>>> but I remember this very neat process where ocropus would automagically
>>> align page lines with a text transcription of the page. My goal is to
>>> regenerate my character training model, and also a language model. Would
>>> greatly appreciate any tips to that effect.
>>>
>>> Also, have there been some changes to the character models since 0.4.4? I
>>> tried to use an old one which I remember doing quite a bit of work on, and
>>> it fails to unpickle.
>>>
>>> Lastly, does anyone have/know of a collection/database of receipts that
>>> could be used for training? I've asked friends and family and have so far
>>> only received 50 documents - some quite poor quality. Perhaps a couple of
>>> people keep digital records for tax purposes and would be happy to share.
>>> Happy to keep them confidential if required.
>>>
>>> Cheers,
>>>
>>> Nathan
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "ocropus" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msg/ocropus/-/I8eeJdqGLCoJ.
>>> For more options, visit https://groups.google.com/groups/opt_out.
>>>
>>>
>>>
>>
>>
>>
>> --
>>
>>
>>
>>
>> Nathan Keilar
>> Hunted Hive Web Studio ~ Innovative Solutions For Real World Problems
>> Technical Director and Business Manager
>>
>> EMAIL:                    [email protected]
>> PHONE:                  +61 (0) 7 3040 3065
>> SKYPE/TWITTER:  https://twitter.com/#!/madteckhead
>> FACEBOOK:           http://www.facebook.com/nathan.keilar
>> WEB:                       http://madteckhead.com
>>
>> This email (including any attachments) is confidential and may be
>> privileged. If you have received it in error, please notify the sender by
>> return email and delete this message from your system. Any unauthorised use
>> or dissemination of this message in whole or in part is strictly
>> prohibited. Please note that emails are susceptible to change and we will
>> not be liable for the improper or incomplete transmission of the
>> information contained in this communication nor for any delay in its
>> receipt or damage to your system. We do not guarantee that the integrity of
>> this communication has been maintained nor that this communication is free
>> of viruses, interceptions or interference.
>>
>>
>>
>>
>
>

