Just to clarify: looking over the examples, fraktur-boxes says:

"The next training step consists of retraining the model by aligning text lines with ground truth (see the example in uw3-500)"

But in the uw3-500 example, the data is downloaded with ground truth already placed at the line level. So it is not clear what one should do to automatically generate line-level ground truth from page-level ground truth text files.

I remember there was a tool that enabled this in the past. It worked on the principle of finding a line match that was 'close enough' based on a cost function, which made it possible to bootstrap a character model. Is this approach still valid?

I could generate a character model using clustering, manually review the results, and then iterate. However, this would still not yield ground truth for determining the error or for generating a language model.

Thanks for your assistance if you're in the know! Been pulling my hair out all day!

Cheers,
Nathan

On 23 March 2013 14:34, Nathan K <[email protected]> wrote:
> Hey OCRopus Group,
>
> It's been a while since I was in here, but I've just begun to update some
> old hacky scripts from 0.4.4 to 0.6. I'm very pleased to see the work
> that's been going on. Nice to see things more pythonic! I can't figure out
> how to align the page-level ground truth to a page. My memory may be
> failing me, but I remember this very neat process where ocropus would
> automagically align page lines with a text transcription of the page. My
> goal is to regenerate my character training model, and also a language
> model. I would greatly appreciate any tips to that effect.
>
> Also, have there been any changes to the character models since 0.4.4? I
> tried to use an old one, which I remember doing quite a bit of work on,
> and it fails to unpickle.
>
> Lastly, does anyone have or know of a collection/database of receipts that
> could be used for training? I've asked friends and family and have so far
> only received 50 documents, some of quite poor quality.
> Perhaps a couple of people keep digital records for tax purposes and would
> be happy to share. Happy to keep them confidential if required.
>
> Cheers,
>
> Nathan
>
> --
> You received this message because you are subscribed to the Google Groups
> "ocropus" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msg/ocropus/-/I8eeJdqGLCoJ.
> For more options, visit https://groups.google.com/groups/opt_out.

--
Nathan Keilar
Hunted Hive Web Studio ~ Innovative Solutions For Real World Problems
Technical Director and Business Manager
EMAIL: [email protected]
PHONE: +61 (0) 7 3040 3065
SKYPE/TWITTER: https://twitter.com/#!/madteckhead
FACEBOOK: http://www.facebook.com/nathan.keilar
WEB: http://madteckhead.com
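
To make the 'close enough' line matching idea from this thread concrete, here is a rough sketch of what I mean. It is not the old OCRopus tool and does not use any OCRopus API: the names `match_cost` and `align_lines`, the `max_cost` threshold, and the choice of Python's `difflib` similarity as the cost function are all assumptions for illustration. It also assumes the page-level transcript has already been split into one string per text line, and that a rough first-pass recognition of each line image is available.

```python
from difflib import SequenceMatcher


def match_cost(ocr_text, truth_text):
    """Return a cost in [0, 1]; 0 means the two strings are identical."""
    return 1.0 - SequenceMatcher(None, ocr_text, truth_text).ratio()


def align_lines(ocr_lines, truth_lines, max_cost=0.4):
    """Greedily pair each rough OCR line with its cheapest transcript line.

    Returns a list of (ocr_index, truth_index, cost) tuples. OCR lines whose
    best remaining match costs more than max_cost are left unpaired so they
    can be reviewed by hand. (Hypothetical helper, for illustration only.)
    """
    pairs = []
    used = set()  # transcript lines that have already been assigned
    for i, ocr_text in enumerate(ocr_lines):
        candidates = [
            (match_cost(ocr_text, truth_text), j)
            for j, truth_text in enumerate(truth_lines)
            if j not in used
        ]
        if not candidates:
            break  # every transcript line has been assigned
        cost, j = min(candidates)
        if cost <= max_cost:
            pairs.append((i, j, cost))
            used.add(j)
    return pairs


if __name__ == "__main__":
    ocr = ["Tbe quick brovvn fox", "jumps ovcr the lazy dog"]
    truth = ["jumps over the lazy dog", "The quick brown fox"]
    for i, j, cost in align_lines(ocr, truth):
        print(f"{ocr[i]!r} -> {truth[j]!r} (cost {cost:.2f})")
```

The accepted pairs could then serve as line-level ground truth for retraining, with the rejected lines set aside for manual review. A greedy match like this ignores reading order; a real aligner would more likely enforce a monotonic alignment over the whole page.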
