Thanks Tom, I like the codebase of ocropus much more - there is some really interesting stuff in there. My language of choice is Python, so there is that too. Having said that, one must use the right tool for the job. It does seem that tesseract is giving much better results than when I last tried it. However, this is probably because I've implemented awesome preprocessing now :)
Over the last day I've dug through the code and thought I'd report my findings, as documentation is pretty light at the moment. IMHO this is a major hurdle for the project, as it makes it very difficult for potential contributors to get to the point where they can submit pull requests/patches. I'd be happy to add some documentation on my workflow when I figure out exactly what it is :)

Documentation that was helpful:
-------
- Training examples
- Source code comments for all commands in ocropy folder
- Notebook folder - which makes use of IPython's notebook tool, which was new to me. But trust me - much better than reading the json files. Check which branch/tag you are looking at. I think Tom added some more notebooks back in Dec 12.

How I started to build a character model
------
I gave up on creating ground truth at the line level in the absence of a tool that would help me. I was hardly going to create a text file for each line and manually populate it with data from my page level ground truth. I'm sure I'm missing something here, but I think most people on the list must be enjoying the weekend.

Instead I took Tom's advice and turned to tesseract to generate box files. I didn't bother editing these, as you can do that in the veeerrry nice ocropus-cedit tool. All that was required was using the 'tess2h5' argument to the ocropus-db command. (Note: this does not show up in the help, so dig into the source; it required specifying an -o file that was not documented in the examples.) Then, running ocropus-cedit, I could correct the errors tesseract made.

And that's pretty much where I'm up to.

Other thoughts
------
- I'd love to get my head around generating the page level gt. I believe this relied on OpenFST, which I tried to get working today, but it doesn't seem to be used any more by ocropus.
- What is the recommended way to submit changes and fixes? I've got several images that cause various components in the pipeline to fail.
  I've gone in and added some try/excepts to make it fail gracefully. I'm more familiar with GitHub.

Okay, time for some rest. Thanks for all your efforts, developers! It's great to see how the project is coming along 2 years on.

Cheers,

Nathan

On 24 March 2013 07:11, Tom Morris <[email protected]> wrote:

> I can't help with your ground truth question, but unless you're absolutely
> committed to Ocropus, I'd suggest checking out Tesseract. My impression is
> that it's not only more mature, but it's got a much more active community
> supporting it.
>
> Tom
>
>
> On Sat, Mar 23, 2013 at 9:50 PM, Nathan K <[email protected]> wrote:
>
>> Just to clarify - looking over the examples
>>
>> fraktur-boxes says:
>>
>> "The next training step consists of retraining the model by aligning text
>> lines with ground truth (see the example in uw3-500)"
>>
>> And in the uw3-500 example data is downloaded with ground truth already
>> placed at the line level. Thus it is not clear what one should do to
>> automatically generate line level ground truth from page level ground truth
>> text files. I remember there was some tool that would enable this in the
>> past; it worked on the principle of finding a line match that was 'close
>> enough' based on a cost function. This enabled bootstrapping of a character
>> model.
>>
>> Is this approach still valid? I could generate a character model using
>> clustering and then manually review the results and then iterate. This
>> however would still not yield ground truth for determining the error, or
>> generating a language model.
>>
>> Thanks for your assistance if you're in the know! Been pulling my hair
>> out all day!
>>
>> Cheers,
>>
>> Nathan
>>
>>
>> On 23 March 2013 14:34, Nathan K <[email protected]> wrote:
>>
>>> Hey OCRopus Group,
>>> It's been a while in here, but I've just begun to update some old hacky
>>> scripts from 0.4.4 to 0.6. I'm very pleased to see the work that's been
>>> going on. Nice to see things more pythonic!
>>> I can't figure out how to
>>> align the page level ground truth to a page. My memory may be failing me,
>>> but I remember this very neat process where ocropus would automagically
>>> align page lines with a text transcription of the page. My goal is to
>>> regenerate my character training model, and also a language model. Would
>>> greatly appreciate any tips to that effect.
>>>
>>> Also, have there been some changes to the character models since 0.4.4? I
>>> tried to use an old one which I remember doing quite a bit of work on, and
>>> it fails to unpickle.
>>>
>>> Lastly, does anyone have/know of a collection/database of receipts that
>>> could be used for training? I've asked friends and family and have so far
>>> only received 50 documents - some quite poor quality. Perhaps a couple of
>>> people keep digital records for tax purposes and would be happy to share.
>>> Happy to keep them confidential if required.
>>>
>>> Cheers,
>>>
>>> Nathan
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "ocropus" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msg/ocropus/-/I8eeJdqGLCoJ.
>>> For more options, visit https://groups.google.com/groups/opt_out.
>>>
>>
>> --
>> Nathan Keilar
>> Hunted Hive Web Studio ~ Innovative Solutions For Real World Problems
>> Technical Director and Business Manager
>>
>> EMAIL: [email protected]
>> PHONE: +61 (0) 7 3040 3065
>> SKYPE/TWITTER: https://twitter.com/#!/madteckhead
>> FACEBOOK: http://www.facebook.com/nathan.keilar
>> WEB: http://madteckhead.com
>>
>> This email (including any attachments) is confidential and may be
>> privileged. If you have received it in error, please notify the sender by
>> return email and delete this message from your system.
>> Any unauthorised use
>> or dissemination of this message in whole or in part is strictly
>> prohibited. Please note that emails are susceptible to change and we will
>> not be liable for the improper or incomplete transmission of the
>> information contained in this communication nor for any delay in its
>> receipt or damage to your system. We do not guarantee that the integrity of
>> this communication has been maintained nor that this communication is free
>> of viruses, interceptions or interference.
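
The 'close enough' line matching mentioned in the quoted message above (pairing recognised lines with page level ground truth using a cost function) can be sketched roughly as follows. This is not OCRopus code, and not necessarily how the old tool worked: the function names `line_cost` and `align_lines`, the 0.3 threshold, and the use of Python's `difflib` similarity ratio as the cost are all assumptions made for illustration. The recognised lines could come from any OCR output, for example tesseract.

```python
from difflib import SequenceMatcher


def line_cost(ocr_line, gt_line):
    """Cost in [0, 1]: 0 means identical, 1 means nothing in common.

    Assumption: difflib's similarity ratio stands in for whatever cost
    function the original alignment tool used.
    """
    return 1.0 - SequenceMatcher(None, ocr_line, gt_line).ratio()


def align_lines(ocr_lines, gt_lines, max_cost=0.3):
    """Greedily pair each OCR line with its cheapest unused ground truth line.

    Returns a list of (ocr_index, gt_index, cost) tuples. OCR lines whose
    best remaining match costs more than max_cost (a hypothetical threshold)
    are left out so they can be reviewed by hand.
    """
    pairs = []
    used = set()
    for i, ocr in enumerate(ocr_lines):
        best = None  # (gt_index, cost) of the cheapest candidate so far
        for j, gt in enumerate(gt_lines):
            if j in used:
                continue
            cost = line_cost(ocr, gt)
            if best is None or cost < best[1]:
                best = (j, cost)
        if best is not None and best[1] <= max_cost:
            used.add(best[0])
            pairs.append((i, best[0], best[1]))
    return pairs


if __name__ == "__main__":
    # Noisy recognised lines, in reading order.
    ocr = ["Tbe quick brovvn fox", "jumps ovcr the lazy dog", "@@##"]
    # Page level ground truth split on newlines, deliberately shuffled.
    gt = ["jumps over the lazy dog", "The quick brown fox", "An unrelated line"]
    for ocr_idx, gt_idx, cost in align_lines(ocr, gt):
        print(ocr_idx, gt_idx, round(cost, 3))
```

The matching is greedy, so an early recognised line can claim a ground truth line that a later one fits better; anything above the threshold is simply left unmatched rather than forced into a pair.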
