Actually, it takes surprisingly little data: after a few thousand lines of text, you already get pretty readable results for Latin text.
You can train on simulated data as well with good results: a tool for generating training data artificially is included (but probably requires a bit of adaptation for other scripts).

Tom

On Tuesday, December 23, 2014 6:40:17 PM UTC-8, Shibamouli Lahiri wrote:
> Hi Tom,
>
> Thanks much for the update. I'm new to Ocropus, and I had a question on
> running rtrain.
>
> Do you know (or have an estimate of) how many lines of text does the
> program take (to train) before it starts giving reasonable results? I'm
> wondering because since it's neural network based, I'd hazard a guess that
> it'd take more than a few thousand lines?
>
> More details: I'm working on gathering labeled data for Bengali (Bangla)
> OCR, and needed an estimate of lines that I'll need to transcribe as a
> starter.
>
> Regards,
> Shibamouli
>
> On Wednesday, December 17, 2014 2:40:11 PM UTC-5, Tom wrote:
>> With the new recognizer, it should be pretty easy to train. We've trained
>> it for other scripts purely from generated data and gotten pretty good
>> results.
>>
>> I'll try to create some more documentation and some simpler training
>> scripts.
>>
>> Tom
>>
>> On Wednesday, December 17, 2014 5:36:34 AM UTC-8, 81+ yrsold wrote:
>>> Tom,
>>> I am really happy - you have resumed ocropus project again. Trust this
>>> time I hope Ocropus Project will support for Indic lang(Indian languages)
>>> this time.
>>> With warmest regards,
>>> sriranga(81+yrs)
>>>
>>> On Wednesday, December 17, 2014 3:56:52 AM UTC+5:30, Tom wrote:
>>>> I joined Google this year. Google permits me to spend time on the
>>>> OCRopus project and contribute. As part of this, I moved the project to
>>>> Github, because it's easier to maintain there.
>>>>
>>>> I just pushed out a new update of ocropy. This includes mainly
>>>> faster/smaller saving of models, as well as a C++ implementation of the
>>>> LSTM network. The C++ LSTM implementation is a pretty straightforward port
>>>> of the Python version and runs much faster. The C++ classes have been
>>>> wrapped as Python classes and are callable from Python. There are two new
>>>> top-level drivers, ocropus-ltrain and ocropus-lpred, for the C++
>>>> implementation. The C++ implementation appears to be numerically close to
>>>> the Python implementation and yield good recognizers when trained, but it
>>>> requires more testing.
>>>>
>>>> As before, this is research-level software with minimal documentation
>>>> (do look at the iPython Notebooks, the .ipynb files, since they contain
>>>> significant information). Feel free to contribute patches, documentation,
>>>> etc. using the usual Github mechanisms of merge requests. I'll try to
>>>> incorporate them as time permits.
>>>>
>>>> Tom
