Actually, it takes surprisingly little data: after a few thousand lines of 
text, you already get pretty readable results for Latin text.  

You can train on simulated data as well with good results: a tool for 
generating training data artificially is included (but probably requires a 
bit of adaptation for other scripts).
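
As a rough illustration only (this is not the included tool), the text side of 
simulated data can be sketched as below: sample ground-truth lines from a word 
list, then render each one to a line image with whatever font renderer you 
use. The function name, its parameters, and the tiny vocabulary are all 
hypothetical; for another script you would draw words from a real corpus in 
that script.

```python
import random


def sample_lines(words, n_lines, min_words=3, max_words=8, seed=0):
    """Return n_lines synthetic text lines built from a word list."""
    rng = random.Random(seed)  # fixed seed so the generated set is reproducible
    lines = []
    for _ in range(n_lines):
        # Vary the line length so the recognizer sees short and long lines.
        k = rng.randint(min_words, max_words)
        lines.append(" ".join(rng.choice(words) for _ in range(k)))
    return lines


# Hypothetical vocabulary; replace with words from the target script.
words = ["ocr", "line", "text", "train", "model", "script"]
lines = sample_lines(words, n_lines=5)
for line in lines:
    print(line)
```

Each sampled line would then be paired with its rendered image as one 
training example; the rendering step is deliberately left out here.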

Tom

On Tuesday, December 23, 2014 6:40:17 PM UTC-8, Shibamouli Lahiri wrote:
>
> Hi Tom,
>
> Thanks much for the update. I'm new to Ocropus, and I had a question on 
> running rtrain.
>
> Do you know (or have an estimate of) how many lines of text the program 
> needs for training before it starts giving reasonable results? I'm 
> asking because, since it's neural-network based, I'd guess that it 
> would take more than a few thousand lines.
>
> More details: I'm working on gathering labeled data for Bengali (Bangla) 
> OCR, and I need an estimate of how many lines I'll have to transcribe to 
> get started.
>
> Regards,
> Shibamouli
>
>
>
> On Wednesday, December 17, 2014 2:40:11 PM UTC-5, Tom wrote:
>>
>> With the new recognizer, it should be pretty easy to train. We've trained 
>> it for other scripts purely from generated data and gotten pretty good 
>> results.
>>
>> I'll try to create some more documentation and some simpler training 
>> scripts.
>>
>> Tom
>>
>> On Wednesday, December 17, 2014 5:36:34 AM UTC-8, 81+ yrsold wrote:
>>>
>>> Tom,
>>> I am really happy that you have resumed the OCRopus project. I hope 
>>> that this time the OCRopus project will support Indic (Indian) 
>>> languages.
>>> With warmest regards,
>>> sriranga(81+yrs) 
>>>
>>> On Wednesday, December 17, 2014 3:56:52 AM UTC+5:30, Tom wrote:
>>>>
>>>> I joined Google this year. Google permits me to spend time on the 
>>>> OCRopus project and contribute. As part of this, I moved the project to 
>>>> GitHub, because it's easier to maintain there.
>>>>
>>>> I just pushed out a new update of ocropy. This includes mainly 
>>>> faster/smaller saving of models, as well as a C++ implementation of the 
>>>> LSTM network. The C++ LSTM implementation is a pretty straightforward port 
>>>> of the Python version and runs much faster. The C++ classes have been 
>>>> wrapped as Python classes and are callable from Python. There are two new 
>>>> top-level drivers, ocropus-ltrain and ocropus-lpred, for the C++ 
>>>> implementation. The C++ implementation appears to be numerically close to 
>>>> the Python implementation and yields good recognizers when trained, but it 
>>>> requires more testing.
>>>>
>>>> As before, this is research-level software with minimal documentation 
>>>> (do look at the IPython notebooks, the .ipynb files, since they contain 
>>>> significant information). Feel free to contribute patches, documentation, 
>>>> etc. using the usual GitHub mechanism of pull requests. I'll try to 
>>>> incorporate them as time permits.
>>>>
>>>> Tom
>>>>
>>>
