What is the pipeline for training OCRopus to recognize a new script? Specifically, I'm interested in recognizing documents written in Cyrillic, Georgian, and Armenian. What changes to this pipeline are anticipated once the next version is released? I'm not looking for an in-depth walkthrough, just an idea of the general steps I will need to take.

Thanks,
Derek

On Nov 15, 2010, at 11:02, Tom wrote:

> On Oct 18, 4:18 pm, "Robert B." <[email protected]> wrote:
>> Hi all,
>>
>> Does anyone know if OCRopus is up to the challenge of recognizing a
>> page from, say, a French-English dictionary?
>
>> Such a dictionary would
>> feature two columns, italic characters, accented characters, French
>> words, and words that would not appear in any model of the language
>> (for example, a breakdown of syllables).
>>
>> How far off is OCRopus from recognizing such a page? What is the work
>> that would need to be done?
>
> Layout analysis and italics are there.
>
> For accented characters, we just spent basically the last year doing
> what's necessary to support Unicode; this prompted the move to
> Python. We're slowly getting the bugs out and will basically be
> releasing once that works.
>
> Language modeling like what you need is already fully supported
> through the use of OpenFST language models; you can create weighted
> combinations of, say, a dictionary and a syllabic model.
>
> Tom
>
> --
> You received this message because you are subscribed to the Google Groups "ocropus" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to [email protected].
> For more options, visit this group at http://groups.google.com/group/ocropus?hl=en.
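
For anyone following the thread, here is a rough sketch of what a "weighted combination of a dictionary and a syllabic model" could look like. It is plain Python, not the OpenFST or OCRopus API, and every name and number in it (the toy word lists, `dict_weight`, the hyphen as syllable separator) is an assumption made up for illustration. Costs are negative log probabilities, and the combination takes the cheaper of the two weighted branches, which is how I understand a weighted union to behave under best-path (tropical) weights.

```python
import math


def unigram_costs(counts):
    """Turn raw counts into negative log probabilities (lower cost = more likely)."""
    total = sum(counts.values())
    return {item: -math.log(n / total) for item, n in counts.items()}


# Hypothetical toy data; not taken from any real OCRopus model.
DICTIONARY = unigram_costs({"maison": 5, "chat": 3, "chien": 2})
SYLLABLES = unigram_costs({"mai": 2, "son": 2, "cha": 1, "peau": 1})


def dictionary_cost(token):
    """Cost of a whole word; infinite if the word is not in the dictionary."""
    return DICTIONARY.get(token, math.inf)


def syllable_cost(token, sep="-"):
    """Cost of a syllable breakdown such as 'mai-son'; infinite if any part is unknown."""
    return sum(SYLLABLES.get(part, math.inf) for part in token.split(sep))


def combined_cost(token, dict_weight=0.7):
    """Weighted union of the two models: pay a branch cost, then the model's own cost.

    Taking the minimum keeps only the best branch, as a best-path search would.
    """
    branches = [
        -math.log(dict_weight) + dictionary_cost(token),
        -math.log(1.0 - dict_weight) + syllable_cost(token),
    ]
    return min(branches)


if __name__ == "__main__":
    for token in ["maison", "mai-son", "xyz"]:
        print(token, combined_cost(token))
```

A token that neither model accepts gets an infinite cost, so it would be rejected outright; a real setup would presumably add some fallback for unknown words.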
