Update: Through a thread on this mailing list, I have also come to know about tesseract-indic, which seems more promising. I am looking at that as well.
Regards,
Madhura

On Sat, Apr 13, 2013 at 11:42 PM, Madhura Parikh <[email protected]> wrote:

> Hi,
> I have spent some time going through Tesseract:
>
>> The current implementation of Tesseract may have already solved this
>> challenge.
>
> (shirorekha, etc.)
> There is a publication that tries to achieve this for Hindi, but they
> achieve reasonable accuracy only when they assume a predefined font
> style. Otherwise the reported accuracy is ~40%, which is of no use.
> Again, Tesseract uses the character segmentation approach. I believe
> that if this were replaced by the keyword spotting approach, in which
> we bypass identifying individual characters and instead try to
> identify word images, the accuracy could be considerably improved. A
> post-processing step can be added in which we adapt the results based
> on feedback from words that are recognized as the correct guess with
> high probability. A language model that uses grammar/n-gram statistics
> to improve the accuracy may be used with the clustering...
> I found the first few slides in this ppt
> <http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=10&cad=rja&ved=0CGwQFjAJ&url=http%3A%2F%2Fwww.cs.bgu.ac.il%2F~klara%2FATCS111%2FWordSpotDTW_Yaakov_Roee.pptx&ei=Xm5pUY7XH4vOrQf3g4GIBQ&usg=AFQjCNECi20vLj434RW6n8-dgwsUUtNo2Q&sig2=lQ_KMlKO7dbQcowcWGs33w>
> a good guide to this approach.
>
>> The interesting part (for me at least) is what happens when the
>> image-under-test is a fragment. For example, if the digitized document
>> is of a scroll that is damaged, what would it take for an IR system to
>> be able to reconstruct the word/phrase/image?
>
> Of course, this would be the true challenge for any OCR system. But if
> the approach based on clustering images (above) is used, would the
> chances be better than with the conventional OCR approach?
>
>>> 3. Existing software: Currently Parichit is an available open-source
>>> OCR for some Indian languages. But it still has much to accomplish.
>>> A Web OCR has been developed by TDIL and there is also Chitrankan by
>>> C-DAC, but neither is open source. So several opportunities exist
>>> for improving the scenario with respect to IR.
>>
>> Could you check how Tesseract plays out in contrast to the above?
>
> Parichit seems to be at a very nascent stage compared to Tesseract,
> but it offers some training datasets in a handful of Indic languages.
> There is also OCRopus, which has evolved off Tesseract and integrates
> Python and machine learning into OCR. However, it currently appears to
> be less efficient than Tesseract.
>
> These are some very cursory observations. I still need to explore
> Tesseract more thoroughly and get back.
>
> Madhura Parikh
> [email protected]
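
To make the word-spotting idea in the quoted message a little more concrete, here is a minimal sketch of matching whole word images with dynamic time warping (DTW) instead of segmenting characters. It is only an illustration under stated assumptions: word images are assumed to be already cropped and binarized (ink = 1), the three per-column features are a simple choice made for this example, and the function names (column_features, dtw_distance, spot_word) are hypothetical. None of this is taken from Tesseract or from the linked slides.

```python
import numpy as np


def column_features(word_img):
    """Describe a binary word image (2-D array, ink = 1) as a sequence of
    per-column vectors: ink fraction, top ink row, bottom ink row.
    Columns without ink get zeros for both row positions (a convention
    chosen for this sketch)."""
    img = np.asarray(word_img, dtype=float)
    height, width = img.shape
    feats = np.zeros((width, 3))
    for x in range(width):
        col = img[:, x]
        ink_rows = np.nonzero(col)[0]
        feats[x, 0] = col.sum() / height
        if ink_rows.size:
            feats[x, 1] = ink_rows[0] / height
            feats[x, 2] = ink_rows[-1] / height
    return feats


def dtw_distance(a, b):
    """Classic DTW between two feature sequences, normalized by the sum of
    their lengths so that words of different widths are comparable."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # skip a column of a
                                 cost[i, j - 1],      # skip a column of b
                                 cost[i - 1, j - 1])  # match both columns
    return cost[n, m] / (n + m)


def spot_word(query_img, candidate_imgs):
    """Rank candidate word images by DTW distance to the query.
    Returns a list of (distance, candidate_index), best match first."""
    q = column_features(query_img)
    scored = [(dtw_distance(q, column_features(c)), idx)
              for idx, c in enumerate(candidate_imgs)]
    return sorted(scored)
```

Because DTW lets one column align with several, a horizontally stretched copy of a word still matches its original closely, which is the property that makes this kind of matching attractive when fonts or scan widths vary. The nested loops are quadratic in word width, so a real system would likely need pruning or a faster index before comparing against a large collection.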
_______________________________________________
Project-ideas mailing list
[email protected]
http://lists.ankur.org.in/listinfo.cgi/project-ideas-ankur.org.in
