Hi,

I have spent some time going through Tesseract:

The current implementation of Tesseract may have already solved this challenge (shirorekha, etc.). There is a publication that tries to achieve this for Hindi, but they reach reasonable accuracy only when they assume a predefined font style. Otherwise the reported accuracy is ~40%, which is too low to be useful.

Also, Tesseract uses the character segmentation approach. I believe that if this were replaced by the keyword spotting approach, in which we bypass identifying individual characters and instead try to identify whole word images, the accuracy could improve considerably. A post-processing step could be added in which we adapt the results based on feedback from words that were recognized as the correct guess with high probability. A language model that uses grammar or n-gram statistics to improve the accuracy may be used with the clustering...

I found the first few slides in this ppt <http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=10&cad=rja&ved=0CGwQFjAJ&url=http%3A%2F%2Fwww.cs.bgu.ac.il%2F~klara%2FATCS111%2FWordSpotDTW_Yaakov_Roee.pptx&ei=Xm5pUY7XH4vOrQf3g4GIBQ&usg=AFQjCNECi20vLj434RW6n8-dgwsUUtNo2Q&sig2=lQ_KMlKO7dbQcowcWGs33w> a good guide to this approach.
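To make the keyword spotting idea above a bit more concrete, here is a rough sketch of matching whole word images with dynamic time warping (DTW), the technique the linked slides are named after. It is only an illustration under my own assumptions: word images are already segmented and binarized (1 = ink), each column is reduced to a single feature (the fraction of ink pixels), and the names column_profile, dtw_distance and spot are made up for this example. A real system would likely use richer per-column features.

```python
from typing import Dict, List, Sequence

# Assumption: a word image is a list of rows of 0/1 pixels, 1 = ink.
Image = Sequence[Sequence[int]]


def column_profile(img: Image) -> List[float]:
    """Fraction of ink pixels in each column (one feature per column)."""
    height = len(img)
    width = len(img[0])
    return [sum(img[r][c] for r in range(height)) / height for c in range(width)]


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic time warping cost between two feature sequences.

    Uses |x - y| as the local cost and divides by len(a) + len(b)
    so that words of different widths stay comparable.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # stretch b
                cost[i][j - 1],      # stretch a
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m] / (n + m)


def spot(query: Image, candidates: Dict[str, Image]) -> List[str]:
    """Return candidate labels ordered from best to worst match."""
    q = column_profile(query)
    return sorted(
        candidates,
        key=lambda label: dtw_distance(q, column_profile(candidates[label])),
    )


# Toy example: a vertical stroke, the same stroke rendered wider,
# and a stroke on the opposite side of the word box.
stroke = [[1, 0, 0]] * 3
stroke_wide = [[1, 1, 0, 0, 0]] * 3
other = [[0, 0, 1]] * 3

ranking = spot(stroke, {"other": other, "same_word_wider": stroke_wide})
print(ranking)
```

Because DTW is allowed to stretch either sequence, the wider rendering of the same stroke aligns at no cost, which is the property that makes this attractive when the font or glyph width varies. Words matched with a low distance could then feed the post-processing and clustering steps mentioned above.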
> The interesting part (for me at least) is what happens when the
> image-under-test is a fragment. For example, if the digitized document
> is of a scroll that is damaged, what would it take for an IR system to
> be able to reconstruct the word/phrase/image?

Of course, this would be the true challenge for any OCR system. But if the approach based on clustering images (above) is used, would the chances be better than with the conventional OCR approach?

> > 3. Existing software: Currently Parichit is an available open-source OCR for
> > some Indian languages. But it still has much to accomplish. A Web OCR has
> > been developed by TDIL and there is also Chitrankan by C-DAC, but they both
> > are not open source. So several opportunities exist for improving the
> > scenario wrt IR.
>
> Could you check how Tesseract plays out in contrast to the above?

Parichit seems to be at a very nascent stage compared to Tesseract, but it offers some training datasets in a handful of Indic languages. There is also OCRopus, which has evolved off Tesseract and integrates Python and machine learning into OCR. However, it currently appears to be less efficient than Tesseract.

These are some very cursory observations. I still need to explore Tesseract more thoroughly and get back.

Madhura Parikh
[email protected]
