Hi Sankarshan,

As we last discussed on IRC, I have been reading up on the existing tools, mainly "tesseractIndic". The project appears to be dormant, though: it has been almost a month and my membership request to the Google group is still pending. However, I like its approach of using separate Python scripts for pre-processing, and I believe the same style could be used for further improvements. The blog is really helpful.

I also read about the "banglaOCR" project developed at CRBLP, BRAC University, Bangladesh, and I am currently going through the details. Most probably I would like to build on one of the two systems, or possibly a hybrid of both.

I have some questions at this point:

1. The idea's objective states that accuracy should be improved to 98%. Do we have benchmark data for this, or shall we define our own for our purpose? I read about the FIRE

Also, M. A. Hasnat, the developer of BanglaOCR, pointed out to me that the accuracy may not be the same for all domains (e.g., newspapers, books, typewritten documents), so domain adaptability should be considered. Personally, I feel we should focus on perfecting the system for one domain first, and then look into the other domains.

I would appreciate some clarification on these points.

--
Regards,
Debajyoti Nag
http://twitter.com/aramis7d

_______________________________________________
Project-ideas mailing list
[email protected]
http://lists.ankur.org.in/listinfo.cgi/project-ideas-ankur.org.in
