Hello, I am Alok Kothari. I am interested in applying to GSoC 2013 to work with Ankur.
Background: I graduated from IIT Kharagpur in 2009 and have been involved in research in IR/NLP and Machine Learning for nearly 2 years.

I am interested in the project on 'Improving information retrieval methods for OCR data sets consisting of Indic scripts'.

1. Could I have a look at the available files, or get some indication of their quality? This would give me an idea of the kinds of errors involved.

2. In the project, can I assume access to some 'clean' corpus that I can use to correct errors in the digitised corpus? For example, I could learn n-grams from the known 'correct' text to fix likely errors in the OCR text. There are some ways to obtain such a corpus.

3. Does the IR system have to be implemented on top of Lucene (or other open source software), or can it be completely standalone?

Thank you!

Best,
Alok
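
P.S. To make the n-gram idea in my second question a little more concrete, here is a minimal sketch, not a proposed implementation. It assumes a clean text is available and that some other step has already produced candidate corrections for a suspicious OCR token. It counts character bigrams in the clean text and picks the candidate with the highest add-one smoothed score. All names (train_char_ngrams, log_prob, best_candidate) are hypothetical, and it treats each Unicode code point as one character, which is likely too crude for Indic scripts with combining marks.

```python
import math
from collections import Counter


def train_char_ngrams(clean_text, n=2):
    """Count character n-grams and their contexts in whitespace-separated clean text."""
    ngrams, contexts, alphabet = Counter(), Counter(), set()
    for word in clean_text.split():
        alphabet.update(word)
        # '^' pads the start of the word, '$' marks its end (assumed absent from the text).
        padded = "^" * (n - 1) + word + "$"
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    # The end marker is also a possible next symbol, hence the +1.
    return {"n": n, "ngrams": ngrams, "contexts": contexts, "vocab": len(alphabet) + 1}


def log_prob(word, model):
    """Add-one smoothed log-probability of a word under the character n-gram model."""
    n = model["n"]
    padded = "^" * (n - 1) + word + "$"
    total = 0.0
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        numerator = model["ngrams"][gram] + 1
        denominator = model["contexts"][gram[:-1]] + model["vocab"]
        total += math.log(numerator / denominator)
    return total


def best_candidate(candidates, model):
    """Return the candidate that looks most like the clean text."""
    return max(candidates, key=lambda w: log_prob(w, model))


# Toy usage: the English text stands in for a real clean corpus.
model = train_char_ngrams("the cat sat on the mat the hat")
choice = best_candidate(["tbe", "the"], model)
```

In this toy case the model prefers "the" over "tbe", because the bigrams "tb" and "be" never occur in the clean text. A real system would need a proper candidate generator and probably word-level context as well.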
_______________________________________________
Project-ideas mailing list
[email protected]
http://lists.ankur.org.in/listinfo.cgi/project-ideas-ankur.org.in
