Regarding point 2: what I meant was, should I look at available corpora? A fairly large corpus for Indian languages is EMILLE (http://www.lancs.ac.uk/fass/projects/corpus/emille/). Would I be able to use that (and others that are available)?
On Fri, Apr 19, 2013 at 11:33 AM, Alok Kothari <[email protected]> wrote:

> Hello
>
> I am Alok Kothari. I am interested in applying to GSoC 2013 and to
> working with Ankur.
>
> Background: I graduated from IIT Kharagpur in 2009 and have been
> involved in research in IR/NLP and machine learning for nearly two
> years.
>
> I was interested in the project on 'Improving information retrieval
> methods for OCR data sets consisting of Indic scripts'.
>
> 1. I was wondering whether I could have a look at, or get some
> indication of, the quality of the files available. This would give me
> some idea of the kinds of errors.
>
> 2. In the project, can I assume access to some 'clean' corpus that I
> can use towards correcting errors in the digitised corpus? For
> example, I could learn n-grams from the known 'correct' text to fix
> likely errors in the OCR text. There are some ways to obtain such a
> corpus.
>
> 3. Does the IR system have to be implemented on top of Lucene (or
> other open-source software), or can it be completely standalone?
>
> Thank you!
>
> Best,
> Alok
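To illustrate the n-gram idea from point 2, here is a minimal sketch: build character n-gram counts from a known-clean corpus, then prefer the candidate correction of an OCR token that the counts score highest. The function names and toy word list are purely illustrative, not part of any existing system, and a real implementation would use an Indic-script corpus and a proper candidate generator.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-grams of a word, with ^/$ as boundary markers."""
    padded = f"^{text}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def build_model(clean_words, n=3):
    """Count character n-grams over a known-clean word list."""
    counts = Counter()
    for word in clean_words:
        counts.update(char_ngrams(word, n))
    return counts

def score(word, model, n=3):
    """Sum of n-gram frequencies; higher means more corpus-like."""
    return sum(model[g] for g in char_ngrams(word, n))

def correct(candidates, model):
    """Pick the candidate spelling the n-gram model prefers."""
    return max(candidates, key=lambda w: score(w, model))

# Toy 'clean' corpus; an OCR confusion like 'l' -> '1' then loses.
clean = ["retrieval", "retrieve", "return"]
model = build_model(clean)
print(correct(["retrieva1", "retrieval"], model))  # prints "retrieval"
```

The same scheme carries over to Devanagari or Bengali text, where the candidates would come from known OCR confusion pairs for those scripts.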
_______________________________________________
Project-ideas mailing list
[email protected]
http://lists.ankur.org.in/listinfo.cgi/project-ideas-ankur.org.in
