On Fri, Apr 19, 2013 at 11:33 AM, Alok Kothari <[email protected]> wrote:
> I am Alok Kothari. I am interested in applying to GSoC 2013 and to work with
> Ankur.

Awesome!

> Background: I graduated from IIT Kharagpur in 2009 and have been involved in
> research in IR/NLP and Machine Learning for nearly 2 years.

Would it be possible to provide links to any papers, presentations, or
code that you have published?

> I was interested in the project on 'Improving information retrieval methods
> for OCR data sets consisting of Indic scripts'
>
> 1. I was wondering whether I could have a look at or have some indication to
> the quality of files available. This will give me some idea about the kinds
> of error

The project idea requires the interested candidate to propose, within
the scope of the project, the kinds of errors the initial
iteration/release will handle.

> 2. In the project can I assume to have access to some 'clean' corpus so that
> I can use that towards correcting errors in digitised corpus. for e.g. I
> could learn n-grams from the known 'correct' text to improve possible errors
> in OCR text.

There are some ways to obtain such a corpus. The FIRE team at ISI
Kolkata has released a set of files which can be used as a corpus
should you so want. Additionally, introducing errors into a document is
a reasonably active area of discussion. I'm certain you are familiar
with the methods. Continuing from your next email, the ability to use
EMILIE as a training/seed corpus depends on the license under which it
is made available.

> 3. Does the IR system have to be implemented on top of Lucene (or other open
> source software) or can be completely stand alone.

I was hoping that we would be able to utilize ElasticSearch or
similar. Lucene is an option too.

-- 
sankarshan mukhopadhyay
<https://twitter.com/#!/sankarshan>
_______________________________________________
Project-ideas mailing list
[email protected]
http://lists.ankur.org.in/listinfo.cgi/project-ideas-ankur.org.in
