Regarding point 2: what I meant was, should I look at available corpora? A fairly large corpus for Indian languages is EMILLE (http://www.lancs.ac.uk/fass/projects/corpus/emille/). Would I be able to use that (and others that are available)?
On Fri, Apr 19, 2013 at 11:33 AM, Alok Kothari <[email protected]> wrote:

> Hello
>
> I am Alok Kothari. I am interested in applying to GSoC 2013 and to
> working with Ankur.
>
> Background: I graduated from IIT Kharagpur in 2009 and have been
> involved in research in IR/NLP and machine learning for nearly two
> years.
>
> I was interested in the project on 'Improving information retrieval
> methods for OCR data sets consisting of Indic scripts'.
>
> 1. I was wondering whether I could have a look at, or get some
> indication of, the quality of the files available. This would give me
> some idea of the kinds of errors.
>
> 2. In the project, can I assume access to some 'clean' corpus that I
> can use towards correcting errors in the digitised corpus? For
> example, I could learn n-grams from the known 'correct' text to fix
> likely errors in the OCR text. There are some ways to obtain such a
> corpus.
>
> 3. Does the IR system have to be implemented on top of Lucene (or
> other open-source software), or can it be completely standalone?
>
> Thank you!
>
> Best,
> Alok
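To illustrate the n-gram idea from point 2, here is a minimal sketch: build character n-gram counts from a known-clean corpus, then prefer the candidate correction of an OCR token that the counts score highest. The function names and toy word list are purely illustrative, not part of any existing system, and a real implementation would use an Indic-script corpus and a proper candidate generator.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-grams of a word, with ^/$ as boundary markers."""
    padded = f"^{text}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def build_model(clean_words, n=3):
    """Count character n-grams over a known-clean word list."""
    counts = Counter()
    for word in clean_words:
        counts.update(char_ngrams(word, n))
    return counts

def score(word, model, n=3):
    """Sum of n-gram frequencies; higher means more corpus-like."""
    return sum(model[g] for g in char_ngrams(word, n))

def correct(candidates, model):
    """Pick the candidate spelling the n-gram model prefers."""
    return max(candidates, key=lambda w: score(w, model))

# Toy 'clean' corpus; an OCR confusion like 'l' -> '1' then loses.
clean = ["retrieval", "retrieve", "return"]
model = build_model(clean)
print(correct(["retrieva1", "retrieval"], model))  # prints "retrieval"
```

The same scheme carries over to Devanagari or Bengali text, where the candidates would come from known OCR confusion pairs for those scripts.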
_______________________________________________
Project-ideas mailing list
[email protected]
http://lists.ankur.org.in/listinfo.cgi/project-ideas-ankur.org.in
