On Sun, Apr 21, 2013 at 2:47 AM, Aarti K. Dwivedi <[email protected]> wrote:
> Hi,
>
> I am an applicant for GSoC 2013. I am enthusiastic about working on
> "Improving information retrieval methods for OCR data sets consisting of
> Indic scripts."
> Before posting the proposal, I wanted to discuss what I have framed on the
> basis of my understanding of the project idea.
>
> Synopsis:
>
> 1. My first step would be familiarizing myself with the current methods and
> algorithms that are used in retrieval of information from digitized text
> and also with their shortcomings.
>
> 2. Figure out the reasons for the shortcomings and the degradation of text.
>
> 3. Propose and implement a retrieval system that does not lead to
> degradation, i.e., improve the text processing.
>
> 4. Improve the existing search algorithms by weeding out inefficiencies and
> propose additions that increase efficiency.
>
>
> Implementation details of the project:
>
> 1. Test the current methods of retrieval of information from digitized text
> to find out specific problems and areas of shortcomings. File these as
> issues. The shortcomings will be described in terms of technical details of
> where the search falls short.
>
> 2. Remove character-level errors and make the search independent of
> character-level errors.
>
> 3. Develop a system to classify documents according to tags. Adding
> tags to the documents would help in narrowing down the search.
>
> 4. Reduce the error by predicting words when characters are perceived to be
> inaccurate.
>
> 5. Continue improving the search implementation as errors surface.
>
>
> Phases/Milestones with dates:
>
> 1. June 17 - June 27: Filter out errors in specific terms and find out their
> causes.
>
> 2. June 27 - July 7: Make the retrieval independent of character level, i.e.,
> improve the recognition of words as a whole.
>
> 3. July 7 - July 24: Work around other problems in the current methods of
> standardized and structured text processing.
>
> 4. July 24 - August 1: Implement the tagging system. (The bot picks from a list
> of predefined tags and assigns them to the documents on the basis of the
> first few pages, thus reducing the amount of full-text search that needs to
> be done.)
>
> 5. August 1 - August 12: Implement information retrieval by text
> summarization.
>
> 6. August 12 - August 22: Implement search on the basis of text
> summarization.
>
> 7. August 22 - September 2: Implement the error correction methods to improve
> performance.
>
> 8. September 2 - September 16: Find loopholes in the implemented system
> and improve upon them.
>
>
> Is there something that I have missed in understanding the project? I would
> be happy to receive any clarifications on the project.
>

Hi Aarti,

Thanks for your introduction. There are many threads on the mailing list
on this same subject:

http://lists.ankur.org.in/pipermail/project-ideas-ankur.org.in/2013-April/author.html

Please go through them, and ask any questions that they do not
already answer.

Regards,

--
Bhavani Shankar
Ubuntu Developer | www.ubuntu.com
https://launchpad.net/~bhavi
