I have found several research papers that describe various approaches to IR (information retrieval) from OCR'd Indic texts.
1. The first issue when we attempt IR from Indic texts is how the user query should be represented. The most common way is to use a standardized transliteration scheme like ISCII; notably, a major project, the Digital Library of India <http://www.dli.ernet.in/>, uses the Om <http://www.cs.cmu.edu/~madhavi/Om/> transliteration scheme.

2. Step one may well be the least of our concerns, given that several challenges need to be overcome at the OCR end itself. IR from OCR'd documents typically takes one of two approaches:

(1) Recognition-based approach: first perform layout recognition of the document, then character segmentation, then a post-processing step that tries to correct spelling errors, etc. While this works well for English texts, it is not suitable for Indic documents (no reliable layout-recognition algorithm, the shirorekha, a large number of similar-looking characters, etc.).

(2) Recognition-free approach: rather than trying to obtain text from the document image, we skip the OCR recognition step entirely and match word image against word image. The user query is first converted to a word image, and the word images in the document that are similar to it are retrieved. With some improvements, this approach is currently best suited to Indic scripts. In practice we use image features like Gabor features to find the best-matching image of the query word, with algorithms like dynamic time warping (DTW). A more efficient algorithm using clustering with locality-sensitive hashing (LSH) may also be used.

3. Existing software: Parichit <http://code.google.com/p/parichit/> is currently an available open-source OCR for some Indian languages, but it still has much to accomplish.
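As an aside on point 1, here is a toy sketch of query transliteration: a Devanagari query is normalized into a Roman-letter key via a lookup table. The tiny table below is invented purely for illustration; it is NOT the actual Om or ISCII mapping.

```python
# Toy transliteration: character-by-character table lookup.
# This table is a made-up subset for illustration, not a real scheme.
TOY_MAP = {
    "अ": "a", "आ": "aa", "इ": "i",
    "क": "ka", "ख": "kha", "ग": "ga",
    "म": "ma", "र": "ra", "ल": "la",
}

def transliterate(text):
    """Character-by-character lookup; unknown characters pass through."""
    return "".join(TOY_MAP.get(ch, ch) for ch in text)

print(transliterate("कमल"))  # -> kamala
```

A real system would of course handle matras, conjuncts, and the full code chart, which is exactly what schemes like Om standardize.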
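To make point 2 concrete, here is a minimal sketch of recognition-free matching: each word image is reduced to a vertical projection profile (ink pixels per column), and profiles are compared with DTW. The binary "images" and the profile feature are toy stand-ins; a real system would use richer features such as Gabor responses.

```python
# Recognition-free word matching, sketched: compare word images via
# their column-wise ink profiles using dynamic time warping (DTW),
# which tolerates the horizontal stretching of different renderings.

def column_profile(image):
    """Ink-pixel count per column of a binary word image (list of rows)."""
    return [sum(row[c] for row in image) for c in range(len(image[0]))]

def dtw_distance(a, b):
    """Classic DTW between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def rank_matches(query_img, candidate_imgs):
    """Return candidate indices sorted by DTW distance to the query."""
    q = column_profile(query_img)
    scores = [(dtw_distance(q, column_profile(c)), i)
              for i, c in enumerate(candidate_imgs)]
    return [i for _, i in sorted(scores)]

# Toy binary word images (rows of 0/1 pixels); purely illustrative.
query = [[0, 1, 1, 0, 0, 1],
         [0, 1, 1, 0, 0, 1]]
wider = [[0, 1, 1, 1, 0, 0, 1, 1],   # same word rendered wider
         [0, 1, 1, 1, 0, 0, 1, 1]]
other = [[1, 0, 0, 1, 1, 0],
         [1, 0, 0, 1, 1, 0]]

print(rank_matches(query, [other, wider]))  # -> [1, 0]: the wider rendering ranks first
```

DTW is what lets the same word match despite width variation, which rigid column-by-column comparison cannot do.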
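And a rough sketch of the LSH speed-up mentioned in point 2: hash fixed-length feature vectors with random hyperplanes so that near-duplicate word images tend to share a bucket, then run the expensive per-pair comparison (e.g. DTW) only within the query's bucket instead of against every word in the collection. The vectors and parameters here are illustrative, not from any of the cited papers.

```python
import random

# Random-hyperplane LSH over fixed-length word-image feature vectors:
# each signature bit records which side of a random hyperplane the
# vector lies on, so nearby vectors tend to get identical signatures.

def make_hyperplanes(dim, bits, seed=0):
    """One random Gaussian hyperplane per signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]

def lsh_signature(vec, planes):
    """One bit per hyperplane: the side of the plane the vector lies on."""
    return tuple(int(sum(p * v for p, v in zip(plane, vec)) >= 0)
                 for plane in planes)

def bucket_index(vectors, planes):
    """Group vector indices by signature."""
    buckets = {}
    for i, vec in enumerate(vectors):
        buckets.setdefault(lsh_signature(vec, planes), []).append(i)
    return buckets

# Toy 4-dimensional feature vectors (e.g. downsampled column profiles).
vectors = [
    [0.0, 2.0, 2.0, 0.0],   # query-like word
    [0.0, 2.1, 1.9, 0.1],   # near-duplicate: likely the same bucket
    [2.0, 0.0, 0.0, 2.0],   # different word
]
planes = make_hyperplanes(dim=4, bits=8, seed=0)
buckets = bucket_index(vectors, planes)
# Only candidates sharing the query's bucket would be compared with DTW.
candidates = buckets[lsh_signature(vectors[0], planes)]
```

The point is the complexity shift: instead of one DTW run per document word, we pay a cheap hash per word and reserve DTW for the handful of colliding candidates.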
A Web OCR <http://tdil-dc.in/index.php?option=com_vertical&parentid=77> has been developed by TDIL, and there is also Chitrankan <http://www.cdac.in/html/press/archives/atjp02/prs_rl114.aspx> by C-DAC, but neither is open source. So several opportunities exist for improving the situation with respect to IR.

4. Data and resources: A very good source of training data for our purposes is the Information Retrieval Society of India and FIRE, which aims to do for Indian languages what TREC does. It has very good datasets <http://www.isical.ac.in/~fire/data.html> available that may be used to evaluate our baseline model.

5. References:

1. Manmatha, R., and C. V. Jawahar. "Challenges in the Recognition and Searching of Printed Books in Indian Languages and Scripts." *Multimedia Information Extraction and Digital Heritage Preservation* 10 (2010): 119.
2. Meshesha, Million, and C. V. Jawahar. "Matching word images for content-based retrieval from printed document images." *International Journal of Document Analysis and Recognition (IJDAR)* 11.1 (2008): 29-38.
3. Govindaraju, Venu, and Srirangaraj Setlur. *Guide to OCR for Indic Scripts*. Springer, 2009.
4. Balajapally, Prashanth, et al. "Multilingual book reader: Transliteration, word-to-word translation and full-text translation." *Proceedings of the 13th Biennial Conference and Exhibition of the Victorian Association for Library Automation, Melbourne, Feb. 2006*.

Most of the work I refer to here is available via a simple query on Google Scholar: <http://scholar.google.co.in/scholar?start=10&q=retrieval+from+ocr+of+Indian+texts&hl=en&as_sdt=0,5>

I apologise for the length of the mail. I would be glad to know of further suggestions and guidance.

Regards,
Madhura Parikh
[email protected]
_______________________________________________
Project-ideas mailing list
[email protected]
http://lists.ankur.org.in/listinfo.cgi/project-ideas-ankur.org.in
