Dear Access India Members,
During the Daisy Forum of India meeting held in Mumbai on 11th and 12th April 2008, I was given the responsibility of finding information on the status of OCR software for Indian languages. Here I present the findings of my research. I have prepared a White Paper on the subject, which I posted in PDF format on the Daisy Forum of India's mailing list 3 days ago. For the benefit and awareness of others, I am pasting its content below this message. This will also help those who had posted queries on Access India regarding this.

White Paper: OCR Software for Indian Languages
Date: July 31st, 2008

Introduction :
OCR software is available for English and other foreign languages, but what is the status of OCR software for Indian languages? During the Daisy Forum of India meeting held in Mumbai on 11th and 12th April 2008, I was given the responsibility of finding information on this. Here I present the findings of my research.

Definitions :
OCR: Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of images of handwritten, typewritten or printed text (usually captured by a scanner) into machine-editable text.
OCR Software: OCR software converts paper documents into electronic data, so that you can handle the information (electronic text) in your computer system.
Indian Languages: The Indian Constitution recognizes Hindi in Devanāgarī script as the official language of the central government. The Constitution of India recognizes 22 languages, spoken in different parts of the country.
(Source for all definitions: Wikipedia)

Findings :
Research on the web turned up one workshop / seminar, the "Brainstorming Workshop on OCR for Indian Languages", organized by the Rediff Centre for Indian Language Content Management on 16-17 March, 2007, at Hotel Regalis, Mysore. Reference:
Link: http://www.isim.ac.in/RCILCM/index.htm

Further research, including queries on Access India (mailing group for the blind) and contact with NAB Karnataka for more information on this theme, did not turn up anything significant.

Visit by Mr. Venki, rediff.com Technical Head :
During a meeting with Mr. Venki at the XRCVC in June 2008, some more information about the conference was obtained, as Mr. Venki was himself one of the members of the organizing team from rediff. He made the following observation: "Overall the conference was good. Speakers had shared new ideas on developing Indian OCR." However, on following up further, it seems that no significant progress has been made since the conference.

Chennai "Print Access" Seminar Findings :
Our XRCVC team member Neha learned about many technological developments at the "Print Access" conference held in Chennai on April 19th, 2008. She shared a lot of information, contacts and links, e.g. the Acharya website (http://acharya.iitm.ac.in), TTS translator in 22 languages, Ravi TTS for Telugu, C-DAC software such as Mantra, Shruti Drishti and Shrut Lekhan, and a very important lead on Indian OCR software developed by C-DAC Pune.

Visit to C-DAC Pune :
On May 14th and 15th, the XRCVC team visited C-DAC Pune. The visit was very fruitful. C-DAC has a fully developed, off-the-shelf OCR product for Hindi in Devanagari script, named CHITRANKAN, developed by the GIST Development Team, C-DAC, Pune, Maharashtra. They demonstrated the product, and the results were very good. CHITRANKAN is in commercial use at 2-3 organizations in Pune.

Other C-DAC resources :
C-DAC's OCR software includes CHITRANKAN for Hindi, CHITRAKSHARIKA for Marathi and NAYANA for Malayalam.

About NAYANA :
Source: http://www.malayalamresourcecentre.org/Mrc/products/nayana.html
NAYANA is a product that enables the user to convert printed Malayalam documents to editable computer files.
This system is very simple to use and requires no prior expertise.

FEATURES
- NAYANA processes all types of printed Malayalam documents.
- Supports TIFF and BMP image formats.
- Supports document images with a resolution of 300 dpi and above.
- Detection and correction of document skew of -5° to +5°.
- The output document can be stored in both ISCII and ISFOC form.
- The output document can be saved in TXT, RTF, HTML or ACI file formats.
- User-friendly interface.
- Recognition speed of 50 characters/sec.
- Conversion of printed documents to editable text.
- Optical Character Recognition combined with Text-To-Speech technology can be used for a text reading system.

EXPANDABILITY
- A layout analyzer can be added to the system to reproduce the input document in its original layout.
- Can be expanded to cater to handwritten and old documents.

Also noted: linguistic resource generation tools such as Prabandhika and Vishleshika.
Source: http://delnet.nic.in/news-naclin-report.htm

About CHITRANKAN :
Source: http://www.cdac.in/html/gist/products/chitra.asp
CHITRANKAN is the first OCR (Optical Character Recognition) system for Indian Languages. The OCR process involves:
• Conversion of printed matter into an electronic image: the printed matter can be converted into an image using a scanner or a digital camera.
• Electronic image processing: this involves identifying text information by analyzing the image for noise and skew. Once text information is available, another algorithm reads and recognizes the printed matter.
• Storing the extracted text information as electronic data: the recognized input is converted to a standard format which can be opened in any word processing application, allowing the user to edit the text data.

Chitrankan archives Indian Language content in electronic form through OCR. It enables the user to take a book, magazine or printed text in an Indian Language, feed it directly into an electronic computer file, and edit the file using a word processor.
Once the data is in the form of electronic text, it can be searched, sorted and indexed. Chitrankan saves the user the effort of typing an entire document. Chitrankan scans a document to screen by recognizing the text and other images as objects. These scanned images are flawless and can be stored or printed time and again. Exceedingly user-friendly, with features that can edit, move, resize or duplicate the scanned document, Chitrankan also provides a spell check facility. The potential of Chitrankan is enormous, as it enables users to harness the power of computers to access printed documents in Indian Languages.

Software Advantage:
• Recognizes Hindi and Marathi languages along with embedded English text.
• Skew detection and correction for input images up to ±15°
• Grabs images directly from the scanner for processing
• Automatic text and picture region detection
• Supports all TWAIN compatible scanners and digital cameras
• Supports 256 grayscale/color, .bmp/.tiff images scanned at 300 dpi as input image for recognition
• Ideal for font sizes between 10 pt. and 36 pt., and all popular fonts.
• Saves scanned/modified images as .BMP files
• Saves recognized text in ISCII format or exports it as .RTF for editing using the GIST range of software
• Uses advanced DSP (Digital Signal Processing) algorithms to remove "Noise" and "Back Page Reflection"
• Enables printing of both the input image and the recognized text.
• Provided with inbuilt Flip, Rotate and Negate options for the input image

User Advantage:
• Allows deletion of associated pictures from the image by using the ERASE option
• Provides painting tools to join the breaks in the characters to get good results
• Allows OCR to be applied on an image rotated by 180° or flipped
• Applies OCR to an image having text in reverse by using the INVERT option
• Provides an inbuilt spell checking facility
• Provides editing tools like cut, copy, paste, find and replace for use on recognized text

System Requirements:
• Minimum Configuration: Pentium II with 64 MB RAM; virtual memory requirement 300 MB (swap file space on hard disk)
• Recommended Configuration: Pentium III with 128 MB RAM and above; virtual memory requirement 400 MB
• Operating Systems Supported: Windows NT ver. 4.0 with Service Pack 6.0 and above / Windows 9X and above, Windows 2000 and Windows XP.

Price:
Single user license for CHITRANKAN: Rs. 10,000/-

Contacts:
Channel partner list: http://www.cdac.in/html/gist/ch_part.asp
The CHITRANKAN demo can be downloaded from http://www.cdac.in/html/gist/down/chtri_d.asp (file size: 45 MB).

Experimenting with CHITRANKAN at the XRCVC - findings :
At the XRCVC, the demo of CHITRANKAN was installed and put through tests. Web pages from the Rajya Sabha website, which are typeset in the Yogesh font, were used for testing Hindi in Devanagari script. Accuracy was good, at approximately 70%. This can be improved by using the font training mode. Additional documents in Hindi and Marathi were also tested. Results for those were fair, at approximately 40% accuracy. The font training module, however, can increase the accuracy. The software supports the Yogesh Hindi font by default. More fonts can be added by training the OCR using the font recognition module. Font training modules enable the user to train the software to decipher documents in particular fonts.
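The accuracy figures above are approximate, and this paper does not say how they were measured. For anyone who wants to repeat the tests and report comparable numbers to C-DAC, one common approach is character-level accuracy based on edit distance between a hand-typed reference text and the OCR output. The following is a minimal sketch in Python under that assumption; the function names edit_distance and char_accuracy are illustrative only and are not part of CHITRANKAN or its APIs.

```python
def edit_distance(reference: str, recognized: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and
    substitutions needed to turn one string into the other."""
    previous = list(range(len(recognized) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, rec_char in enumerate(recognized, start=1):
            cost = 0 if ref_char == rec_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from OCR output
                current[j - 1] + 1,      # extra character in OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def char_accuracy(reference: str, recognized: str) -> float:
    """Character accuracy in percent, measured against the reference.

    Counts Unicode code points, so a Devanagari vowel sign (matra)
    counts as its own character. Floored at 0 for very noisy output.
    """
    if not reference:
        return 100.0 if not recognized else 0.0
    errors = edit_distance(reference, recognized)
    return max(0.0, 100.0 * (len(reference) - errors) / len(reference))


if __name__ == "__main__":
    typed = "भारत सरकार"   # hand-typed reference, 10 code points
    ocr = "भारत सरकर"      # hypothetical OCR output with one matra dropped
    print(round(char_accuracy(typed, ocr), 1))  # 90.0
```

Measuring the same page before and after font training in this way would show how much the training actually helps.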
To make the software even more useful, CHITRANKAN incorporates a set of application program interfaces (APIs) which give software developers the flexibility to build features from CHITRANKAN into their own software applications. You can save recognized output in RTF format and choose either Hindi or Marathi as the recognition language.

Screen reader access with Chitrankan :
The graphical user interface of Chitrankan is very friendly, with menus and shortcuts available for all important options. The workspace area has three main windows: the input image window, the recognized output text window and the digitized image window. However, the screen reader (SAFA) is not able to read the recognized text.

Conclusion :
Anyone can contribute to the development of Indian language OCR by downloading and testing the software and giving feedback to C-DAC, which would help in product enhancement. Those who are familiar with Malayalam would do well to test the NAYANA OCR software.

Prashant Naik
The Xavier's Resource Centre for the Visually Challenged (XRCVC)
St. Xavier's College, Mumbai.
----
VISION WITHOUT ACTION IS MERELY A DREAM, ACTION WITHOUT VISION JUST PASSES THE TIME, VISION WITH ACTION CAN CHANGE THE WORLD.