I appreciate the efforts of Mr. Prashant and the XRCVC team for this comprehensive presentation. Good luck to the team.

Dr. Kalpana

----- Original Message -----
From: "Prashant Naik" <[EMAIL PROTECTED]>
To: <accessindia@accessindia.org.in>
Sent: Sunday, August 03, 2008 1:12 PM
Subject: [AI] White Paper: OCR Softwares for Indian languages

> Dear Access India Members,
>
> During the Daisy Forum of India meeting held in Mumbai on 11th and 12th April 2008, I was given the responsibility of finding information on the status of OCR software for Indian languages. Here I present the findings of my research. I have prepared a White Paper on the subject, which I posted in PDF format on the Daisy Forum of India's mailing list 3 days ago. For the benefit and awareness of others, I am pasting its content below this message. This will also help those who had posted queries on Access India regarding this.
>
> White Paper: OCR Softwares for Indian languages
> Date: July 31st, 2008
>
> Introduction :
> OCR software is available for English and other foreign languages, but what is the status of OCR software availability for Indian languages?
> During the Daisy Forum of India meeting held in Mumbai on 11th and 12th April 2008, I was given the responsibility of finding information on this. Here I present the findings of my research.
>
> Definitions :
> OCR: Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of images of handwritten, typewritten or printed text (usually captured by a scanner) into machine-editable text.
> OCR Software: OCR software converts paper documents into electronic data, so that you can handle the information (electronic text) in your computer system.
> Indian Languages: The Indian Constitution recognizes Hindi in Devanāgarī script as the official language of the central government of India. The Constitution of India recognizes 22 languages, spoken in different parts of the country.
> (Source of all definitions: Wikipedia)
>
> Findings :
> Research on the web turned up one workshop/seminar, organized by the Rediff Centre for Indian Language Content Management on the theme "Brainstorming Workshop on OCR for Indian Languages", held on 16-17 March 2007 at Hotel Regalis, Mysore.
> Reference link: http://www.isim.ac.in/RCILCM/index.htm
> Further research, including querying Access India (mailing group for the blind) and contacting NAB Karnataka for more information on this theme, did not turn up anything significant.
>
> Visit by Mr. Venki, rediff.com Technical Head :
> During a meeting with Mr. Venki at the XRCVC in June 2008, some more information about the conference was obtained, since Mr. Venki himself was one of the members of the organizing team from rediff.
> He made the following observation: "Overall the conference was good. Speakers had shared new ideas on developing Indian OCR."
> However, on following up further with regard to this conference, it seems no significant progress has been made since.
>
> Chennai "Print Access" Seminar Findings :
> Our XRCVC team member Neha learned about many technological developments from the "Print Access" conference held in Chennai on April 19th, 2008. She shared a lot of information, contacts and links, e.g. the Acharya website (http://acharya.iitm.ac.in) TTS translator in 22 languages, Ravi TTS for Telugu, C-DAC software such as Mantra, Shruti Drishti and Shrut Lekhan, and a very important lead on Indian OCR software developed by C-DAC Pune.
>
> Visit to C-DAC Pune :
> On May 14th and 15th, the XRCVC team visited C-DAC Pune. The visit was very fruitful. We found a fully developed, off-the-shelf OCR product for Hindi (Devanagari script) named CHITRANKAN, developed by the GIST Development Team, C-DAC, Pune, Maharashtra. They demonstrated the product, and the results were very good. CHITRANKAN is used commercially by 2-3 organizations in Pune.
>
> Other C-DAC resources :
> OCR software in Hindi called CHITRANKAN, in Marathi called CHITRAKSHARIKA and in Malayalam called NAYANA.
>
> About NAYANA :
> Source: http://www.malayalamresourcecentre.org/Mrc/products/nayana.html
> NAYANA is a product that enables the user to convert printed Malayalam documents to editable computer files. This system is very simple to use and requires no prior expertise.
> FEATURES
> - NAYANA processes all types of printed Malayalam documents.
> - Supports TIFF and BMP image formats.
> - Supports document images with resolution 300 dpi and above.
> - Detection and correction of document skew of -5° to +5°.
> - The output document can be stored in both ISCII and ISFOC form.
> - The output document can be saved as TXT, RTF, HTML or ACI file formats.
> - User friendly interface.
> - Recognition speed of 50 char/sec.
> - Conversion of printed documents to editable text.
> - Optical Character Recognition combined with Text–To–Speech technology can be used for a text reading system.
> EXPANDABILITY
> - A layout analyzer can be added to the system to reproduce the input document in its original layout.
> - Can be expanded to cater to handwritten and old documents.
> The linguistic resource generation tools such as Prabandhika and Vishleshika
> Source: http://delnet.nic.in/news-naclin-report.htm
>
> About CHITRANKAN :
> Source: http://www.cdac.in/html/gist/products/chitra.asp
> CHITRANKAN - the first OCR (Optical Character Recognition) system for Indian Languages.
> The OCR process involves:
> • Conversion of printed matter into an electronic image: the printed matter can be converted into an image using a scanner or a digital camera.
> • Electronic image processing: this involves identifying text information by analyzing the image for noise and skew. Once text information is available, another algorithm reads and recognizes the printed matter.
> • Storing the extracted text information as electronic data: the recognized input is converted to a standard format, which can be opened in any word processing application, allowing the user to edit the text data.
> Chitrankan archives Indian Language content in electronic form through OCR. It enables the user to take a book, magazine or printed text in an Indian Language, feed it directly into an electronic computer file, and edit the file using a word processor. Once the data is in the form of electronic text it can be searched, sorted and indexed.
> Chitrankan saves the user the effort of typing an entire document.
> Chitrankan scans a document to screen by recognizing the text and other images as objects. These scanned images are flawless and can be stored or printed time and again.
> Exceedingly user-friendly with features that can edit, move, resize or duplicate the scanned document, Chitrankan also provides a spell check facility.
> The potential of Chitrankan is enormous as it enables users to harness the power of computers to access printed documents in Indian Languages.
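
The image processing step above mentions analyzing the page for skew. As a rough illustration of what skew detection can look like, here is a minimal Python sketch of the generic projection-profile method. It is not C-DAC's algorithm, which is not described in the source. The function names, the ±5 degree search range and the 0.5 degree step are all my own assumptions. The input is simply a list of (x, y) positions of dark pixels.

```python
import math
from collections import Counter


def profile_sharpness(points, angle_deg):
    """Score how well text rows line up after rotating points by -angle_deg."""
    a = math.radians(angle_deg)
    sin_a, cos_a = math.sin(a), math.cos(a)
    # Only the rotated y coordinate matters for a horizontal projection profile.
    rows = Counter(round(-x * sin_a + y * cos_a) for x, y in points)
    # The sum of squared row counts is largest when ink sits in few rows.
    return sum(count * count for count in rows.values())


def estimate_skew(points, max_angle_deg=5.0, step_deg=0.5):
    """Return the candidate angle (degrees) with the sharpest profile.

    The search range and step are illustrative assumptions, not vendor values.
    """
    steps = int(round(max_angle_deg / step_deg))
    candidates = [i * step_deg for i in range(-steps, steps + 1)]
    return max(candidates, key=lambda angle: profile_sharpness(points, angle))


def rotate(points, angle_deg):
    """Rotate (x, y) points about the origin; used here to fake a skewed scan."""
    a = math.radians(angle_deg)
    sin_a, cos_a = math.sin(a), math.cos(a)
    return [(x * cos_a - y * sin_a, x * sin_a + y * cos_a) for x, y in points]


if __name__ == "__main__":
    # Synthetic "page": five text lines, each a row of 200 dark pixels.
    page = [(x, 10 * row) for row in range(5) for x in range(200)]
    skewed = rotate(page, 3.0)
    print(estimate_skew(skewed))
```

On this synthetic page the estimate comes back as the 3 degrees that was applied. A real scan would need binarization and noise removal first, and a finer angle step.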
> Software Advantage:
> • Recognizes Hindi and Marathi languages along with embedded English text.
> • Skew detection and correction for input images up to ±15°
> • Grabs images directly from the scanner for processing
> • Automatic text and picture region detection
> • Supports all TWAIN compatible scanners and digital cameras
> • Supports 256 grayscale/color, .bmp/.tiff images scanned at 300 dpi as input image for recognition
> • Ideal for font sizes between 10 pt. and 36 pt., and all popular fonts.
> • Saves scanned/modified images as .BMP files
> • Saves recognized text in ISCII format or exports it as .RTF for editing using the GIST range of software
> • Uses advanced DSP (Digital Signal Processing) algorithms to remove "Noise" and "Back Page Reflection"
> • Enables printing of both the input image and the recognized text.
> • Provided with inbuilt Flip, Rotate and Negate options for the input image
> User Advantage:
> • Allows deletion of associated pictures from the image by using the ERASE option
> • Provides painting tools to join the breaks in the characters to get good results
> • Allows OCR to be applied on an image rotated by 180° or flipped
> • Applies OCR to images having text in reverse by using the INVERT option
> • Provides inbuilt spell checking facility
> • Provides editing tools like cut, copy, paste, find and replace options for use on recognized text
> System Requirements:
> • Minimum Configuration:
>   Pentium II with 64 MB RAM
>   Virtual Memory requirement 300 MB (Swap File Space in Hard Disk)
> • Recommended Configuration:
>   Pentium III with 128 MB RAM and above
>   Virtual Memory requirement 400 MB
> • Operating Systems Supported:
>   Windows NT ver. 4.0, Service Pack 6.0 and above / Windows 9X and above, Windows 2000 and Windows XP.
> Price: Single user license for CHITRANKAN Rs. 10,000/-
>
> Contacts: channel partner list URL - http://www.cdac.in/html/gist/ch_part.asp
> CHITRANKAN demo can be downloaded from http://www.cdac.in/html/gist/down/chtri_d.asp
> File Size: 45 MB
>
> Experimenting with CHITRANKAN at the XRCVC – findings :
> At the XRCVC, the demo of CHITRANKAN was installed and put through tests. Web pages from the Rajya Sabha website, which are typeset in the Yogesh font, were used for testing Hindi (Devanagari script). The accuracy was good, approximately 70%. This can be improved by using the font training mode.
> Additional documents in Hindi and Marathi were also tested. Results from those were fair, at approximately 40% accuracy. The font training module, however, can increase the accuracy.
> The software supports the Yogesh Hindi font by default. More fonts can be added by training the OCR using the font recognition module.
> Font training modules enable the user to train the software to decipher documents in particular fonts. To make the software even more useful, CHITRANKAN incorporates a set of application program interfaces (APIs) which allow software developers the flexibility to build features from CHITRANKAN into their software applications.
> You can save recognized output in RTF format and choose the recognition language as either Hindi or Marathi.
>
> Screen reader access with Chitrankan -
> The graphical user interface of Chitrankan is very friendly, with menus and shortcuts available for all important options.
> The workspace area has three main windows: the input image window, the recognized output text window and the digitized image window. However, the screen reader (SAFA) is not able to read the recognized text.
>
> Conclusion :
> One can definitely contribute to the development of Indian language OCR by downloading and testing the software and giving feedback to C-DAC, which would help in product enhancement.
> Those who are familiar with Malayalam would do well to test the NAYANA OCR software.
>
> Prashant Naik
> The Xavier's Resource Centre for the Visually Challenged (XRCVC)
> St. Xavier's College, Mumbai.
> ----
> VISION WITHOUT ACTION IS MERELY A DREAM,
> ACTION WITHOUT VISION JUST PASSES THE TIME,
> VISION WITH ACTION CAN CHANGE THE WORLD.
> Join Access India convention: For updates on it visit:
> http://accessindia.org.in/harish/convention.htm
> Registration is now open!
>
> To unsubscribe send a message to [EMAIL PROTECTED]
> with the subject unsubscribe.
>
> To change your subscription to digest mode or make any other changes,
> please visit the list home page at
> http://accessindia.org.in/mailman/listinfo/accessindia_accessindia.org.in