Dear Access India Members,
During the Daisy Forum of India meeting held in Mumbai on 11th and 12th
April 2008, I was given the responsibility of finding information on the
status of OCR software for Indian languages. Here I present the findings I
was able to gather. I have prepared a White Paper on the subject, which I
posted in PDF format on the Daisy Forum of India's mailing list three days
ago. For the benefit and awareness of others, I am pasting its content
below this message. This will also help those who had posted queries on
Access India regarding this.
White Paper: OCR Software for Indian Languages
Date: July 31st, 2008
Introduction :
OCR software is available for English and other foreign languages, but what
is the status of OCR software availability for Indian languages?
During the Daisy Forum of India meeting held in Mumbai on 11th and 12th
April 2008, I was given the responsibility of finding information on this.
Here I present the findings I was able to gather.
Definitions :
OCR: Optical character recognition, usually abbreviated to OCR, is the
mechanical or electronic translation of images of handwritten, typewritten
or printed text (usually captured by a scanner) into machine-editable text.
OCR Software: OCR software converts paper documents into electronic data,
so that you can handle the information (electronic text) in your computer
system.
Indian Languages: The Indian Constitution recognizes Hindi in Devanāgarī
script as the official language of the central government. The Constitution
of India also recognizes 22 languages, spoken in different parts of the
country.
(Source for all definitions: Wikipedia)
Findings :
Research on the web turned up one workshop/seminar organized by the
Rediff Centre for Indian Language Content Management: the "Brainstorming
Workshop on OCR for Indian Languages", held on 16-17 March 2007 at Hotel
Regalis, Mysore.
Reference link: http://www.isim.ac.in/RCILCM/index.htm
Further enquiries on Access India (a mailing group for the blind) and
contact with NAB Karnataka for more information on this theme did not turn
up anything significant.
Visit by Mr. Venki, rediff.com Technical Head :
During a meeting with Mr. Venki at the XRCVC in June 2008, some more
information about the conference was obtained, since Mr. Venki was himself
one of the members of the organizing team from rediff.
He made the following observation: "Overall the conference was good.
Speakers had shared new ideas on developing Indian OCR."
However, on following up further, it seems that no significant progress has
been made since the conference.
Chennai "Print Access" Seminar Findings :
Our XRCVC team member Neha learned about many technological developments
at the "Print Access" conference held in Chennai on April 19th, 2008. She
shared a lot of information, contacts and links, e.g. the Acharya website
(http://acharya.iitm.ac.in) TTS translator in 22 languages, Ravi TTS for
Telugu, C-DAC software such as Mantra, Shruti Drishti and Shrut Lekhan, and
a very important lead on Indian OCR software developed by C-DAC Pune.
Visit to C-DAC Pune :
On May 14th and 15th, the XRCVC team visited C-DAC Pune. The visit was very
fruitful. C-DAC has a fully developed, off-the-shelf OCR product for Hindi
(Devanagari script) named CHITRANKAN, developed by the GIST Development
Team, C-DAC, Pune, Maharashtra. They demonstrated the product, and the
results were very good. CHITRANKAN is used commercially by 2-3
organizations in Pune.
Other C-DAC resources :
C-DAC has OCR software for Hindi called CHITRANKAN, for Marathi called
CHITRAKSHARIKA and for Malayalam called NAYANA.
About NAYANA :
Source: http://www.malayalamresourcecentre.org/Mrc/products/nayana.html
NAYANA is a product that enables the user to convert printed Malayalam
documents to editable computer files. This system is very simple to use and
requires no prior expertise.
FEATURES
- NAYANA processes all types of printed Malayalam Documents.
- Supports TIFF and BMP image formats.
- Supports document Images with resolution 300 dpi and above.
- Detection and correction of document skew of -5° to +5°.
- The output document can be stored in both ISCII and ISFOC form.
- The output document can be saved as TXT, RTF, HTML or ACI file formats.
- User friendly interface.
- Recognition speed of 50 characters/sec.
- Conversion of printed documents to editable text.
- Optical Character Recognition combined with Text–To–Speech technology can
be used for text reading system.
EXPANDABILITY
- A layout analyzer can be added to the system to reproduce the input
document in its original layout.
- Can be expanded to cater to handwritten and old documents.
The linguistic resource generation tools such as Prabandhika and Vishleshika
Source: http://delnet.nic.in/news-naclin-report.htm
About CHITRANKAN :
Source: http://www.cdac.in/html/gist/products/chitra.asp
CHITRANKAN - the first OCR (Optical Character Recognition) system for Indian
Languages.
The OCR process involves:
• Conversion of printed matter into an electronic image - the printed matter
can be converted into an image using a Scanner or a Digital Camera
• Electronic Image Processing - this involves identifying text information
by analyzing the image for noise and skew. Once text information is
available, another algorithm reads and recognizes the printed matter
• Storing the extracted text information as electronic data - the recognized
input is converted to a standard format, which can be opened in any word
processing application, allowing the user to edit the text data.
Chitrankan archives Indian Language content in electronic form through OCR.
It enables the user to take a book, magazine or printed text in an Indian
Language, feed it directly into an electronic computer file, and edit the
file using a word processor. Once the data is in the form of electronic
text it can be searched, sorted and indexed.
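As a rough illustration of why searchable text matters, the sketch below
builds a small word index over pages of recognized text. It is not part of
Chitrankan: the function name build_index and the sample sentences are
assumptions made for illustration, and a real index for Devanagari text
would need proper tokenization and collation rules.

```python
from collections import defaultdict

# Punctuation to trim from word edges, including the Devanagari danda.
PUNCT = "।,.;:!?\"'()"

def build_index(pages):
    """Map each word to the sorted list of page numbers it appears on.

    `pages` is a dict of {page_number: recognized_text}. Words are split
    on whitespace only, which is a simplification.
    """
    index = defaultdict(set)
    for page_no, text in pages.items():
        for word in text.split():
            index[word.strip(PUNCT)].add(page_no)
    # Drop the empty key left by tokens that were pure punctuation.
    return {word: sorted(nums) for word, nums in index.items() if word}

# Hypothetical OCR output for two pages.
pages = {
    1: "भारत एक देश है।",
    2: "भारत की राजधानी दिल्ली है।",
}
index = build_index(pages)
print(index["भारत"])    # pages containing the word: [1, 2]
print(index["दिल्ली"])  # [2]
```

None of this is possible while the page exists only as a scanned image,
which is the practical benefit the paragraph above describes.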
Chitrankan saves the user the effort of typing an entire document.
Chitrankan scans a document to screen by recognizing the text and other
images as objects. These scanned images are flawless and can be stored or
printed time and again.
Exceedingly user-friendly with features that can edit, move, resize or
duplicate the scanned document, Chitrankan also provides a spell check
facility.
The potential of Chitrankan is enormous as it enables users to harness the
power of computers to access printed documents in Indian Languages.
Software Advantage:
• Recognizes Hindi and Marathi languages along with embedded English text
• Skew detection and correction for input images up to ± 15°
• Grabs images directly from the scanner for processing
• Automatic text and picture region detection
• Supports all TWAIN compatible scanners and digital cameras
• Supports 256 grayscale/color .bmp/.tiff images scanned at 300 dpi as input
images for recognition
• Ideal for font sizes between 10 pt. and 36 pt., and all popular fonts
• Saves scanned/modified images as .BMP files
• Saves recognized text in ISCII format or exports it as .RTF for editing
using the GIST range of software
• Uses advanced DSP (Digital Signal Processing) algorithms to remove "Noise"
and "Back Page Reflection"
• Enables printing of both the input image and the recognized text
• Provided with inbuilt Flip, Rotate and Negate options for the input image
User Advantage:
• Allows deletion of associated pictures from the image by using the ERASE
option
• Provides painting tools to join the breaks in the characters to get good
results
• Allows OCR to be applied on an image rotated by 180° or flipped
• Applies OCR to images having text in reverse by using the INVERT option
• Provides inbuilt spell checking facility
• Provides editing tools like cut, copy, paste, find and replace options for
use on recognized text
System Requirements:
• Minimum Configuration:
Pentium II with 64 MB RAM
Virtual Memory requirement 300 MB (Swap File Space in Hard Disk)
• Recommended Configuration:
Pentium III with 128 MB RAM and above
Virtual Memory requirement 400 MB
• Operating Systems Supported:
Windows NT ver. 4.0, Service Pack 6.0 and above / Windows 9X and above,
Windows 2000 and Windows XP.
Price: Single-user license for CHITRANKAN: Rs. 10,000/-
Contacts: channel partner list URL -
http://www.cdac.in/html/gist/ch_part.asp
CHITRANKAN demo can be downloaded from
http://www.cdac.in/html/gist/down/chtri_d.asp
File Size: 45 MB
Experimenting with CHITRANKAN at the XRCVC – findings :
At the XRCVC, the demo version of CHITRANKAN was installed and put through
tests. Web pages from the Rajya Sabha website, which are typeset in the
Yogesh font, were used to test Hindi (Devanagari script). Accuracy was
good, at approximately 70%. This can be improved by using the font training
mode.
Additional documents in Hindi and Marathi were also tested. Results from
those were fair, at approximately 40% accuracy. The font training module,
however, can increase the accuracy.
The software supports the Yogesh Hindi font by default. More fonts can be
added by training the OCR using the font recognition module.
Font training modules enable the user to train the software to decipher
documents in particular fonts. To make the software even more useful,
CHITRANKAN incorporates a set of application programming interfaces (APIs)
which give software developers the flexibility to build features from
CHITRANKAN into their software applications.
You can save recognized output in RTF format and even choose recognition
language as either Hindi or Marathi.
Screen reader access with Chitrankan :
The graphical user interface of Chitrankan is very friendly, with menus and
shortcuts available for all important options.
The workspace area has three main windows: the input image window, the
recognized output text window and the digitized image window. However, the
screen reader (SAFA) is not able to read the recognized text.
Conclusion :
One can definitely contribute to the development of Indian language OCR by
downloading and testing the software and giving feedback to C-DAC, which
would help in product enhancement. Those who are familiar with Malayalam
would do well to test the NAYANA OCR software.
Prashant Naik
The Xavier's Resource Centre for the Visually Challenged (XRCVC)
St. Xavier's College, Mumbai.