[AI] White Paper: OCR Softwares for Indian languages

Prashant Naik Sun, 03 Aug 2008 00:44:32 -0700

Dear Access India Members,



During the Daisy Forum of India meeting held in Mumbai on 11th and 12th
April 2008, I was given the responsibility of finding information on the
status of OCR software for Indian languages. Here I present the findings of
my research. I have prepared a White Paper on the subject, which I posted in
PDF format on the Daisy Forum of India's mailing list three days ago. For
the benefit and awareness of others, I am pasting its content below this
message. This will also help those who had posted queries on Access India
regarding this.



White Paper: OCR Softwares for Indian languages

Date: July 31st, 2008

Introduction :

OCR software is available for English and other foreign languages, but what
is the status of OCR software availability for Indian languages?

During the Daisy Forum of India meeting held in Mumbai on 11th and 12th
April 2008, I was given the responsibility of finding information on this.
Here I present the findings of my research.



Definitions :

OCR: - Optical character recognition, usually abbreviated to OCR, is the
mechanical or electronic translation of images of handwritten, typewritten
or printed text (usually captured by a scanner) into machine-editable text.

OCR Software: - OCR software converts paper documents into electronic data,
so that you can handle the information (electronic text) in your computer
system.

Indian Languages: - The Indian Constitution recognizes Hindi in Devanāgarī
script as the official language of the central government of India. The
Constitution of India also recognizes 22 languages, spoken in different
parts of the country.

(Source of all definitions: Wikipedia)



Findings :

Research on the web highlighted one workshop / seminar, the "Brainstorming
Workshop on OCR for Indian Languages", organized by the Rediff Centre for
Indian Language Content Management on 16-17 March 2007 at Hotel Regalis,
Mysore.

Reference link: http://www.isim.ac.in/RCILCM/index.htm

Further research, by posting queries on Access India (a mailing group for
the blind) and by contacting NAB Karnataka for more information on this
theme, did not turn up anything significant.



Visit by Mr. Venki, rediff.com Technical Head :

During a meeting with Mr. Venki at the XRCVC in June 2008, some more
information about the conference was obtained, since Mr. Venki himself was
one of the members of the organizing team from rediff. He made the following
observation: "Overall the conference was good. Speakers had shared new ideas
on developing Indian OCR."

However, on further follow-up regarding this conference, it seems that no
significant progress has been made since.



Chennai "Print Access" Seminar Findings :

Our XRCVC team member Neha learned about many technological developments
from the "Print Access" conference, which was held in Chennai on April 19th,
2008. She shared a lot of information, contacts and links, e.g. the Acharya
website (http://acharya.iitm.ac.in), a TTS translator in 22 languages, Ravi
TTS for Telugu, C-DAC software such as Mantra, Shruti Drishti and Shrut
Lekhan, and a very important lead on Indian OCR software developed by C-DAC
Pune.



Visit to C-DAC Pune :

On May 14th and 15th, the XRCVC team visited C-DAC Pune. The visit was very
fruitful. We found a fully developed, off-the-shelf product for Hindi in
Devanagari script, named CHITRANKAN, developed by the GIST Development Team,
C-DAC, Pune, Maharashtra. They demonstrated the product, and the result was
very good. CHITRANKAN is used commercially by 2-3 organizations in Pune.

Other C-DAC resources :

The OCR software is called CHITRANKAN for Hindi, CHITRAKSHARIKA for Marathi
and NAYANA for Malayalam.



About NAYANA :

Source: http://www.malayalamresourcecentre.org/Mrc/products/nayana.html

NAYANA is a product that enables the user to convert printed Malayalam
documents to editable computer files. This system is very simple to use and
requires no prior expertise.

FEATURES

- NAYANA processes all types of printed Malayalam documents.
- Supports TIFF and BMP image formats.
- Supports document images with a resolution of 300 dpi and above.
- Detection and correction of document skew of -5° to +5°.
- The output document can be stored in both ISCII and ISFOC form.
- The output document can be saved in TXT, RTF, HTML or ACI file formats.
- User-friendly interface.
- Recognition speed of 50 characters/sec.
- Conversion of printed documents to editable text.
- Optical Character Recognition combined with Text-To-Speech technology can
  be used for a text reading system.
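To put the 300 dpi and 50 characters/sec figures in perspective, here is a
small worked calculation. It is only a sketch: the A4 page size and the
1500-character page are assumptions chosen for illustration, and the function
names are hypothetical, not part of NAYANA.

```python
MM_PER_INCH = 25.4


def pixels_at_dpi(width_mm, height_mm, dpi=300):
    """Return the (width, height) in pixels of a page scanned at `dpi`."""
    return (round(width_mm / MM_PER_INCH * dpi),
            round(height_mm / MM_PER_INCH * dpi))


def recognition_seconds(char_count, chars_per_sec=50):
    """Return the time needed to recognize `char_count` characters."""
    return char_count / chars_per_sec


# An A4 sheet is 210 mm x 297 mm (assumed page size).
print(pixels_at_dpi(210, 297))
# A page holding 1500 characters (assumed figure) at the stated speed.
print(recognition_seconds(1500))
```

Under these assumptions an A4 scan comes to roughly 2480 x 3508 pixels, and
one page takes about half a minute to recognize.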

EXPANDABILITY

- A layout analyzer can be added to the system to reproduce the input
  document in its original layout.
- Can be expanded to cater to handwritten and old documents.

The linguistic resource generation tools such as Prabandhika and Vishleshika

Source: http://delnet.nic.in/news-naclin-report.htm



About CHITRANKAN :

Source: http://www.cdac.in/html/gist/products/chitra.asp

CHITRANKAN - the first OCR (Optical Character Recognition) system for Indian
Languages.

The OCR process involves:

• Conversion of printed matter into an electronic image - the printed matter
  can be converted into an image using a scanner or a digital camera
• Electronic image processing - this involves identifying text information
  by analyzing the image for noise and skew. Once text information is
  available, another algorithm reads and recognizes the printed matter
• Storing the extracted text information as electronic data - the recognized
  input is converted to a standard format, which can be opened in any word
  processing application, enabling the user to edit the text data.
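The product page does not say how CHITRANKAN analyzes skew. As a rough
illustration of the idea, the sketch below uses the projection-profile
method, one common textbook approach: each candidate angle is tried, the dark
pixels are sheared back by that angle, and the angle that packs them into the
fewest rows wins. The function names, the 0.5° step and the synthetic test
page are all assumptions made for this example.

```python
import math


def estimate_skew(pixels, max_angle=15.0, step=0.5):
    """Estimate page skew in degrees from (x, y) coordinates of dark pixels.

    For each candidate angle the pixels are sheared back by that angle and
    counted per row. Straight text lines pile up in few rows, so the sum of
    squared row counts peaks at the true skew.
    """
    if not pixels:
        return 0.0
    best_angle, best_score = 0.0, -1
    steps = int(round(2 * max_angle / step))
    for i in range(steps + 1):
        angle = -max_angle + i * step
        slope = math.tan(math.radians(angle))
        rows = {}
        for x, y in pixels:
            row = round(y - x * slope)
            rows[row] = rows.get(row, 0) + 1
        score = sum(count * count for count in rows.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


def synthetic_lines(angle_deg, line_count=3, spacing=20, width=200):
    """Build a toy 'page': a few one-pixel-thick lines tilted by angle_deg."""
    slope = math.tan(math.radians(angle_deg))
    return [(x, round(k * spacing + x * slope))
            for k in range(line_count) for x in range(width)]


# Tilt a toy page by 3 degrees and ask the estimator to recover the angle.
print(estimate_skew(synthetic_lines(3.0)))
```

A real scan would first be binarized to obtain the dark-pixel coordinates,
and the detected angle would then be used to rotate the image upright before
recognition.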

Chitrankan archives Indian Language content in electronic form through OCR.
It enables the user to take a book, magazine or printed text in an Indian
Language, feed it directly into an electronic computer file, and edit the
file using a word processor. Once the data is in the form of electronic text
it can be searched, sorted and indexed.

Chitrankan saves the user the effort of typing an entire document.

Chitrankan scans a document to screen by recognizing the text and other
images as objects. These scanned images are flawless and can be stored or
printed time and again.

Exceedingly user-friendly, with features that can edit, move, resize or
duplicate the scanned document, Chitrankan also provides a spell check
facility.

The potential of Chitrankan is enormous, as it enables users to harness the
power of computers to access printed documents in Indian Languages.

Software Advantage:

• Recognizes Hindi and Marathi languages along with embedded English text.
• Skew detection and correction for input images up to ± 15°
• Grabs images directly from the scanner for processing
• Automatic text and picture region detection
• Supports all TWAIN compatible scanners and digital cameras
• Supports 256 grayscale/color, .bmp/.tiff images scanned at 300 dpi as
  input images for recognition
• Ideal for font sizes between 10 pt. and 36 pt., and all popular fonts.
• Saves scanned/modified images as .BMP files
• Saves recognized text in ISCII format or exports it as .RTF for editing
  using the GIST range of software
• Uses advanced DSP (Digital Signal Processing) algorithms to remove "Noise"
  and "Back Page Reflection"
• Enables printing of both the input image and the recognized text.
• Provided with inbuilt Flip, Rotate and Negate options for the input image

User Advantage:

• Allows deletion of associated pictures from the image by using the ERASE
  option
• Provides painting tools to join the breaks in the characters to get good
  results
• Allows OCR to be applied on an image rotated by 180° or flipped
• Applies OCR to images having text in reverse by using the INVERT option
• Provides an inbuilt spell checking facility
• Provides editing tools like cut, copy, paste, find and replace options for
  use on recognized text
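The rotate, flip and invert options listed above are simple geometric and
tonal transforms. As a minimal sketch of what such operations do, here they
are applied to a tiny bitmap stored as a list of rows, where 1 is a dark
pixel and 0 is a light one. The representation and function names are
assumptions for illustration and are not taken from CHITRANKAN.

```python
def rotate_180(image):
    """Turn the image upside down: reverse the row order and each row."""
    return [row[::-1] for row in image[::-1]]


def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def invert(image):
    """Swap dark and light pixels, e.g. for light text on a dark page."""
    return [[1 - pixel for pixel in row] for row in image]


page = [[1, 0],
        [0, 0]]

print(rotate_180(page))
print(flip_horizontal(page))
print(invert(page))
```

Applying rotate_180 twice, or invert twice, returns the original bitmap,
which is why such options can be used to bring an upside-down or
reverse-printed scan back to a form the recognizer expects.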

System Requirements:

• Minimum Configuration:
  Pentium II with 64 MB RAM
  Virtual Memory requirement 300 MB (swap file space on hard disk)
• Recommended Configuration:
  Pentium III with 128 MB RAM and above
  Virtual Memory requirement 400 MB
• Operating Systems Supported:
  Windows NT ver. 4.0 with Service Pack 6.0 and above, Windows 9X and above,
  Windows 2000 and Windows XP.

Price: Single-user license for CHITRANKAN: Rs. 10,000/-



Contacts: channel partner list URL -
http://www.cdac.in/html/gist/ch_part.asp

The CHITRANKAN demo can be downloaded from
http://www.cdac.in/html/gist/down/chtri_d.asp

File Size: 45 MB

Experimenting with CHITRANKAN at the XRCVC – findings :

At the XRCVC, a demo of CHITRANKAN was installed and put through tests.

Web pages from the Rajyasabha website, which are typeset in the Yogesh font,
were used for testing Hindi in Devanagari script. The accuracy can be
described as good, at approximately 70%. This can be improved by using the
font training mode.

Additional documents in Hindi and Marathi were tested. Results from those
were fair, at approximately 40% accuracy. The font training module, however,
can increase the accuracy.
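The accuracy figures above are approximate, and the method used to arrive at
them is not recorded here. For anyone who wants to repeat the tests in a
measurable way, one common approach is character-level accuracy based on edit
distance: count the insertions, deletions and substitutions needed to turn
the OCR output into a hand-typed reference text. The sketch below is one way
to compute it; the function names are hypothetical.

```python
def edit_distance(reference, recognized):
    """Levenshtein distance between two strings, counted in code points."""
    previous = list(range(len(recognized) + 1))
    for i, ref_char in enumerate(reference, 1):
        current = [i]
        for j, rec_char in enumerate(recognized, 1):
            current.append(min(previous[j] + 1,        # deletion
                               current[j - 1] + 1,     # insertion
                               previous[j - 1] + (ref_char != rec_char)))
        previous = current
    return previous[-1]


def char_accuracy(reference, recognized):
    """Fraction of the reference text recovered correctly (0.0 to 1.0)."""
    if not reference:
        return 1.0 if not recognized else 0.0
    errors = edit_distance(reference, recognized)
    return max(0.0, 1.0 - errors / len(reference))


# The OCR output has dropped the vowel sign from the reference word.
print(char_accuracy("भारत", "भरत"))
```

Note that Python counts Unicode code points, so in Devanagari each vowel sign
(matra) and each part of a conjunct counts as a separate character. A score
computed this way may therefore differ from one based on visible syllables.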

The software supports the Yogesh Hindi font by default. More fonts can be
added by training the OCR using the font recognition module.

Font training modules enable the user to train the software to decipher
documents in particular fonts. To make the software even more useful,
CHITRANKAN incorporates a set of application program interfaces (APIs) which
allow software developers the flexibility to build features from CHITRANKAN
into their software applications.

You can save recognized output in RTF format and choose either Hindi or
Marathi as the recognition language.



Screen reader access with Chitrankan -

The graphical user interface of Chitrankan is very friendly, with menus, and
shortcuts are available for all important options.

The workspace area has three main windows: the input image window, the
recognized output text window and the digitized image window. However, the
screen reader (SAFA) is not able to read the recognized text.



Conclusion :

One can definitely contribute to the development of Indian language OCR by
downloading and testing the software and giving feedback to C-DAC, which
would help in product enhancement. Those who are familiar with Malayalam
would do well to test the NAYANA OCR software.

Prashant Naik

The Xavier's Resource Centre for the Visually Challenged (XRCVC)

St. Xavier's College, Mumbai.