Siragi

David Gelbart Mon, 14 May 2007 13:07:45 -0700

Hello,

While I'm not an Arabeyes contributor, I found the April 2007 thread
about Siragi OCR very interesting.  I have some comments, in three
areas: (1) OCRopus / Tesseract, (2) Summer of Code and (3) Hebrew.

OCRopus / Tesseract
--------------------------------

I was surprised that people were not more strongly in favor of joining Siragi
to the OCRopus or Tesseract projects. While I don't have in-depth
knowledge of any of these projects, joining seems like a good idea to me, for
these reasons:

1) Joining OCRopus or Tesseract could make it easier to collaborate with people
who are working on open source OCR for other Arabic-script languages such as
Persian and Urdu. Faisal Shafait at IUPR (http://www.iupr.org/), who is one of
the leaders of the OCRopus project, wants to add Urdu support to OCRopus:
http://groups.google.com/group/ocropus/web/ocropusurdu

2) Joining OCRopus or Tesseract could create more opportunities to get
technical advice from experts like the people at IUPR who are involved in
OCRopus. I don't think it's necessary for people to know Arabic in order to
give good advice. I think many issues that are specific to Arabic could be
quickly explained to people who don't know Arabic by referring them to a
resource like the excellent tutorial on Arabic text processing by Nizar Habash
(http://www.ccls.columbia.edu/cadim/TUTORIAL.ARABIC.NLP.pdf).

3) I expect a lot of code in an OCR system can be re-used across different
languages, even if the languages don't use the same characters for writing. I
expect a lot of the code in an OCR system is tricky stuff involving difficult
algorithms, so I think it's great to be able to re-use existing
high-performance code, and I fear that an effort to build a high-accuracy
system from scratch might get bogged down. I don't think the open-source
Hebrew OCR system mentioned here in April proves otherwise, because it is not
high-accuracy (there are a lot of errors in the "300DPI Line-Art.scan" example
at http://hocr.berlios.de/examples.html), and also because Hebrew is presumably
an easier language to OCR than Arabic because the characters don't join and
there is much less shaping.

4) Since OCRopus and Tesseract are sponsored by Google, joining with
these projects may improve Siragi's chances of Summer of Code funding
in the future.

Summer of Code mentors
-----------------------------------------

For next year's Summer of Code proposal, it might be a good idea to look for a
mentor or co-mentor from outside of Arabeyes who already has a lot of
experience with OCR technology or related technology (unless you already have
this experience inside Arabeyes). Voxforge.org contacted a university
researcher about being a mentor for one of their Summer of Code ideas, and I
think the idea makes a lot of sense.

IUPR might be a good place to find someone. You could also search around on
the Internet to see who has published OCR papers, or has participated in
conferences like the DRR and ICDAR conferences mentioned at
http://en.wikipedia.org/wiki/Optical_character_recognition.

Regarding Arabic in particular, I just took a quick look at the ICDAR 2007 web
site and I see there are three people from Tunisia and one from Algeria on the
program committee. A quick Google search (alimi OR amara OR kacem OR sellami
icdar) shows they have published papers at ICDAR in the past on OCR and
handwriting recognition, including work focused on Arabic.

Also, the mailing list of the ACL SIG on Semitic Languages
(http://www.semitic.tk/) might be a good place to find potential mentors,
especially for later-stage modeling techniques such as n-grams. (By
later-stage I mean it is later than the image processing. By n-grams I mean
models of the probabilities of sequences of words or characters.) For
example, Kareem Darwish, who I see from the SIG's web site has participated in
at least one event sponsored by the SIG, did a PhD thesis on improving
later-stage modeling in Arabic OCR.
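
To make the n-gram idea concrete, here is a minimal sketch in Python of a
character bigram model used to choose between candidate spellings that an
image-processing stage might produce. It is only an illustration under my own
assumptions, not how Siragi, OCRopus or Tesseract work: the function names,
the "^" and "$" boundary markers and the add-one smoothing are all
hypothetical choices, and a real system would train on a large corpus.

```python
import math
from collections import Counter

def train_bigram_model(corpus_words):
    """Count character bigrams; '^' and '$' mark word start and end."""
    bigrams = Counter()
    contexts = Counter()
    alphabet = {"$"}
    for word in corpus_words:
        padded = "^" + word + "$"
        alphabet.update(word)
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, len(alphabet)

def log_prob(word, model):
    """Log-probability of a word under the model, with add-one smoothing."""
    bigrams, contexts, vocab_size = model
    padded = "^" + word + "$"
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
    return total

def pick_best(candidates, model):
    """Return the candidate spelling that the model finds most likely."""
    return max(candidates, key=lambda w: log_prob(w, model))

# Toy usage: the training words and candidates are made up for illustration.
model = train_bigram_model(["the", "then", "them", "this", "that"])
best = pick_best(["tbe", "the"], model)
```

The same functions work unchanged on Arabic or Hebrew strings, since they
only look at sequences of characters.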

Also, maybe the recent ISCAL event on Arabic computing
(http://www.iscal.org.sa/) had something about OCR? I see that OCR is
listed in the English-language list of potential topics for that event, but I
can't read the rest of that site since I don't read Arabic.

Hebrew
------------

As was already mentioned in the thread, there is very little shaping in Hebrew.
It only happens for five letters and only at the ends of words
(http://en.wikipedia.org/wiki/Hebrew_alphabet).

However, there are some similarities between the two languages, besides RTL,
that might be relevant to open source OCR.

First, the two languages omit vowel information from writing in similar ways.
If you'd like to feed OCR output into a speech synthesizer so that a blind
person can use it, I guess you will need to recover the vowel information.
Maybe there is some potential here for code or algorithms to be re-used between
the two languages? (By the way, Hebrew with added vowel marks is sometimes
called "pointed Hebrew" or "dotted Hebrew" and the usual written Hebrew is
sometimes called "unpointed Hebrew" or "undotted Hebrew".)
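
As a small illustration of how pointed and unpointed text relate, the Python
sketch below removes the vowel marks from a pointed string. This is the easy
direction; recovering the marks is the hard one. I am assuming that the marks
are stored as Unicode nonspacing combining characters (category "Mn"), which
I believe holds for the Hebrew points and the Arabic short-vowel marks, and
the function name is just a hypothetical one. Stripping pointed text like
this might be one way to build (unpointed, pointed) training pairs for a
vowel-recovery model.

```python
import unicodedata

def strip_vowel_marks(text):
    """Drop every nonspacing combining mark (Unicode category Mn)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# "shalom" written with points: shin, shin dot, qamats, lamed, vav, holam, final mem
pointed = "\u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd"
unpointed = strip_vowel_marks(pointed)  # the four base letters only
```

Note that this removes all combining marks, not only vowels, so it would
need refining if some marks must be kept.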

Second, the two languages have a number of similarities in grammar and
morphology (e.g., the use of three-character roots). If you want to use
word-level n-grams, I think it will work well (if there's enough training data)
to use word-level n-grams of the same type that people use for English, but it
may work even better if you use specialized n-gram algorithms that reflect
Arabic morphology. I've seen some Arabic speech recognition papers which do
this. And if it is helpful for Arabic it may be helpful for Hebrew too. Here
are the papers I've seen:

Ghaoui et al.:
http://www.tsi.enst.fr/publications/enst/inproceedings-2005-5754.pdf
or http://www.isca-speech.org/archive/interspeech_2005/i05_1281.html

Kirchhoff et al. (the link is to a PDF of her 2006 journal paper in
Computer Speech and Language):
http://www.speech.sri.com/cgi-bin/run-distill?/pubs/papers/Kirc0610-589:Morphology//document.ps.gz

Regards,
David Gelbart
www.icsi.berkeley.edu/~gelbart


