Hello,

While I'm not an Arabeyes contributor, I found the April 2007 thread
about Siragi OCR very interesting.  I have some comments, in three
areas: (1) OCRopus / Tesseract, (2) Summer of Code and (3) Hebrew.

OCRopus / Tesseract
--------------------------------

I was surprised that people were not more strongly in favor of joining the 
project to the OCRopus or Tesseract projects. While I don't have in-depth 
knowledge of any of these projects, joining seems like a good idea to me, for 
these reasons:

1) Joining OCRopus or Tesseract could make it easier to collaborate with people 
who are working on open source OCR for other Arabic-script languages such as 
Persian and Urdu. Faisal Shafait at IUPR (http://www.iupr.org/), who is one of 
the leaders of the OCRopus project, wants to add Urdu support to OCRopus:
http://groups.google.com/group/ocropus/web/ocropusurdu

2) Joining OCRopus or Tesseract could create more opportunities to get 
technical advice from experts like the people at IUPR who are involved in 
OCRopus.  I don't think it's necessary for people to know Arabic in order to 
give good advice.  I think many issues that are specific to Arabic could be 
quickly explained to people who don't know Arabic by referring them to a 
resource like the excellent tutorial on Arabic text processing by Nizar Habash
(http://www.ccls.columbia.edu/cadim/TUTORIAL.ARABIC.NLP.pdf).

3) I expect a lot of code in an OCR system can be re-used across different 
languages, even if the languages don't use the same characters for writing.  I 
expect a lot of the code in an OCR system is tricky stuff involving difficult 
algorithms, so I think it's great to be able to re-use existing 
high-performance code, and I fear that an effort to build a high-accuracy 
system from scratch might get bogged down.  I don't think the open-source 
Hebrew OCR system mentioned here in April proves otherwise, because it is not 
high-accuracy (there are a lot of errors in the "300DPI Line-Art.scan" example 
at http://hocr.berlios.de/examples.html), and also because Hebrew is presumably 
an easier language to OCR than Arabic because the characters don't join and 
there is much less shaping.

4) Since OCRopus and Tesseract are sponsored by Google, joining with
these projects may improve Siragi's chances of Summer of Code funding
in the future.

Summer of Code mentors
-----------------------------------------

For next year's Summer of Code proposal, it might be a good idea to look for a 
mentor or co-mentor from outside of Arabeyes who already has a lot of 
experience with OCR technology or related technology (unless you already have 
this experience inside Arabeyes).  Voxforge.org contacted a university 
researcher about being a mentor for one of their Summer of Code ideas, and I 
think the idea makes a lot of sense.

IUPR might be a good place to find someone.  You could also search around on 
the Internet to see who has published OCR papers, or has participated in 
conferences like the DRR and ICDAR conferences mentioned at
http://en.wikipedia.org/wiki/Optical_character_recognition.

Regarding Arabic in particular, I just took a quick look at the ICDAR 2007 web 
site and I see there are three people from Tunisia and one from Algeria on the 
program committee.  A quick Google search (alimi OR amara OR kacem OR sellami 
icdar) shows they have published papers at ICDAR in the past on OCR and 
handwriting recognition, including work focused on Arabic.  

Also, the mailing list of the ACL SIG on Semitic Languages 
(http://www.semitic.tk/) might be a good place to find potential mentors, 
especially for later-stage modeling techniques such as n-grams.  (By 
later-stage I mean it is later than the image processing.  By n-grams I mean 
models of the probabilities of sequences of words or characters.)   For 
example, Kareem Darwish, who I see from the SIG's web site has participated in 
at least one event sponsored by the SIG, did a PhD thesis on improving 
later-stage modeling in Arabic OCR.
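
To make the n-gram idea concrete, here is a minimal sketch in Python of a 
character-level bigram model used to choose between two candidate OCR outputs 
for the same word.  Everything in it is my own assumption for illustration: 
the tiny transliterated word list, the function names, and the add-one 
smoothing.  It is not taken from Siragi, OCRopus, Tesseract or any of the 
papers mentioned here, and a real system would train on a large corpus in 
Arabic script.

```python
import math
from collections import Counter

def train_char_bigrams(corpus):
    """Count character bigrams over a list of words (hypothetical helper)."""
    bigrams = Counter()   # counts of (previous char, next char)
    contexts = Counter()  # counts of the previous char alone
    vocab = set()
    for word in corpus:
        padded = "^" + word + "$"  # mark word start and end
        vocab.update(padded)
        for a, b in zip(padded, padded[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    return bigrams, contexts, vocab

def log_prob(word, model):
    """Log-probability of a word under the bigram model, add-one smoothed."""
    bigrams, contexts, vocab = model
    padded = "^" + word + "$"
    v = len(vocab)
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        total += math.log((bigrams[(a, b)] + 1) / (contexts[a] + v))
    return total

# Toy training data: transliterated words, an assumption for readability.
corpus = ["kitab", "katib", "maktab", "kutub"]
model = train_char_bigrams(corpus)

# Two hypothetical outputs from the image-processing stage for one word.
candidates = ["kitab", "kitqb"]
best = max(candidates, key=lambda w: log_prob(w, model))
print(best)
```

The later stage simply prefers the candidate whose character sequence is more 
probable under the model.  The same scoring idea carries over to word-level 
n-grams, with words in place of characters.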

Also, maybe the recent ISCAL event on Arabic computing 
(http://www.iscal.org.sa/) had something about OCR?  I see that OCR is
listed in the English-language list of potential topics for that event, but I 
can't read the rest of that site since I don't read Arabic.

Hebrew
------------

As was already mentioned in the thread, there is very little shaping in Hebrew. 
It only happens for five letters and only at the ends of words 
(http://en.wikipedia.org/wiki/Hebrew_alphabet).

However, there are some similarities between the two languages, besides RTL, 
that might be relevant to open source OCR.

First, the two languages omit vowel information from writing in similar ways.  
If you'd like to feed OCR output into a speech synthesizer so that a blind 
person can use it, I guess you will need to recover the vowel information.  
Maybe there is some potential here for code or algorithms to be re-used between 
the two languages?  (By the way, Hebrew with added vowel marks is sometimes 
called "pointed Hebrew" or "dotted Hebrew" and the usual written Hebrew is 
sometimes called "unpointed Hebrew" or "undotted Hebrew".)

Second, the two languages have a number of similarities in grammar and 
morphology (e.g., the use of three-character roots).  If you want to use 
word-level n-grams, I think it will work well (if there's enough training data) 
to use word-level n-grams of the same type that people use for English, but it 
may work even better if you use specialized n-gram algorithms that reflect 
Arabic morphology.  I've seen some papers on Arabic speech recognition which 
do this.  And if it is helpful for Arabic it may be helpful for Hebrew too.  
Here are the papers I've seen: 

Ghaoui et al.:
http://www.tsi.enst.fr/publications/enst/inproceedings-2005-5754.pdf
or http://www.isca-speech.org/archive/interspeech_2005/i05_1281.html

Kirchhoff et al. (the link is to a PDF of her 2006 journal paper in
Computer Speech and Language):
http://www.speech.sri.com/cgi-bin/run-distill?/pubs/papers/Kirc0610-589:Morphology//document.ps.gz

Regards,
David Gelbart
www.icsi.berkeley.edu/~gelbart


       