Hello all,

Tesseract[1] 3.01, released last October, includes an Arabic recogniser.
I just tested it on a scanned page from an old book typeset in Naskh (a fairly complex, but common font), and about 80% of the words were recognised correctly. Most of the misrecognised words contain diacritics or dots, which seem to confuse it. I think training could improve the results (though I have no experience with Tesseract), but there is no training module for the Arabic recogniser yet, per the release notes[2].

I thought this would be of interest to people here.

[1] http://code.google.com/p/tesseract-ocr/
[2] http://code.google.com/p/tesseract-ocr/wiki/ReleaseNotes

Regards,
Khaled
_______________________________________________
Doc mailing list
http://lists.arabeyes.org/mailman/listinfo/doc
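
For anyone who wants to put a number on a test like the one above, here is a minimal sketch of one way a word-level accuracy figure could be estimated. It is not the method used for the ~80% figure in this message. The function name `word_accuracy` and the idea of comparing the OCR output against a hand-typed transcription are assumptions made for illustration, and the comparison is a simple in-order word match from Python's standard `difflib`, not a formal OCR evaluation metric.

```python
import difflib


def word_accuracy(reference: str, recognised: str) -> float:
    """Return the fraction of reference words reproduced, in order, by the OCR text.

    `reference` is a hand-typed transcription of the page (assumed to exist);
    `recognised` is the text produced by the OCR engine.
    """
    ref_words = reference.split()
    ocr_words = recognised.split()
    if not ref_words:
        # Nothing to compare against, so report zero rather than divide by zero.
        return 0.0

    # Align the two word sequences; autojunk is disabled so that frequent
    # words are not silently ignored on long pages.
    matcher = difflib.SequenceMatcher(a=ref_words, b=ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


if __name__ == "__main__":
    # Placeholder words stand in for real Arabic text here.
    print(word_accuracy("a b c d e", "a b x d e"))
```

Because whole words are compared, a word with a single wrong dot or diacritic counts as a miss, which matches the kind of error described above.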

