Hi Behdad,

On Thursday 7 April 2005 19:33, Behdad Esfahbod wrote:
> I don't want to be a bastard, but I really think things like tiff
> handling do not belong to an OCR project.

Files dealing with TIFF are in SIRAGI simply to read image files ;-)
In the same way, the GOCR source directory contains pnm.c, pcx.c and
tga.c to read PNM, PCX and TGA image files.

Why include an excerpt of the libtiff files instead of simply telling
developers to use libtiff? First, to simplify development, since all
the files are in the same directory. Second, I have made some
modifications to the Makefile to adapt libtiff to Unix and Windows.
That said, I'm not against removing these files from CVS if other
contributors agree.

Why work with TIFF and not another format? Simply because all scanner
drivers generate TIFF files but none generates PNM files! Another
reason: in the OCR domain, the best format is B&W TIFF Group 4. It is
efficient in storage space, it loses no pixels, and it is well known
to all software.

> In fact, I believe
> an Arabic OCR application is out of place by definition too, the
> same for an Arabic editor, an Arabic spell-checker, etc.

- First: this is an old discussion. Why code both KDE and GNOME? Why
  write GNU/Linux when BSD exists? And so on. IMHO free software
  offers many solutions to the same problem because there are many
  ideas and many flavours of the same functionality. I'm not saying
  that we should reinvent the wheel; I think that if a new project
  brings new ideas, a new design or a new approach, it should be done.

- Second: I have already tried GOCR, read its documentation and looked
  at its source code. Its design is too specific to Latin characters;
  I can't see how to adapt it to Arabic without rewriting everything.
  Here are some examples:

  * Line detection: the algorithm assumes that characters are written
    from left to right. To address this issue you have to rewrite the
    entire horizontal segmentation. That's what I have done in SIRAGI.

  * What GOCR's author calls "cluster detection": this is the tool
    that detects characters in a line. If you apply this algorithm to
    Arabic text you get words, not characters. An OCR should recognize
    characters, not words, since there are only 28 characters but an
    infinity of words :-)

  * The heart of GOCR, the OCR engines: they are not general but
    specifically designed for Latin characters. There are no neural
    networks, no classification based on general pixel comparison,
    and no vectorization. So no line of code can be adapted from
    these engines :-)

Conclusion: I think SIRAGI-OCR is really a necessity. We have no
alternative but to write new software from scratch, with a more
general design that can handle Arabic text. Later, we can easily
adapt it to recognize Latin characters!

Anyway, thanks Behdad for warning us against reinventing the wheel,
but I sincerely think we are not falling into this trap.

Best regards,
Tarik
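
To make the "cluster detection" point above concrete, here is a minimal
sketch in Python. It is not GOCR's or SIRAGI's actual code: the function
name split_on_blank_columns and the toy bitmaps are hypothetical, and
splitting a line on blank pixel columns is only a simplified stand-in
for real cluster detection. It shows why a gap-based splitter yields one
segment per glyph when glyphs are separated, but one segment per word
when the glyphs are joined, as Arabic letters are within a word.

```python
def split_on_blank_columns(bitmap):
    """Return (start, end) column ranges, end exclusive, that contain ink.

    bitmap is a list of equal-length strings; '#' is ink, '.' is background.
    A segment ends at the first column that contains no ink at all.
    """
    width = len(bitmap[0]) if bitmap else 0
    segments = []
    start = None
    for x in range(width):
        has_ink = any(row[x] == "#" for row in bitmap)
        if has_ink and start is None:
            start = x                      # a new segment begins here
        elif not has_ink and start is not None:
            segments.append((start, x))    # blank column closes the segment
            start = None
    if start is not None:
        segments.append((start, width))    # segment runs to the right edge
    return segments


# Three glyphs separated by blank columns (Latin-like printing).
separated = [
    "#.#.#",
    "#.#.#",
]

# The same three glyphs joined by a baseline stroke (Arabic-like word).
connected = [
    "#.#.#",
    "#####",
]

print(len(split_on_blank_columns(separated)))  # 3 segments: one per glyph
print(len(split_on_blank_columns(connected)))  # 1 segment: the whole word
```

With joined script the splitter has no blank column to cut on, so
separating characters needs a different approach than looking for gaps.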

