Hi! OK, I'm more and more convinced that this project would be worth it, and I think I'll spend some time on it. But before starting, here are a few questions.

1) Tesseract-OCR has been released under the Apache License 2.0. This license is known to be incompatible with the GPL because it is more restrictive about patents:

« Apache Software License, version 2.0

This is a free software license but it is incompatible with the GPL. The Apache Software License is incompatible with the GPL because it has a specific requirement that is not in the GPL: it has certain patent termination cases that the GPL does not require. (We don't think those patent termination cases are inherently a bad idea, but nonetheless they are incompatible with the GNU GPL.) »

Seen on: http://www.gnu.org/licenses/license-list.html

What are the implications of this for the use of this software inside a GPLed project?

2) Refactoring the code

The main problem with Tesseract-OCR as it is now is that it was written with C89 in mind and does not follow C99 practice at all. One obvious consequence is that porting it to 64-bit platforms will require quite some work. For example:

    ...
    typedef long INT32;
    typedef unsigned int UINT32;
    ...
    [Excerpt from ccutils/host.h]

These lines show that the authors of Tesseract applied the (wrong) 'long is an int' belief: on a platform where long is 64 bits wide, INT32 is no longer a 32-bit type. I can hardly resist quoting Henry Spencer here:

« Contrary to the heresies espoused by some of the dwellers on the Western Shore, `int' and `long' are not the same type. The moment of their equivalence in size and representation is short, and the agony that awaits believers in their interchangeability shall last forever and ever once 64-bit machines become common. »
-- Henry Spencer

But that's not the only problem in Tesseract.
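
To make the host.h point concrete, here is a minimal sketch of what I have in mind, assuming a C99 compiler that provides the exact-width types from <stdint.h>. The typedef names come from the excerpt above; BITS_OF, types_are_32_bits, require_32_bit_typedefs and print_widths are hypothetical helpers made up for this illustration and are not part of Tesseract.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical replacements for the host.h typedefs quoted above.
   The names are kept; only the underlying types change. */
typedef int32_t  INT32;
typedef uint32_t UINT32;

/* Number of bits in a type, computed from its size in bytes. */
#define BITS_OF(type) (sizeof(type) * CHAR_BIT)

/* Returns 1 if both typedefs are exactly 32 bits wide, 0 otherwise. */
int types_are_32_bits(void)
{
    return BITS_OF(INT32) == 32 && BITS_OF(UINT32) == 32;
}

/* Aborts at run time if the typedefs do not have the expected width. */
void require_32_bit_typedefs(void)
{
    assert(types_are_32_bits());
}

/* Prints the width of long next to the fixed-width typedefs. */
void print_widths(void)
{
    printf("long:   %u bits\n", (unsigned)BITS_OF(long));
    printf("INT32:  %u bits\n", (unsigned)BITS_OF(INT32));
    printf("UINT32: %u bits\n", (unsigned)BITS_OF(UINT32));
}
```

Calling print_widths() from a small main() shows the difference directly: on an LP64 system long is reported as 64 bits, while INT32 and UINT32 stay at 32 bits on every platform that provides int32_t and uint32_t.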

After browsing the code and investigating a bit (using Doxygen to generate some extra documentation about the class hierarchy), my conclusions are:

- The code breaks the whole spirit of the C99 type system and has to be redone from scratch if we want 64-bit compatibility;

- Looking at the (hairy) class hierarchy did not convince me that C++ was really required here; I would really go for C instead;

- The data structures are quite classical and should be taken from an existing library (glib or others)... but this contradicts the fact that I want desktop independence. So, for now, I just push this choice onto the stack, hoping not to have to make a decision too soon.

- As the cleaning and refactoring of the code might take quite some time, Alan Horkan suggested first coming up with a GNOME wrapper around the existing interface and then making the back-end evolve. This is probably the best way to go, and at the same time it keeps alive the hope of getting Tesseract into other projects.

So, do all these choices appear to be OK, or am I a stupid git who forgot something vital? :)

Well, that's about all... As I am quite busy (and sloooow), I'll try to set up a website and a small SVN repository around Christmas, and I'll keep you informed about my progress.

Regards
-- 
Emmanuel Fleury              | Office: 261
Associate Professor,         | Phone: +33 (0)5 40 00 69 34
LaBRI, Domaine Universitaire | Fax:   +33 (0)5 40 00 66 69
351, Cours de la Libération  | email: [EMAIL PROTECTED]
33405 Talence Cedex, France  | URL: http://www.labri.fr/~fleury
_______________________________________________
desktop-devel-list mailing list
[email protected]
http://mail.gnome.org/mailman/listinfo/desktop-devel-list
