Bug#604520: tesseract-ocr: wordlist2dawg is very slow

Jakub Wilk Mon, 22 Nov 2010 07:57:17 -0800

Package: tesseract-ocr
Version: 2.04-2+b1
Severity: normal

On my machine[0] it takes almost 4 minutes to process /usr/share/dict/words. I tried to build a DAWG for a Polish dictionary with more than 3 million words, but I gave up after 2 hours of waiting.

Unless I'm missing something, building DAWGs shouldn't be *that* slow. E.g. dawgdic[1] is able to build a DAWG (in a different format, but still...) for the Polish dictionary in a few seconds.
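
To illustrate why a few seconds is a plausible expectation, here is a minimal sketch of the well-known incremental construction of a minimal DAWG from a sorted word list (the approach described by Daciuk et al.). It is not the code used by wordlist2dawg or by dawgdic, and it does not produce either tool's file format; the names Node, build_dawg, contains and count_states are made up for this example.

```python
class Node:
    """One DAWG state: outgoing edges plus an end-of-word flag."""
    __slots__ = ("edges", "final")

    def __init__(self):
        self.edges = {}
        self.final = False

    def key(self):
        # Two states are equivalent when they agree on finality and on every
        # (letter, target) pair. Targets are canonical nodes kept alive by the
        # register, so id() is a stable identity for them.
        return (self.final,
                tuple(sorted((c, id(n)) for c, n in self.edges.items())))


def build_dawg(sorted_words):
    """Build a minimal DAWG from words in strictly ascending order."""
    root = Node()
    register = {}    # state signature -> canonical node
    unchecked = []   # (parent, letter, child) along the previous word's path
    previous = None

    def minimize(down_to):
        # Merge states of the previous word's suffix, deepest first.
        while len(unchecked) > down_to:
            parent, letter, child = unchecked.pop()
            parent.edges[letter] = register.setdefault(child.key(), child)

    for word in sorted_words:
        if previous is not None and word <= previous:
            raise ValueError("input must be sorted and free of duplicates")
        # Length of the prefix shared with the previous word.
        common = 0
        for a, b in zip(word, previous or ""):
            if a != b:
                break
            common += 1
        minimize(common)
        # Continue from the last state of the shared prefix.
        node = unchecked[-1][2] if unchecked else root
        for letter in word[common:]:
            child = Node()
            node.edges[letter] = child
            unchecked.append((node, letter, child))
            node = child
        node.final = True
        previous = word

    minimize(0)
    return root


def contains(root, word):
    """Return True if word is accepted by the DAWG."""
    node = root
    for letter in word:
        node = node.edges.get(letter)
        if node is None:
            return False
    return node.final


def count_states(root):
    """Count the distinct states reachable from root."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node.edges.values())
    return len(seen)
```

Each word only touches the states on its own path, and each state is looked up in the register once, so the work grows roughly linearly with the total number of characters in the input (assuming hash lookups take constant time on average). The input has to be sorted first, e.g. with sorted(set(words)).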



[0] $ cat /proc/cpuinfo | grep bogo
bogomips        : 4620.39
bogomips        : 4620.39

[1] http://code.google.com/p/dawgdic/


-- System Information:
Debian Release: squeeze/sid
  APT prefers unstable
  APT policy: (990, 'unstable'), (500, 'experimental'), (500, 'testing')
Architecture: i386 (i686)

Kernel: Linux 2.6.32-5-686 (SMP w/2 CPU cores)
Locale: LANG=C, LC_CTYPE=pl_PL.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages tesseract-ocr depends on:
ii  libc6                     2.11.2-7       Embedded GNU C Library: Shared lib
ii  libgcc1                   1:4.5.1-11     GCC support library
ii  libjpeg62                 6b1-1          The Independent JPEG Group's JPEG
ii  libstdc++6                4.4.5-8        The GNU Standard C++ Library v3
ii  libtiff4                  3.9.4-5        Tag Image File Format (TIFF) libra
ii  tesseract-ocr-eng [tesser 2.00-2         tesseract-ocr language files for E
ii  tesseract-ocr-spa [tesser 2.00-2         tesseract-ocr language files for S
ii  zlib1g                    1:1.2.5.dfsg-1 compression library - runtime

--
Jakub Wilk


