Bug#886113: ocrodjvu does not find any languages with tesseract 4.x

Jakub Wilk Tue, 02 Jan 2018 07:33:39 -0800

Hi!

Upstream (who is not overly excited about the idea of supporting random git snapshots of Tesseract) speaking here.


* Helmut Grohne <[email protected]>, 2018-01-02, 13:47:

But for the new tesseract the output is:

   Error opening data file /usr/share/tesseract-ocr/4.00/nonexistent.traineddata
   Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
   Failed loading language 'nonexistent'
   Tesseract couldn't load any languages!
   Could not initialize tesseract.

Note in particular that the error message lacks the tessdata subdirectory.


The commit that introduced this change seems to be:
https://github.com/tesseract-ocr/tesseract/commit/1cc511188d980a33742d2699f9927ed1c84e81de
(grep for "Try without tessdata")

The commit message doesn't explain why it was made. There's no changelog entry for it either. Yay...


Anyway, I've implemented a work-around in ocrodjvu:
https://github.com/jwilk/ocrodjvu/commit/b41f643d82f544cc15660e0d3292e31136e3d37b
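
For illustration only, here is a rough sketch of the general idea: take the directory from the "Error opening data file" line and, if it doesn't end in "tessdata", try that subdirectory. This is not the code from the commit above; the helper name guess_tessdata_dir and the injectable isdir parameter are made up for the example.

```python
import os
import re

# Matches the "Error opening data file" line from the output quoted above.
# Leading whitespace is tolerated in case the text has been indented.
_error_re = re.compile(
    r'^[ \t]*Error opening data file (?P<dir>/.*)/nonexistent\.traineddata[ \t]*$',
    re.MULTILINE,
)

def guess_tessdata_dir(error_output, isdir=os.path.isdir):
    """Hypothetical helper: guess the tessdata directory from error output.

    Returns None if the expected error line is not present.
    """
    match = _error_re.search(error_output)
    if match is None:
        return None
    directory = match.group('dir')
    # Newer snapshots report the parent directory, so try appending
    # "tessdata" and keep the result only if it really exists.
    if os.path.basename(directory) != 'tessdata':
        candidate = os.path.join(directory, 'tessdata')
        if isdir(candidate):
            return candidate
    return directory
```

The isdir parameter is there only so that the function can be exercised without touching the file system.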

In the long run, ocrodjvu should switch to using the --list-langs option. But this is currently super slow for some reason:


  $ time tesseract --list-langs > /dev/null

  real  0m0.367s
  user  0m0.333s
  sys   0m0.032s
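
A minimal sketch of what using --list-langs could look like in Python 3. It assumes that the output is a header line ending with a colon followed by one language code per line, and that the stream the list is written to may differ between Tesseract versions, so both are captured. The function names are made up for the example.

```python
import subprocess

def parse_list_langs(output):
    """Extract language codes from the text printed by --list-langs."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Assumption: the header line ends with a colon; skip it and blanks.
        if not line or line.endswith(':'):
            continue
        langs.append(line)
    return langs

def list_languages(executable='tesseract'):
    """Run Tesseract and return the languages it reports."""
    proc = subprocess.run(
        [executable, '--list-langs'],
        stdout=subprocess.PIPE,
        # Merge stderr into stdout, in case the list is printed there.
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return parse_list_langs(proc.stdout)
```

Keeping the parsing separate from the subprocess call makes it possible to test the former without Tesseract installed.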

--
Jakub Wilk
