Hi!
Upstream (who is not overly excited about the idea of supporting random
git snapshots of Tesseract) speaking here.
* Helmut Grohne <[email protected]>, 2018-01-02, 13:47:
But for the new tesseract the output is:
Error opening data file /usr/share/tesseract-ocr/4.00/nonexistent.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your
"tessdata" directory.
Failed loading language 'nonexistent'
Tesseract couldn't load any languages!
Could not initialize tesseract.
Note in particular that the error message lacks the tessdata
subdirectory.
The commit that introduced this change seems to be:
https://github.com/tesseract-ocr/tesseract/commit/1cc511188d980a33742d2699f9927ed1c84e81de
(grep for "Try without tessdata")
The commit message doesn't explain why it was made. There's no changelog
entry for it either. Yay...
Anyway, I've implemented a work-around in ocrodjvu:
https://github.com/jwilk/ocrodjvu/commit/b41f643d82f544cc15660e0d3292e31136e3d37b
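For illustration only, here is a minimal Python sketch of how a caller could recover the data directory from the error output quoted above, whether or not the path contains a "tessdata" component. This is not the code from the commit; the helper name guess_tessdata_dir and the "older layout" path in the usage note are assumptions made up for the example.

```python
import posixpath
import re

# Hypothetical helper, NOT the actual ocrodjvu work-around.
# It only relies on the "Error opening data file <path>" line
# shown in the error output quoted in this message.
_error_re = re.compile(r'^Error opening data file (?P<path>.+)$', re.MULTILINE)

def guess_tessdata_dir(stderr_text, probe_lang='nonexistent'):
    """Return the directory Tesseract tried to load probe_lang from, or None."""
    match = _error_re.search(stderr_text)
    if match is None:
        return None
    path = match.group('path').strip()
    directory, filename = posixpath.split(path)
    # Only trust the line if it refers to the language we probed for.
    if filename != probe_lang + '.traineddata':
        return None
    # Take the directory as-is, without assuming it ends in "tessdata".
    return directory
```

With the output quoted above this yields /usr/share/tesseract-ocr/4.00; for a path such as /usr/share/tesseract-ocr/tessdata/nonexistent.traineddata (an assumed older layout) it would yield the tessdata directory itself.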
In the long run, ocrodjvu should switch to using the --list-langs
option. But this is currently super slow for some reason:
$ time tesseract --list-langs > /dev/null
real 0m0.367s
user 0m0.333s
sys 0m0.032s
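
A rough Python sketch of what the --list-langs approach might look like follows. It assumes that the output consists of a header line ending with a colon followed by one language name per line, and it merges stdout and stderr because I would not rely on which stream a given Tesseract version uses. Both function names are hypothetical.

```python
import subprocess

def parse_list_langs(output):
    """Extract language names from the text printed by --list-langs.

    Assumption: any header line ends with ':' and every other
    non-empty line is a language name.
    """
    langs = []
    for line in output.splitlines():
        line = line.strip()
        if not line or line.endswith(':'):
            continue
        langs.append(line)
    return langs

def list_langs(executable='tesseract'):
    """Run "tesseract --list-langs" and return the parsed language list."""
    proc = subprocess.run(
        [executable, '--list-langs'],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge streams; see the assumption above
        universal_newlines=True,
    )
    return parse_list_langs(proc.stdout)
```

Since each call starts a new process, a caller would probably want to run list_langs() once and cache the result.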
--
Jakub Wilk