[Mayan EDMS: 371] Re: OCR error on .doc files

Lau Llobet Wed, 05 Dec 2012 17:07:04 -0800

Hi Charles, Roberto and Steve,

I'm loving this software. I'm actually planning to start a document
digitization business for small businesses, and this is the software
I like most so far.


I'm having the same problem as you two: a simple error reported by the
binaries in the OCR queue.

Following Roberto's advice, I got stuck running unpaper on a PDF, with
the same error about "the magic %P": unpaper doesn't handle PDF! So
perhaps Roberto can give us another way to check what is going on inside
Mayan, so we can simulate it by hand.
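
For what it's worth, the "%P" in that message is probably just the first
two bytes of the PDF itself: a PDF file starts with "%PDF", while the PNM
images unpaper reads start with "P1" to "P6". Below is a minimal Python
sketch for checking what a temporary file really contains before handing
it to unpaper or tesseract. The name sniff_format is made up for
illustration and is not part of Mayan.

```python
def sniff_format(path):
    """Guess a file's format from its first bytes (illustrative helper)."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    if not head:
        return "empty"          # zero-byte file, nothing was written
    if head.startswith(b"%PDF"):
        return "pdf"            # this is where a "%P" magic would come from
    if head.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if head.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    # PNM family (PBM/PGM/PPM): "P1".."P6" followed by whitespace
    if (len(head) >= 3 and head[0:1] == b"P"
            and head[1:2] in b"123456" and head[2:3].isspace()):
        return "pnm"
    return "unknown"
```

Running it over the files in the temporary folder should show quickly
whether a step produced a PDF, a PNM image, or nothing at all.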

As far as I can see, there's no PDF output file from the "document as an
image" step in the temporary folder, just an empty file called IHAKtmp,
so I guess the problem is at the first step, which should be the
LibreOffice JPG to PDF conversion. That would make sense, since we are
all using the same versions of unpaper and tesseract but may not be
using the same LibreOffice.

I'm in a hurry trying to figure out which software is best for my
company, and I would happily make a donation once I have it working
locally.


Also, while trying to solve this issue I've come to these observations:


1
--------------------------
Tesseract needs its language training files under /usr/local/ (here
/usr/local/share/tessdata) in order to work,

like this:

lau@lau-H61M-D2-B3:/usr/local/share/tessdata$ ls
cat.traineddata   eng.cube.fold  eng.cube.params     eng.tesseract_cube.nn
configs           eng.cube.lm    eng.cube.size       eng.traineddata
eng.cube.bigrams  eng.cube.nn    eng.cube.word-freq  tessconfigs
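
A quick way to check which languages are actually installed is to list
the *.traineddata files in that directory. This is only a sketch: the
default path is the one from my machine above, and the function name
installed_languages is made up for illustration.

```python
import os

def installed_languages(tessdata_dir="/usr/local/share/tessdata"):
    """Return the language codes that have a .traineddata file in tessdata_dir."""
    suffix = ".traineddata"
    if not os.path.isdir(tessdata_dir):
        return []  # directory missing: no languages found
    return sorted(
        name[: -len(suffix)]
        for name in os.listdir(tessdata_dir)
        if name.endswith(suffix)
    )
```

With the listing above this should report the "cat" and "eng" codes; an
empty list would suggest tesseract is looking in a different place than
where the files were installed.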


2
--------------------------
Running tesseract directly on a .jpg from the scan gives EXTREMELY
better results than giving it a ppm "cleaned" by unpaper. In the first
case only 5 words on a page were mistaken, while with the cleaned ppm
tesseract produced only 3 comprehensible words on the whole page. No PDF
(a JPG converted via LibreOffice) is accepted by tesseract, which gives:

lau@lau-H61M-D2-B3:/tmp$ tesseract tarja.pdf tessed
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
Error in pixReadStream: Unknown format: no pix returned
Error in pixRead: pix not read
Unsupported image type.
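
Since tesseract here only reads image files, one possible workaround is
to render the PDF pages to images first and OCR those. The sketch below
only builds the command line and assumes pdftoppm (from poppler-utils)
is installed. That tool is my assumption and I have not checked what
Mayan itself uses for this step. It also assumes tesseract's Leptonica
was built with PNG support.

```python
def rasterize_command(pdf_path, out_prefix, dpi=300):
    """Build a pdftoppm command that renders each PDF page as a PNG image.

    pdftoppm is an assumption here (not something taken from Mayan);
    it writes one image per page, named after out_prefix.
    """
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, out_prefix]
```

The resulting list can be passed to subprocess.call(), and each page
image it produces can then be given to tesseract in place of the PDF.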

3
----------------------------
Having a metadata tag indicating the document language in Mayan, and
using it to set tesseract's language flag, can improve results a lot!
(50 words per page.) If my project ends up using Mayan, I would try to
program this feature.
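
Roughly, the idea could look like the sketch below. The mapping and the
function name are hypothetical (Mayan has no such table that I know of).
The only real piece is tesseract's -l option, which selects the language
by its traineddata code.

```python
# Hypothetical mapping from a metadata value to a tesseract language code.
LANGUAGE_CODES = {"catalan": "cat", "english": "eng"}

def tesseract_command(image_path, output_base, metadata_language=None,
                      default="eng"):
    """Build a tesseract command line, choosing -l from a metadata value."""
    key = (metadata_language or "").strip().lower()
    code = LANGUAGE_CODES.get(key, default)  # fall back to the default code
    return ["tesseract", image_path, output_base, "-l", code]
```

For example, tesseract_command("tarja.jpg", "tessed", "Catalan") builds
the equivalent of "tesseract tarja.jpg tessed -l cat", and a document
with no language tag falls back to "eng".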
