Hi Charles, Roberto and Steve, I'm loving this software. I'm actually planning to start a document digitization business for small businesses, and this is the software I like best so far.
I'm having the same problem as you two: a simple error from the binaries in the OCR queue. Following Roberto's advice, I got as far as running unpaper by hand on a PDF and hit the same error about "the magic %P". unpaper doesn't handle PDF! So, Roberto, could you give us another way to check what is going on inside Mayan so we can simulate it by hand?

As far as I can see, no PDF output file is produced from the "document as an image" in the temporary folder, just an empty file called IHAKtmp, so I guess the problem is in the first step, which should be the LibreOffice JPG-to-PDF conversion. That would make sense, since we are all using the same versions of unpaper and tesseract but may not be using the same LibreOffice.

I'm in a hurry trying to figure out which software is best for my company, and I would happily make a donation once I have it working locally.

Also, while trying to solve this issue I've come to these observations:

1 --------------------------
Tesseract needs its language training files under /usr/local/share/tessdata in order to work, like this:

    lau@lau-H61M-D2-B3:/usr/local/share/tessdata$ ls
    cat.traineddata   eng.cube.fold  eng.cube.params     eng.tesseract_cube.nn
    configs           eng.cube.lm    eng.cube.size       eng.traineddata
    eng.cube.bigrams  eng.cube.nn    eng.cube.word-freq  tessconfigs

2 --------------------------
Running tesseract directly on the .jpg from the scan gives EXTREMELY better results than giving it a ppm "cleaned" by unpaper: in the first case only 5 words on a page were wrong, while with the cleaned ppm tesseract produced only 3 comprehensible words on the whole page. Tesseract does not accept a PDF (a JPG converted via LibreOffice) at all; it fails with:

    lau@lau-H61M-D2-B3:/tmp$ tesseract tarja.pdf tessed
    Tesseract Open Source OCR Engine v3.02.02 with Leptonica
    Error in pixReadStream: Unknown format: no pix returned
    Error in pixRead: pix not read
    Unsupported image type.
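The "magic %P" message fits with unpaper being handed a PDF: PDF files begin with the bytes %PDF, while the PNM images unpaper expects begin with P1 to P6. For anyone who wants to check what the temporary files really are before running unpaper or tesseract on them by hand, here is a minimal Python sketch. The names sniff_format and sniff_file are my own and are not part of Mayan; this only looks at the first bytes and is not a full format check.

```python
def sniff_format(header: bytes) -> str:
    """Guess a file format from its first bytes (illustrative helper only)."""
    if not header:
        return "empty"  # e.g. a zero-length temporary file
    if header.startswith(b"%PDF"):
        return "pdf"  # PDF files start with "%PDF-<version>"
    if header.startswith(b"\xff\xd8\xff"):
        return "jpeg"  # JPEG start-of-image marker
    if header[:2] in (b"P1", b"P2", b"P3", b"P4", b"P5", b"P6"):
        return "pnm"  # PBM/PGM/PPM family
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first few bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(8))
```

Calling sniff_file on the empty temporary file would report "empty", and on a LibreOffice-converted document it would report "pdf", which is a quick way to see which step produced the wrong kind of file.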
3 ----------------------------
Having a metadata tag in Mayan that indicates the document's language, and using it to set tesseract's language flag, could improve results a lot! (50 words per page) If my project ends up using Mayan, I will try to program this feature.
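To make the idea concrete, here is a rough Python sketch of how a language metadata value could be turned into a tesseract command line. Everything here is an assumption on my side: the LANGUAGE_CODES mapping, the function names and the fallback to "eng" are hypothetical and are not Mayan code. The only things taken as given are tesseract's -l option and the .traineddata files in the tessdata folder.

```python
import os

# Hypothetical mapping from a metadata value to a tesseract language code.
LANGUAGE_CODES = {"english": "eng", "catalan": "cat"}


def installed_languages(tessdata_dir):
    """Return the language codes that have a .traineddata file in tessdata_dir."""
    suffix = ".traineddata"
    return sorted(
        name[: -len(suffix)]
        for name in os.listdir(tessdata_dir)
        if name.endswith(suffix)
    )


def build_tesseract_command(image_path, output_base, language_tag,
                            available, default="eng"):
    """Build the argument list for: tesseract <image> <outputbase> -l <lang>."""
    code = LANGUAGE_CODES.get(language_tag.strip().lower(), default)
    if code not in available:
        # Fall back when the training data for that language is not installed.
        code = default
    return ["tesseract", image_path, output_base, "-l", code]
```

With the tessdata folder listed above, installed_languages would report cat and eng, so a document tagged "Catalan" would be run with -l cat and anything else with -l eng. The resulting list could be passed to subprocess.call.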
