Hi Lau, I'm glad! A lot of people are doing it and I'm very happy my software keeps creating commercial opportunities!
Yeah, my recommendation is missing two steps. I had to convert to TIFF as mentioned above, and UNPAPER does de-skewing, which is why I added it to the workflow at the time (there's a rough sketch of that pipeline at the end of this message). I'm interested in testing the new Tesseract to see if it can cope with skewed images better, and then remove UNPAPER or make it optional with a config option. Checking the Wikipedia page for Tesseract (http://en.wikipedia.org/wiki/Tesseract_%28software%29), it only added support for other file formats from 3.00 onwards. The good thing is that it now has hOCR support, which is interesting because with it text and image highlighting is possible, as it provides a correlation of image coordinates to recognized text.

I added per-language support at one time, but sometimes a document may have more than one language in it, so I implemented per-page language; but sometimes pages have more than one language in them (this happens in Puerto Rico a lot: English and Spanish are the main languages of the island and are intermixed). In my experience this yielded poor results for the language other than the one selected, so I removed the feature. But I'm open to giving it another go with the new version of Tesseract.

--Roberto

On Wednesday, December 5, 2012 9:07:00 PM UTC-4, Lau Llobet wrote:
>
> Hi Charles, Roberto and Steve,
>
> I'm loving this software. I'm actually planning to start a business of
> file digitization for small businesses, and this software is the one
> I'm liking most.
>
> I'm having the same problem as you two: a simple error given by the
> binaries in the OCR queue.
>
> Following Roberto's advice, I'm stuck running unpaper on a PDF, with
> the same error about "the magic %P"; unpaper doesn't handle PDF! So
> Roberto, maybe you can give us another way to check what is going on
> inside Mayan so we can simulate it by hand.
>
> As far as I see there's no PDF output file from the "document as an
> image" in the temporary folder, just a file called IHAKtmp which is
> empty, so I guess the problem is at the first step, which should be the
> LibreOffice JPG-to-PDF conversion. That may make sense, since we are
> all using the same version of unpaper and tesseract but we may not be
> using the same LibreOffice.
>
> I'm in a hurry trying to figure out which software is best for my
> company, and I would happily make a donation once I have it working
> locally.
>
> Also, while trying to solve this issue I've come to these observations:
>
> 1
> --------------------------
> Tesseract has to have its language training files in /usr/local/
> in order to work, like this:
>
> lau@lau-H61M-D2-B3:/usr/local/share/tessdata$ ls
> cat.traineddata   eng.cube.fold  eng.cube.params     eng.tesseract_cube.nn
> configs           eng.cube.lm    eng.cube.size       eng.traineddata
> eng.cube.bigrams  eng.cube.nn    eng.cube.word-freq  tessconfigs
>
> 2
> --------------------------
> Making tesseract work on a .jpg from the scan gives EXTREMELY
> better results than giving it a .ppm "cleaned" by unpaper: in the
> first case only 5 words in a page were mistaken, while on a cleaned .ppm
> tesseract gave only 3 comprehensible words in the whole page. No PDF
> (jpg converted via LibreOffice) is accepted by tesseract, giving a:
>
> lau@lau-H61M-D2-B3:/tmp$ tesseract tarja.pdf tessed
> Tesseract Open Source OCR Engine v3.02.02 with Leptonica
> Error in pixReadStream: Unknown format: no pix returned
> Error in pixRead: pix not read
> Unsupported image type.
> 3
> ----------------------------
> Having a metadata tag indicating a language in Mayan, and using it to
> set the language flag of tesseract, can improve results a lot! (50
> words per page.) If my project ends up using Mayan I would try to
> program this feature.
> --
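P.S. To simulate what the OCR queue does by hand, here is a rough sketch of the pipeline I described above (image conversion, unpaper cleanup, then Tesseract). It assumes ImageMagick, unpaper and tesseract 3.02 are on the PATH; the file names are just placeholders and the exact options Mayan passes may differ:

  # Render the PDF pages to plain PPM images first; unpaper and tesseract do not read PDF.
  convert -density 300 input.pdf page-%03d.ppm

  # De-skew/clean one page with unpaper (it reads and writes PBM/PGM/PPM only).
  unpaper page-000.ppm cleaned-000.ppm

  # OCR the cleaned page. -l picks the language pack; adding the "hocr" config file
  # writes hOCR output (recognized text plus its image coordinates) instead of plain text.
  tesseract cleaned-000.ppm output-000 -l eng hocr

Newer Tesseract releases can reportedly take several languages at once with something like -l eng+spa, which might be worth trying before re-adding a per-page language setting.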

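On points 1 and 3 of the quoted message: rather than copying the training files around, Tesseract can usually be pointed at them with the TESSDATA_PREFIX environment variable, and a per-document language tag could simply be passed through to -l. The variable is real; the metadata part is only a hypothetical illustration, not something Mayan does today:

  # TESSDATA_PREFIX must point at the directory that contains tessdata/
  # (so /usr/local/share/ here, since the packs live in /usr/local/share/tessdata).
  export TESSDATA_PREFIX=/usr/local/share/

  # Hypothetical: a language tag stored as document metadata ("spa", "eng", ...)
  # handed straight to tesseract's -l option.
  LANG_TAG=spa
  tesseract cleaned-000.ppm output-000 -l "$LANG_TAG"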