Hi Lau, I'm glad! A lot of people are doing it and I'm very happy my software keeps creating commercial opportunities!
Yeah, my recommendation is missing two steps. I had to convert to TIFF as mentioned above, and UNPAPER does de-skewing, which is why I added it to the workflow at the time (there's a rough sketch of that pipeline at the end of this message). I'm interested in testing the new Tesseract to see if it can cope with skewed images better, and then remove UNPAPER or make it optional with a config option. Checking the Wikipedia page for Tesseract (http://en.wikipedia.org/wiki/Tesseract_%28software%29), it only added support for other file formats from 3.00 onwards. The good thing is that it now has hOCR support, which is interesting because with it text and image highlighting is possible, as it provides a correlation of image coordinates to recognized text.

I added per-language support at one time, but sometimes a document may have more than one language in it, so I implemented per-page language; but sometimes pages have more than one language in them (this happens in Puerto Rico a lot: English and Spanish are the main languages of the island and are intermixed). In my experience this yielded poor results for the language other than the one selected, so I removed the feature. But I'm open to giving it another go with the new version of Tesseract.

--Roberto

On Wednesday, December 5, 2012 9:07:00 PM UTC-4, Lau Llobet wrote:
>
> Hi Charles, Roberto and Steve,
>
> I'm loving this software. I'm actually planning to start a business of
> file digitization for small businesses, and this software is the one
> I'm liking most.
>
> I'm having the same problem as you two: a simple error given by the
> binaries in the OCR queue.
>
> Following Roberto's advice, I'm stuck running unpaper on a PDF, with
> the same error about "the magic %P"; unpaper doesn't handle PDF! So
> Roberto, maybe you can give us another way to check what is going on
> inside Mayan so we can simulate it by hand.
>
> As far as I see there's no PDF output file from the "document as an
> image" in the temporary folder, just a file called IHAKtmp which is
> empty, so I guess the problem is at the first step, which should be the
> LibreOffice JPG-to-PDF conversion. That may make sense, since we are
> all using the same version of unpaper and tesseract but we may not be
> using the same LibreOffice.
>
> I'm in a hurry trying to figure out which software is best for my
> company, and I would happily make a donation once I have it working
> locally.
>
> Also, while trying to solve this issue I've come to these observations:
>
> 1
> --------------------------
> Tesseract has to have its language training files in /usr/local/
> in order to work, like this:
>
> lau@lau-H61M-D2-B3:/usr/local/share/tessdata$ ls
> cat.traineddata   eng.cube.fold  eng.cube.params     eng.tesseract_cube.nn
> configs           eng.cube.lm    eng.cube.size       eng.traineddata
> eng.cube.bigrams  eng.cube.nn    eng.cube.word-freq  tessconfigs
>
> 2
> --------------------------
> Making tesseract work on a .jpg from the scan gives EXTREMELY
> better results than giving it a .ppm "cleaned" by unpaper: in the
> first case only 5 words in a page were mistaken, while on a cleaned .ppm
> tesseract gave only 3 comprehensible words in the whole page. No PDF
> (jpg converted via LibreOffice) is accepted by tesseract, giving a:
>
> lau@lau-H61M-D2-B3:/tmp$ tesseract tarja.pdf tessed
> Tesseract Open Source OCR Engine v3.02.02 with Leptonica
> Error in pixReadStream: Unknown format: no pix returned
> Error in pixRead: pix not read
> Unsupported image type.
> 3
> ----------------------------
> Having a metadata tag indicating a language in Mayan, and using it to
> set the language flag of tesseract, can improve results a lot! (50
> words per page.) If my project ends up using Mayan I would try to
> program this feature.
> --
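P.S. To simulate what the OCR queue does by hand, here is a rough sketch of the pipeline I described above (image conversion, unpaper cleanup, then Tesseract). It assumes ImageMagick, unpaper and tesseract 3.02 are on the PATH; the file names are just placeholders and the exact options Mayan passes may differ:

  # Render the PDF pages to plain PPM images first; unpaper and tesseract do not read PDF.
  convert -density 300 input.pdf page-%03d.ppm

  # De-skew/clean one page with unpaper (it reads and writes PBM/PGM/PPM only).
  unpaper page-000.ppm cleaned-000.ppm

  # OCR the cleaned page. -l picks the language pack; adding the "hocr" config file
  # writes hOCR output (recognized text plus its image coordinates) instead of plain text.
  tesseract cleaned-000.ppm output-000 -l eng hocr

Newer Tesseract releases can reportedly take several languages at once with something like -l eng+spa, which might be worth trying before re-adding a per-page language setting.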

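On points 1 and 3 of the quoted message: rather than copying the training files around, Tesseract can usually be pointed at them with the TESSDATA_PREFIX environment variable, and a per-document language tag could simply be passed through to -l. The variable is real; the metadata part is only a hypothetical illustration, not something Mayan does today:

  # TESSDATA_PREFIX must point at the directory that contains tessdata/
  # (so /usr/local/share/ here, since the packs live in /usr/local/share/tessdata).
  export TESSDATA_PREFIX=/usr/local/share/

  # Hypothetical: a language tag stored as document metadata ("spa", "eng", ...)
  # handed straight to tesseract's -l option.
  LANG_TAG=spa
  tesseract cleaned-000.ppm output-000 -l "$LANG_TAG"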