Re: [Mayan EDMS: 281] Re: OCR error on .doc files

Roberto Rosario Wed, 26 Sep 2012 07:42:48 -0700

Hi,

Go to the Tools menu, then click the OCR button, to see the queue of documents 
waiting for OCR processing. Processed documents are deleted from the queue, 
and the ones with errors remain in the queue along with the error message 
they produced.  If the error was internal, an exception is raised and 
Mayan stores the text of the exception. If the error came from an external 
executable, Mayan tries to capture the text from STDERR or STDOUT, if the 
executable provides anything.  If only the word 'error' appears, the failure 
most likely happened while running an external binary that did not return 
any error message (typical with tesseract 2.x), and it has to be 
diagnosed by hand.
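
To illustrate the behaviour described above, here is a minimal Python sketch of how the output of an external program can be captured, with a fallback to the bare word 'error' when the program fails silently. This is not Mayan's actual code; the function name run_and_capture is made up for this example.

```python
import subprocess
import sys


def run_and_capture(argv):
    """Run an external command and return (ok, message).

    On failure the message is whatever the command wrote to STDERR,
    else STDOUT, else the bare word 'error' if it printed nothing.
    """
    try:
        proc = subprocess.run(argv, capture_output=True, text=True)
    except OSError as exc:
        # The binary is missing or cannot be executed at all.
        return False, str(exc)
    if proc.returncode == 0:
        return True, proc.stdout.strip()
    message = proc.stderr.strip() or proc.stdout.strip() or "error"
    return False, message


if __name__ == "__main__":
    # A command that exits with a failure status without printing anything,
    # similar to the silent tesseract 2.x failures mentioned above.
    print(run_and_capture([sys.executable, "-c", "import sys; sys.exit(1)"]))
```

A command that fails without writing anything leaves only the placeholder word, which is why those cases need to be reproduced by hand.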

Try converting one of the documents that gives you the error by hand:

libreoffice --headless --convert-to pdf <file> --outdir /tmp

If it converts correctly, process the resulting PDF file with unpaper:

unpaper --overwrite --no-multi-pages </tmp/pdf file> </tmp>

Then run OCR on the corresponding output file from unpaper:

tesseract <unpaper /tmp file input> <output base name>

Hopefully one of these steps will give an error message that points in the 
right direction to fix it.
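
If you would rather not type the three commands one by one, the manual check can be sketched in Python roughly as follows. It runs the steps in order and reports the first one that fails, together with whatever it printed. The function names and the four file arguments are placeholders for this example, and the output base name passed to tesseract is an assumption about how you want the result named; adjust the paths to the files each step actually produces on your system.

```python
import subprocess
import sys


def first_failing_step(steps):
    """Run (name, argv) pairs in order.

    Return (name, output) for the first step that fails,
    or None if every step exits with status 0.
    """
    for name, argv in steps:
        try:
            proc = subprocess.run(argv, capture_output=True, text=True)
        except OSError as exc:
            # The binary is not installed or not on the PATH.
            return name, str(exc)
        if proc.returncode != 0:
            output = (proc.stderr + proc.stdout).strip()
            return name, output or "(no output, exit status %d)" % proc.returncode
    return None


def manual_steps(document, pdf, cleaned, ocr_base):
    """Build the three commands from this message; all paths are placeholders."""
    return [
        ("libreoffice", ["libreoffice", "--headless", "--convert-to", "pdf",
                         document, "--outdir", "/tmp"]),
        ("unpaper", ["unpaper", "--overwrite", "--no-multi-pages", pdf, cleaned]),
        ("tesseract", ["tesseract", cleaned, ocr_base]),
    ]


if __name__ == "__main__" and len(sys.argv) == 5:
    # Usage: python check_ocr.py <document> <pdf> <unpaper output> <ocr base>
    print(first_failing_step(manual_steps(*sys.argv[1:5])))
```

A result of None means all three steps exited cleanly, in which case the problem is more likely on the Mayan side than in the external tools.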

Also try creating a simple test .docx document (e.g. lorem ipsum), upload 
it to Mayan, and see if it goes through OCR.

--Roberto

On Thursday, September 20, 2012 3:59:02 AM UTC-4, Charles McEvoy wrote:
>
> Thanks. 
> Tesseract 3.02 and unpaper 0.3 are installed. 
> Sorry to be ignorant, but I don't know how to generate or find the 
> logfiles - 
> could you point me the right way? Google hasn't helped this time! 
> Charles 
>
>  
>
