The conversion logic is complex and I had to look at the code. You are 
right, there are two steps missing from my suggestion.  The logic is:
Office doc -> PDF
PDF -> JPG
JPG -> PPM
PPM -> TIFF
TIFF -> Tesseract
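
For checking that the right tools are installed, the chain above can be written down as data and paired with a PATH lookup. This is only a rough sketch, not Mayan's converter code: the stage order comes from the list above, but the binary names in CANDIDATE_BINARIES are assumptions (only unpaper and Tesseract are named in this thread), so adjust them to whatever your install actually calls.

```python
import shutil

# Stage order as listed above: (input format, output format).
PIPELINE = [
    ("Office doc", "PDF"),
    ("PDF", "JPG"),
    ("JPG", "PPM"),
    ("PPM", "TIFF"),
    ("TIFF", "Tesseract"),
]

# Hypothetical binary names, NOT taken from Mayan's source.
# Only unpaper and tesseract are mentioned in this thread.
CANDIDATE_BINARIES = ["soffice", "convert", "unpaper", "tesseract"]


def describe(pipeline):
    """Render the stage list as a single 'A -> B -> C' string."""
    return " -> ".join([pipeline[0][0]] + [dst for _, dst in pipeline])


def missing_binaries(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    print(describe(PIPELINE))
    print("Not found on PATH:", missing_binaries(CANDIDATE_BINARIES))
```

Anything reported as not found is a candidate for the step that fails, although a binary being on your PATH does not guarantee the web server's user can see it too.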

Why the TIFF step?  Because the old Tesseract (<2.0) only supported TIFF 
files. I think the new version (3.02) supports more formats, so that is 
something to look at when the time comes to refactor the converter.

Mayan stores whatever the OCR binaries write to STDERR when they run, so 
the simple 'error' message is what Mayan is getting from the command line.  
I'm already using PBS (http://pypi.python.org/pypi/pbs) in some places to 
call binaries, and I'm planning to use it in the converter to simplify things 
and hopefully capture more error information when things go wrong on the 
command line behind the scenes.
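
To illustrate what "more error information" could look like, here is a minimal sketch that runs a command and keeps its exit code and STDOUT as well as STDERR. It uses Python's standard subprocess module rather than PBS, and run_and_report is a made-up helper name, not something that exists in Mayan.

```python
import subprocess
import sys


def run_and_report(argv):
    """Run argv and return (returncode, stdout, stderr) as text.

    Hypothetical helper for illustration; not part of Mayan.
    """
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


if __name__ == "__main__":
    # Stand-in for a failing binary: writes to STDERR and exits non-zero.
    code, out, err = run_and_report(
        [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(3)"]
    )
    print("exit code:", code)
    print("stdout:", repr(out))
    print("stderr:", repr(err))
```

Running the same helper by hand against each binary in the chain, with the same arguments the converter would use, should show which step produces the bare 'error' text.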

--Roberto



On Thursday, November 29, 2012 7:59:38 AM UTC-4, Steve Kersley wrote:
>
> Following up my own post, I think the unpaper error message is a red 
> herring.
>
> As far as I can see from looking through the source of unpaper, it *only* 
> inputs/outputs ppm/pbm/pgm format files, and can't read or write a PDF.  So 
> either I'm using the wrong version of unpaper or Roberto's suggested manual 
> tests were wrong?
>
> What exactly is the file format pipeline for importing an Office file so 
> that I can check that I have all of the right tools, and they operate 
> properly?
>
> Is there any way to get more information than just 'error'?  I've tried 
> enabling DEBUG=True, but I don't appear to be getting any more output in 
> the apache error logs, and it doesn't seem to be generating any other 
> logfile that I've found.  
>
> Cheers,
> Steve.
>
