The conversion logic is complex, so I had to look at the code. You are right: there are two steps missing from my suggestion. The logic is:

Office doc -> PDF
PDF -> JPG
JPG -> PPM
PPM -> TIFF
TIFF -> Tesseract

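To check a chain like this one stage at a time, something along these lines might help. It is only a rough sketch, not the actual Mayan converter code: `run_stages` and `StageError` are made-up names, and the demo commands are placeholders that call the Python interpreter instead of the real conversion binaries, so substitute whatever commands your install actually uses.

```python
import subprocess
import sys


class StageError(Exception):
    """Raised when one stage of the chain exits with a non-zero status."""


def run_stages(stages):
    """Run (name, argv) pairs in order and stop at the first failure.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, argv in stages:
        result = subprocess.run(argv, capture_output=True, text=True)
        if result.returncode != 0:
            # Keep the stage name, exit code and stderr so the failure is traceable.
            raise StageError(
                f"{name} failed (exit {result.returncode}): {result.stderr.strip()}"
            )
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Placeholder commands only: the third stage is made to fail on purpose.
    succeed = [sys.executable, "-c", "pass"]
    fail = [sys.executable, "-c",
            "import sys; sys.stderr.write('bad input'); sys.exit(2)"]
    demo = [
        ("Office doc -> PDF", succeed),
        ("PDF -> JPG", succeed),
        ("JPG -> PPM", fail),
        ("PPM -> TIFF", succeed),
        ("TIFF -> Tesseract", succeed),
    ]
    try:
        run_stages(demo)
    except StageError as exc:
        print(exc)
```

Running each stage separately and keeping its stderr at least tells you which step broke, rather than getting a single 'error' for the whole chain.
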
Why the TIFF step? Because the old Tesseract (<2.0) only supported TIFF files. I think the new version (3.02) supports more formats, so that is something to look at when the time comes to refactor the converter.

Mayan stores whatever the OCR binaries write to STDERR when they are executed, so the bare 'error' message is all that Mayan is getting from the command line. I'm already using PBS (http://pypi.python.org/pypi/pbs) in some places to call binaries, and I am planning to use it in the converter to simplify things and hopefully capture more error information when things go wrong on the command line behind the scenes.

--Roberto

On Thursday, November 29, 2012 7:59:38 AM UTC-4, Steve Kersley wrote:
>
> Following up my own post, I think the unpaper error message is a red
> herring.
>
> As far as I can see from looking through the source of unpaper, it *only*
> inputs/outputs ppm/pbm/pgm format files, and can't read or write a PDF. So
> either I'm using the wrong version of unpaper or Roberto's suggested manual
> tests were wrong?
>
> What exactly is the file format pipeline for importing an Office file so
> that I can check that I have all of the right tools, and they operate
> properly?
>
> Is there any way to get more information than just 'error'? I've tried
> enabling DEBUG=True, but I don't appear to be getting any more output in
> the apache error logs, and it doesn't seem to be generating any other
> logfile that I've found.
>
> Cheers,
> Steve.
