So, I tried to get to the bottom of it. @RobertoRosario, would you please clarify at least my first question:
Do you also call the extraction of existing text "OCR processing"? *Yes / No?*

This is what I suspect after spending some time trying to understand what happens in the code (I am not a dev), making sure python-pdfminer is installed, and watching the logs for NoMIMETypeMatch and ParserError. I then kept "OCR processing" enabled, watched the task manager, and threw a few dozen files into the upload queue. Result: no visible tesseract process, and everything finished much faster than real OCR processing would have.

If my conclusions are right, everything works as it should and has all along. But if you, dear reader, understand OCR as "Optical Character Recognition" (like I do) and not as "parse existing text from documents, and if that fails, do a real Optical Character Recognition", as I *believe* happens here, you are very likely to waste the same amount of time when you try to plan things from the beginning.

Yes, that was also a little rant, but hopefully this clarification (if it is one) can be seen as a contribution, too.

Now let's find out how to update to 2.2 when it's available. If I find that documented somewhere, I might have some time left afterwards to translate some phrases into German. Motivation is there ;)
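
For anyone who wants the behaviour I suspect spelled out, here is a minimal sketch in Python. It is not Mayan EDMS code and I have not checked it against the real implementation. The names `extract_text`, `parse_embedded_text`, `run_ocr` and the `ParserError` class defined here are hypothetical stand-ins, chosen only to illustrate "try the parser first, fall back to real OCR".

```python
from typing import Callable, Tuple


class ParserError(Exception):
    """Hypothetical stand-in for 'the parser could not get any text'."""


def extract_text(
    path: str,
    parse_embedded_text: Callable[[str], str],
    run_ocr: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (text, method), where method is 'parser' or 'ocr'."""
    try:
        # Cheap path: read text that is already stored in the file.
        text = parse_embedded_text(path)
        if text.strip():
            return text, "parser"
    except ParserError:
        pass  # No usable text layer, fall through to real OCR.
    # Expensive path: actual Optical Character Recognition.
    return run_ocr(path), "ocr"


# Fake backends, for illustration only.
def fake_parser(path: str) -> str:
    if path.endswith("scan.pdf"):
        raise ParserError("no text layer")
    return "embedded text"


def fake_ocr(path: str) -> str:
    return "recognized text"


print(extract_text("report.pdf", fake_parser, fake_ocr))
print(extract_text("scan.pdf", fake_parser, fake_ocr))
```

If it really works like this, a file with a text layer never reaches the OCR step, which would explain why I saw no tesseract process and why everything finished so quickly.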
