The OCR app will always try to parse the text of previously OCRed PDFs, office documents and text files before attempting the OCR step (https://gitlab.com/mayan-edms/mayan-edms/blob/master/mayan/apps/ocr/classes.py#L32).
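As a rough illustration of that order (embedded text first, OCR only as a fallback), here is a minimal Python sketch. It is not the code from `classes.py`; the names `extract_text`, `Parser` and `poppler_available` are hypothetical and only show the idea. The one outside fact it relies on is that the `pdftotext` binary ships with `poppler-utils`.

```python
import shutil
from typing import Callable, Iterable, Optional

# Hypothetical type for illustration; not a Mayan EDMS class.
# A parser takes a file path and returns the embedded text, or None.
Parser = Callable[[str], Optional[str]]


def extract_text(path: str, parsers: Iterable[Parser], ocr: Callable[[str], str]) -> str:
    """Return embedded text if any parser finds some, otherwise fall back to OCR."""
    for parser in parsers:
        try:
            text = parser(path)
        except Exception:
            # A failing parser should not block the remaining ones.
            continue
        if text and text.strip():
            return text
    # No parser produced usable text, so run the (slower) OCR step.
    return ocr(path)


def poppler_available() -> bool:
    """True if the pdftotext binary (shipped in poppler-utils) is on PATH."""
    return shutil.which("pdftotext") is not None
```

If `poppler_available()` returns `False` on your server, that would be one plausible reason for documents going straight to OCR instead of being parsed.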
Several parsers can be registered and will be tried in sequence. A Poppler and a PDFMiner parser are included by default (https://gitlab.com/mayan-edms/mayan-edms/blob/master/mayan/apps/ocr/parsers.py#L201). The PDFMiner parser could be removed if a viable, drop-in replacement that supports Python 3.x is not found by the next release. If the text is not being parsed, check the logs and make sure the package `poppler-utils` is installed. If a stable, Python-only PDF text parser is found, these binary dependencies can be removed.

On the topic of activity: The project is released free of charge with almost all rights provided to change and reuse the code. Expecting fast, on-point, free support in addition to that is unrealistic. Low participation for technical queries in forums and mailing lists is a common situation with open projects. Any suggestions or ideas to help improve on that are welcome. Bear in mind that many (if not most) subscribers to this list are not developers but users like yourself. Expecting professional advice from other users is unrealistic.

Core contributors, a few developers, devops personnel and I visit the list from time to time, but this is not the only task we do in the project. There is also backend code, API code, frontend code, deployments (Docker, Salt, Fabric, etc.), code testing, compatibility testing (database, Python versions, OS, cloud environments), documentation, translations, design decisions, consulting, ticket triage, support, customization, website, social media sites, events (DjangoCon, PyCon), etc. Any help in those other areas will translate into more time for us to answer questions on the list.

There are other non-code decisions that take a lot of time to research, e.g. Google Groups is showing its age and there is a discussion about whether or not to ditch it and move to a proper (probably paid from our pockets) forum solution. Another matter is funding and making the project self-sustaining.
To this end, Mayan EDMS, LLC, was created in the USA, with the hope that in the near future we could have paid developers working full time on the code and providing support, instead of just part-time volunteers. This means a new set of tasks, documents, and legal procedures that need to be taken care of.

Mayan EDMS was started 6 years ago and is used by the State of California, the Government of Puerto Rico, the University of Montreal, Intel, with CEMEX and Deloitte recently joining, just to name a few known names (http://www.mayan-edms.com/cases/). It is very much alive and picking up steam :)

For users or organizations needing a timely response from core contributors, be it consulting or support, paid plans are available (http://www.mayan-edms.com/providers/). Customization and rebranding are also available if needed.

There are many areas that are not code related where a little help goes a long way. Even stuff like spell checking or just taking the time to add additional information on a ticket or bug report helps a lot!

I appreciate your concerns and opinions about the project and hope that we continue sharing and discussing them.

On Tuesday, April 18, 2017 at 9:45:51 AM UTC-4, MacRobb Simpson wrote:
>
> Here's something that /may/ help:
>
> In mayan, the OCR text is located in the `ocr_documentpagecontent` table
> It's per page (unfortunate, but if you don't care, you might be able to
> just shove all your OCR'd text into Page 1 of each document).
>
> Here's a SQL query to start with:
> SELECT d.label,p.page_number,p.id FROM `documents_document` as d
> inner join `documents_documentversion` as v on d.id=v.document_id
> inner join `documents_documentpage` as p on p.document_version_id=v.id
> WHERE 1 limit 100
>
> This will get you a list of document labels (you might want the ID or other
> stuff), page numbers and unique page IDs. The Unique IDs are what you need
> to create rows in the `ocr_documentpagecontent` table.
>
> It may not be a perfect solution, but you can definitely rig up some stuff
> to get what you need, supported or not!
