Just to add a quick note: I'm sure there are many people who, like me, read the mailing list but don't chime in unless they have a useful answer to offer.
On Fri, Apr 21, 2017 at 1:42 AM, Roberto Rosario < [email protected]> wrote:
> The OCR app will always try to parse the text of previously OCRed PDFs,
> office documents and text files before attempting the OCR step
> (https://gitlab.com/mayan-edms/mayan-edms/blob/master/mayan/apps/ocr/classes.py#L32).
>
> Several parsers can be registered and will be tried in sequence. A Poppler
> and a PDFMiner parser are included by default
> (https://gitlab.com/mayan-edms/mayan-edms/blob/master/mayan/apps/ocr/parsers.py#L201).
> The PDFMiner parser could be removed if a viable, drop-in replacement that
> supports Python 3.x is not found by the next release.
>
> If the text is not being parsed, check the logs and make sure the package
> `poppler-utils` is installed. If a stable Python-only PDF text parser is
> found, these binary dependencies can be removed.
>
> On the topic of activity:
>
> The project is released free of charge with almost all rights provided to
> change and reuse the code. Expecting fast, on-point, free support in
> addition to that is unrealistic.
>
> Low participation for technical queries in forums and mailing lists is a
> common situation with open projects. Any suggestions or ideas to help
> improve on that are welcome.
>
> Bear in mind that many (if not most) subscribers to this list are not
> developers but users like yourself. Expecting professional advice from
> other users is unrealistic.
>
> Core contributors, a few developers, devops personnel, and I visit the
> list from time to time, but this is not the only task we do in the project.
> There is also backend code, API code, frontend code, deployments (Docker,
> Salt, Fabric, etc.), code testing, compatibility testing (database, Python
> versions, OS, cloud environments), documentation, translations, design
> decisions, consulting, ticket triage, support, customization, website,
> social media sites, events (DjangoCon, PyCon), etc.
> Any help in those other areas will translate into more time for us to
> answer questions on the list.
>
> There are other non-code decisions that take a lot of time to research,
> e.g., Google Groups is showing its age and there is a discussion about
> whether or not to ditch it and move to a proper (probably paid from our
> pockets) forum solution. Another matter is funding and making the project
> self-sustaining. To this end, Mayan EDMS, LLC, was created in the USA, in
> the hope that in the near future we could have paid developers working
> full time on the code and providing support, instead of just part-time
> volunteers. This means a new set of tasks, documents, and legal procedures
> that need to be taken care of.
>
> Mayan EDMS was started 6 years ago and is used by the State of California,
> the Government of Puerto Rico, The University of Montreal, Intel, with
> CEMEX and Deloitte recently joining, just to name a few
> (http://www.mayan-edms.com/cases/). It is very much alive and picking up
> steam :) For users or organizations needing a timely response from core
> contributors, be it consulting or support, paid plans are available
> (http://www.mayan-edms.com/providers/). Customization and rebranding are
> also available if needed.
>
> There are many areas that are not code related where a little help goes a
> long way. Even stuff like spell checking or just taking the time to add
> additional information on a ticket or bug report helps a lot!
>
> I appreciate your concerns and opinions about the project and hope that we
> continue sharing and discussing them.
>
> On Tuesday, April 18, 2017 at 9:45:51 AM UTC-4, MacRobb Simpson wrote:
>>
>> Here's something that /may/ help:
>>
>> In Mayan, the OCR text is located in the `ocr_documentpagecontent` table.
>> It's per page (unfortunate, but if you don't care, you might be able to
>> just shove all your OCR'd text into Page 1 of each document).
>>
>> Here's a SQL query to start with:
>> SELECT d.label,p.page_number,p.id FROM `documents_document` as d
>> inner join `documents_documentversion` as v on d.id=v.document_id
>> inner join `documents_documentpage` as p on p.document_version_id=v.id
>> WHERE 1 limit 100
>>
>> This will get you a list of document labels (you might want the ID or
>> other stuff), page numbers and unique page IDs. The unique IDs are what
>> you need to create rows in the `ocr_documentpagecontent` table.
>>
>> It may not be a perfect solution, but you can definitely rig up some
>> stuff to get what you need, supported or not!
>>
> --
>
> ---
> You received this message because you are subscribed to the Google Groups
> "Mayan EDMS" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
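
For anyone who wants to try the quoted SQL approach, here is a rough Python sketch of the idea: list page IDs with the join, then write text against page 1. It runs against a throwaway in-memory SQLite database, not a real Mayan install. The three `documents_*` tables and their join columns come from the quoted query; the `document_page_id` and `content` columns of `ocr_documentpagecontent`, the sample rows, and the helper names are my assumptions, so check your own schema (and back up) before adapting it.

```python
import sqlite3

# Same join as the quoted query, minus the MySQL-only backticks and "WHERE 1".
PAGE_QUERY = """
SELECT d.label, p.page_number, p.id
FROM documents_document AS d
INNER JOIN documents_documentversion AS v ON d.id = v.document_id
INNER JOIN documents_documentpage AS p ON p.document_version_id = v.id
ORDER BY d.id, p.page_number
LIMIT 100
"""


def build_demo_db():
    """Create a tiny in-memory stand-in for the tables named in the thread."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE documents_document (id INTEGER PRIMARY KEY, label TEXT);
        CREATE TABLE documents_documentversion (
            id INTEGER PRIMARY KEY, document_id INTEGER);
        CREATE TABLE documents_documentpage (
            id INTEGER PRIMARY KEY, document_version_id INTEGER,
            page_number INTEGER);
        -- Column names below are assumed, not confirmed by the thread.
        CREATE TABLE ocr_documentpagecontent (
            id INTEGER PRIMARY KEY, document_page_id INTEGER UNIQUE,
            content TEXT);
        INSERT INTO documents_document VALUES (1, 'invoice.pdf');
        INSERT INTO documents_documentversion VALUES (10, 1);
        INSERT INTO documents_documentpage VALUES (100, 10, 1), (101, 10, 2);
        """
    )
    return conn


def first_page_ids(conn):
    """Map each document label to the unique ID of its page 1."""
    rows = conn.execute(PAGE_QUERY).fetchall()
    return {label: page_id for label, page_number, page_id in rows
            if page_number == 1}


def store_text_on_first_page(conn, label, text):
    """Insert externally produced text as the content of a document's page 1."""
    page_id = first_page_ids(conn)[label]
    conn.execute(
        "INSERT INTO ocr_documentpagecontent (document_page_id, content) "
        "VALUES (?, ?)",
        (page_id, text),
    )
    conn.commit()
    return page_id
```

Calling `store_text_on_first_page(build_demo_db(), "invoice.pdf", "some text")` writes one row keyed by the page ID the join returned. Writing straight to the database bypasses the application, so treat this as an unsupported workaround, as the quoted message says.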
