[Mayan EDMS: 1625] Re: Search for document content not OCRed within Mayan

Roberto Rosario Thu, 20 Apr 2017 15:43:09 -0700

The OCR app will always try to parse the text of previously OCRed PDFs, 
office documents and text files before attempting the OCR step 
(https://gitlab.com/mayan-edms/mayan-edms/blob/master/mayan/apps/ocr/classes.py#L32).

Several parsers can be registered and will be tried in sequence. A Poppler 
and a PDFMiner parser are included by default 
(https://gitlab.com/mayan-edms/mayan-edms/blob/master/mayan/apps/ocr/parsers.py#L201).
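
As a rough illustration of the "try the parsers in sequence, then fall back to OCR" idea, here is a minimal sketch. It is not the actual code from the OCR app: `ParserError`, `extract_text`, and the stand-in parsers and OCR function are hypothetical names made up for this example.

```python
class ParserError(Exception):
    """Raised when a parser cannot extract text from a document."""


def failing_parser(data):
    # Hypothetical stand-in for a parser whose backend is missing
    # or cannot read the file.
    raise ParserError("backend unavailable")


def plain_text_parser(data):
    # Hypothetical stand-in for a parser that succeeds when the bytes
    # decode to non-empty text.
    text = data.decode("utf-8", errors="ignore").strip()
    if not text:
        raise ParserError("no embedded text")
    return text


def fake_ocr(data):
    # Placeholder for the real (and much slower) OCR step.
    return "<text from OCR>"


def extract_text(data, parsers, ocr):
    """Try each parser in order; fall back to OCR only if all of them fail."""
    for parser in parsers:
        try:
            return parser(data)
        except ParserError:
            continue
    return ocr(data)


parsers = [failing_parser, plain_text_parser]
print(extract_text(b"hello world", parsers, fake_ocr))  # second parser succeeds
print(extract_text(b"", parsers, fake_ocr))             # every parser fails, OCR runs
```

The point of the ordering is that OCR is the expensive step, so it only runs when no registered parser can return text.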

The PDFMiner parser could be removed if a viable, drop-in replacement that 
supports Python 3.x is not found by the next release. 

If the text is not being parsed, check the logs and make sure the package 
`poppler-utils` is installed. If a stable Python-only PDF text parser is 
found, these binary dependencies can be removed.
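
A quick way to check for the binary from Python is sketched below. It assumes the Poppler parser relies on the `pdftotext` executable that `poppler-utils` ships and that it is looked up on the PATH; if your installation points to a custom location, check that path instead. `find_binary` is just a helper name for this example.

```python
import shutil


def find_binary(name):
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


path = find_binary("pdftotext")
if path is None:
    print("pdftotext not found; install the poppler-utils package")
else:
    print("pdftotext found at", path)
```

Run it as the same user and in the same environment as the Mayan workers, since the PATH may differ from your login shell.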

On the topic of activity: 

The project is released free of charge with almost all rights provided to 
change and reuse the code. Expecting fast, on-point, free support in 
addition to that is unrealistic.

Low participation for technical queries in forums and mailing lists is a 
common situation with open projects. Any suggestions or ideas to help 
improve on that are welcome.

Bear in mind that many (if not most) subscribers to this list are not 
developers but users like yourself. Expecting professional advice from 
other users is unrealistic. 

Core contributors, a few developers, devops personnel, and I visit the 
list from time to time, but this is not the only task we do in the project. 
There is also backend code, API code, frontend code, deployments (Docker, 
Salt, Fabric, etc.), code testing, compatibility testing (databases, Python 
versions, OSes, cloud environments), documentation, translations, design 
decisions, consulting, ticket triage, support, customization, the website, 
social media sites, events (DjangoCon, PyCon), etc. Any help in those other 
areas will translate into more time for us to answer questions on the list. 

There are other non-code decisions that take a lot of research time. For 
example, Google Groups is showing its age and there is a discussion about 
whether to ditch it and move to a proper (probably paid from our pockets) 
forum solution. Another matter is funding and making the project 
self-sustaining. To this end, Mayan EDMS, LLC was created in the USA, in 
the hope that in the near future we could have paid developers working full 
time on the code and providing support, instead of just part-time 
volunteers. This means a new set of tasks, documents, and legal procedures 
that need to be taken care of. 

Mayan EDMS was started 6 years ago and is used by the State of California, 
the Government of Puerto Rico, the University of Montreal, and Intel, with 
CEMEX and Deloitte recently joining, just to name a few known names 
(http://www.mayan-edms.com/cases/). It is very much alive and picking up 
steam :)

For users or organizations needing timely responses from core 
contributors, be it consulting or support, paid plans are available 
(http://www.mayan-edms.com/providers/). Customization and rebranding are 
also available if needed.

There are many areas that are not code related where a little help goes a 
long way. Even stuff like spell checking or just taking the time to add 
additional information on a ticket or bug report helps a lot!

I appreciate your concerns and opinions about the project and hope that we 
continue sharing and discussing them.

On Tuesday, April 18, 2017 at 9:45:51 AM UTC-4, MacRobb Simpson wrote:
>
> Here's something that /may/ help:
>
> In Mayan, the OCR text is located in the `ocr_documentpagecontent` table.
> It's per page (unfortunate, but if you don't care, you might be able to 
> just shove all your OCR'd text into Page 1 of each document).
>
> Here's a SQL query to start with:
> SELECT d.label, p.page_number, p.id
> FROM `documents_document` AS d
> INNER JOIN `documents_documentversion` AS v ON d.id = v.document_id
> INNER JOIN `documents_documentpage` AS p ON p.document_version_id = v.id
> LIMIT 100
>
> This will get you a list of document labels (you might want the ID or other 
> stuff), page numbers and unique page IDs. The unique IDs are what you need 
> to create rows in the `ocr_documentpagecontent` table.
>
> It may not be a perfect solution, but you can definitely rig up some stuff 
> to get what you need, supported or not!
>
