Re: [Mayan EDMS: 1629] Re: Search for document content not OCRed within Mayan

Jesaja Everling Fri, 21 Apr 2017 02:41:48 -0700

Just to add a quick note: I'm sure there are many people who, like me, read
the mailing list but don't chime in if they don't have a useful answer to
offer for a question.


On Fri, Apr 21, 2017 at 1:42 AM, Roberto Rosario <
[email protected]> wrote:

> The OCR app will always try to parse the text of previously OCRed PDFs,
> office documents and text files before attempting the OCR step
> (https://gitlab.com/mayan-edms/mayan-edms/blob/master/mayan/apps/ocr/classes.py#L32).
>
> Several parsers can be registered and will be tried in sequence. A Poppler
> and a PDFMiner parser are included by default
> (https://gitlab.com/mayan-edms/mayan-edms/blob/master/mayan/apps/ocr/parsers.py#L201).
> The PDFMiner parser could be removed if a viable, drop-in replacement that
> supports Python 3.x is not found by the next release.
>
> If the text is not being parsed, check the logs and make sure the package
> `poppler-utils` is installed. If a stable, Python-only PDF text parser is
> found, these binary dependencies can be removed.
>
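
A minimal sketch of the "try the registered parsers in sequence, then fall
back to OCR" idea described above. This is not Mayan's actual code: the names
`poppler_parser`, `extract_text` and the `ocr` callback are hypothetical, and
the only external tool assumed is `pdftotext` from `poppler-utils`.

```python
import shutil
import subprocess
from typing import Callable, Iterable, Optional

# A parser takes a file path and returns the extracted text, or None.
Parser = Callable[[str], Optional[str]]


def poppler_parser(path: str) -> Optional[str]:
    """Extract embedded text with pdftotext (poppler-utils), if available."""
    if shutil.which("pdftotext") is None:
        return None  # poppler-utils is not installed
    result = subprocess.run(
        ["pdftotext", path, "-"],  # "-" writes the text to stdout
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


def extract_text(path: str, parsers: Iterable[Parser], ocr: Parser) -> Optional[str]:
    """Try each parser in order; run OCR only if none of them yields text."""
    for parser in parsers:
        try:
            text = parser(path)
        except Exception:
            text = None  # a failing parser should not stop the chain
        if text:
            return text
    return ocr(path)
```

With this shape, a missing `pdftotext` binary simply makes the first parser
return nothing, so the chain moves on to the next parser or to OCR.
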
> On the topic of activity:
>
> The project is released free of charge with almost all rights provided to
> change and reuse the code. Expecting fast, on-point, free support in
> addition to that is unrealistic.
>
> Low participation for technical queries in forums and mailing lists is a
> common situation with open projects. Any suggestions or ideas to help
> improve on that are welcome.
>
> Bear in mind that many (if not most) subscribers to this list are not
> developers but users like yourself. Expecting professional advice from
> other users is unrealistic.
>
> Core contributors, a few developers, devops personnel and I visit the
> list from time to time, but this is not the only task we do in the project.
> There is also backend code, API code, frontend code, deployments (Docker,
> Salt, Fabric, etc.), code testing, compatibility testing (database, Python
> versions, OS, cloud environments), documentation, translations, design
> decisions, consulting, ticket triage, support, customization, website,
> social media sites, events (DjangoCon, PyCon), etc. Any help in those other
> areas will translate into more time for us to answer questions on the list.
> There are other non-code decisions that take a lot of time to research,
> e.g. Google Groups is showing its age and there is a discussion whether or
> not to ditch it and move to a proper (probably paid from our pockets) forum
> solution. Another matter is funding and making the project self-sustaining.
> To this end, Mayan EDMS, LLC, was created in the USA, with the hope that
> in the near future we could have paid developers working full time on the
> code and providing support, instead of just part-time volunteers. This
> means a new set of tasks, documents, and legal procedures that need to be
> taken care of.
>
> Mayan EDMS was started 6 years ago and is used by the State of California,
> the Government of Puerto Rico, The University of Montreal, Intel, with
> CEMEX and Deloitte recently joining, just to name a few known names
> (http://www.mayan-edms.com/cases/). It is very much alive and picking up
> steam :) For users or organizations needing a timely response from core
> contributors, be it consulting or support, paid plans are available
> (http://www.mayan-edms.com/providers/). Customization and rebranding are
> also available if needed.
>
> There are many areas that are not code related where a little help goes a
> long way. Even stuff like spell checking or just taking the time to add
> additional information on a ticket or bug report helps a lot!
>
> I appreciate your concerns and opinions about the project and hope that we
> continue sharing and discussing them.
>
> On Tuesday, April 18, 2017 at 9:45:51 AM UTC-4, MacRobb Simpson wrote:
>>
>> Here's something that /may/ help:
>>
>> In Mayan, the OCR text is located in the `ocr_documentpagecontent` table.
>> It's per page (unfortunate, but if you don't care, you might be able to
>> just shove all your OCR'd text into Page 1 of each document).
>>
>> Here's a SQL query to start with:
>> SELECT d.label, p.page_number, p.id
>> FROM `documents_document` AS d
>> INNER JOIN `documents_documentversion` AS v ON d.id = v.document_id
>> INNER JOIN `documents_documentpage` AS p ON p.document_version_id = v.id
>> LIMIT 100;
>>
>> This will get you a list of document labels (you might want the ID or
>> other stuff), page numbers and unique page IDs. The unique page IDs are
>> what you need to create rows in the `ocr_documentpagecontent` table.
>>
>> It may not be a perfect solution, but you can definitely rig up some
>> stuff to get what you need, supported or not!
>>
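
A rough sketch of the approach suggested above: list the page IDs with the
join, then create rows in `ocr_documentpagecontent` for text produced outside
Mayan. It runs against a throwaway in-memory SQLite database with simplified
stand-in tables, not a real Mayan schema. The `document_page_id` and `content`
column names are assumptions (they are not given in this thread), and
`store_external_text` is a hypothetical helper. Check the real table layout
and back up the database before trying anything like this on a live install.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents_document (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE documents_documentversion (
        id INTEGER PRIMARY KEY, document_id INTEGER);
    CREATE TABLE documents_documentpage (
        id INTEGER PRIMARY KEY, document_version_id INTEGER, page_number INTEGER);
    -- Assumed columns; verify against the real table before writing to it.
    CREATE TABLE ocr_documentpagecontent (
        id INTEGER PRIMARY KEY, document_page_id INTEGER UNIQUE, content TEXT);

    INSERT INTO documents_document VALUES (1, 'invoice.pdf');
    INSERT INTO documents_documentversion VALUES (10, 1);
    INSERT INTO documents_documentpage VALUES (100, 10, 1), (101, 10, 2);
""")

# Same join as in the quoted query, written without MySQL-style backticks.
PAGES_SQL = """
    SELECT d.label, p.page_number, p.id
    FROM documents_document AS d
    INNER JOIN documents_documentversion AS v ON d.id = v.document_id
    INNER JOIN documents_documentpage AS p ON p.document_version_id = v.id
    ORDER BY d.label, p.page_number
    LIMIT 100
"""


def store_external_text(conn, texts_by_label):
    """Attach each document's externally produced text to its page 1."""
    stored = 0
    for label, page_number, page_id in conn.execute(PAGES_SQL).fetchall():
        if page_number == 1 and label in texts_by_label:
            conn.execute(
                "INSERT INTO ocr_documentpagecontent (document_page_id, content) "
                "VALUES (?, ?)",
                (page_id, texts_by_label[label]),
            )
            stored += 1
    conn.commit()
    return stored


stored = store_external_text(conn, {"invoice.pdf": "text from an external OCR tool"})
rows = conn.execute(
    "SELECT document_page_id, content FROM ocr_documentpagecontent"
).fetchall()
```

Note that a document with several versions has a page 1 per version, so a
real script would probably want to restrict the join to the latest version.
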
> --
>
> ---
> You received this message because you are subscribed to the Google Groups
> "Mayan EDMS" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>

