Hi,
I am also looking for similar features. At the moment I am using a "not 
bulletproof" workaround based on regular expressions 
to identify specific documents, using a Mayan app I wrote: 

https://gitlab.com/mayan-edms/document_analyzer
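
For illustration, a minimal sketch of the kind of regex-based matching this 
boils down to (the rule names and patterns below are made-up placeholders, 
not the actual code from the repository):

import re

# Hypothetical identification rules: document type -> list of regex patterns.
# These are placeholders for illustration only.
RULES = {
    'apple_invoice': [
        r'Apple Distribution International',
        r'Invoice\s+Number:\s*(?P<invoice_number>\S+)',
    ],
}

def identify(ocr_text):
    """Return (document_type, captured fields) for the first rule set whose
    patterns all match the OCR plain text, or (None, {}) if nothing matches."""
    for doc_type, patterns in RULES.items():
        fields = {}
        for pattern in patterns:
            match = re.search(pattern, ocr_text)
            if match is None:
                break
            fields.update(match.groupdict())
        else:
            return doc_type, fields
    return None, {}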

@Roberto: 
One thing that could improve the identification: storing the hOCR data provided 
by Tesseract and not only the plain text. 
hOCR also includes layout information, so it should be possible to combine 
the regex search with a layout "query" based on the hOCR data. 
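
Just to illustrate what such a layout "query" could look like (this is only a 
sketch using Python's standard HTML parser; the region coordinates and helper 
names are made up):

import re
from html.parser import HTMLParser

class HOCRWords(HTMLParser):
    """Collect (text, bbox) pairs from hOCR 'ocrx_word' spans; the bounding
    box comes from the title attribute, e.g. title="bbox 100 200 180 230"."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if 'ocrx_word' in (attrs.get('class') or ''):
            match = re.search(r'bbox (\d+) (\d+) (\d+) (\d+)',
                              attrs.get('title') or '')
            self._bbox = tuple(map(int, match.groups())) if match else None

    def handle_data(self, data):
        if self._bbox and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None

def words_in_region(hocr, x0, y0, x1, y1):
    """Return the words whose bounding boxes lie inside the given page region,
    so a regex can then be applied to that part of the page only."""
    parser = HOCRWords()
    parser.feed(hocr)
    return ' '.join(
        text for text, (wx0, wy0, wx1, wy1) in parser.words
        if wx0 >= x0 and wy0 >= y0 and wx1 <= x1 and wy1 <= y1
    )

# Example: only look for an invoice number in the top-right corner of the page.
# re.search(r'Invoice\s+Number:\s*(\S+)',
#           words_in_region(hocr_data, 1200, 0, 2400, 600))
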
What do you think about extending the OCR app model DocumentPageContent 
with a flag indicating whether the content is plain text or hOCR? 
If the content is hOCR, there should be an hOCR parser that extracts the plain 
text, so the new format does not impact the other parts of Mayan.
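
To make that concrete, here is a rough sketch of what the extension could look 
like (the field names, choices and the helper are only my assumptions, not 
existing Mayan code, and the existing fields of the model are omitted):

import re
from django.db import models

CONTENT_TYPE_TEXT = 'text'
CONTENT_TYPE_HOCR = 'hocr'

class DocumentPageContent(models.Model):
    # Existing fields (e.g. the relation to the document page) omitted here.
    content = models.TextField(blank=True)
    # New flag: does `content` hold plain OCR text or the raw hOCR output?
    content_type = models.CharField(
        max_length=8,
        choices=(
            (CONTENT_TYPE_TEXT, 'Plain text'),
            (CONTENT_TYPE_HOCR, 'hOCR'),
        ),
        default=CONTENT_TYPE_TEXT,
    )

    @property
    def plain_text(self):
        """Other parts of Mayan keep reading plain text; when the stored
        content is hOCR it is converted on the fly by naively stripping
        the markup."""
        if self.content_type == CONTENT_TYPE_HOCR:
            return re.sub(r'<[^>]+>', ' ', self.content)
        return self.content
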
I would be happy to support the development of extending the OCR app in this 
direction.

br
Matthias
PS: Features like these could become possible by storing the hOCR data: 
https://github.com/shsdev/hocr-parser-hadoopjob


On Friday, December 16, 2016 at 10:36:25 AM UTC+1, [email protected] wrote:
>
> Hello,
> I'm looking for a program that could read a document and extract 
> information from it. 
> For example, I receive a bill from Apple (the program would recognize it 
> because I would have defined that Apple, with its address, appears in a 
> certain region, and also defined the places that identify, for Apple, that 
> the document is a bill), and I would like to extract from it, for example, 
> the bill number (which should always be in the same place) and the total 
> price of the bill (whose place differs depending on the number of articles 
> I ordered).
>
> Unfortunately, I couldn't find the technical term for this on the web. 
> What is this called? Is it possible with Mayan EDMS? 
>
> Thank you in advance for your reply, and I wish you a good day,
>
> Cheers,
>
> Sam
>
