Having a flag to differentiate between plain OCR text and hOCR is a good idea.
Now that the default OCR has been updated to use PyOCR (which exposes hOCR),
this could be possible in the future.

https://gitlab.com/mayan-edms/mayan-edms/commit/6bfdb053e3abec87aa55c987e5a13a72514ee682
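
To make the idea discussed below more concrete, here is a minimal sketch of what an hOCR parser with a layout "query" might look like. It is not Mayan EDMS code: the names HOCRWordParser, parse_hocr, plain_text and find_in_region are hypothetical, the sample markup is made up, and it assumes the hOCR marks each word as a span with class ocrx_word and a "bbox x0 y0 x1 y1" entry in its title attribute, as Tesseract's hOCR output does. It uses only the Python standard library.

```python
import re
from html.parser import HTMLParser


class HOCRWordParser(HTMLParser):
    """Collect (text, bbox) pairs from ocrx_word spans (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.words = []      # list of (text, (x0, y0, x1, y1))
        self._bbox = None    # bbox of the word span currently open, if any
        self._chunks = []    # text fragments seen inside that span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if tag == 'span' and 'ocrx_word' in classes:
            match = re.search(
                r'bbox (\d+) (\d+) (\d+) (\d+)', attrs.get('title') or ''
            )
            if match:
                self._bbox = tuple(int(value) for value in match.groups())
                self._chunks = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'span' and self._bbox is not None:
            text = ''.join(self._chunks).strip()
            if text:
                self.words.append((text, self._bbox))
            self._bbox = None


def parse_hocr(hocr):
    """Return a list of (text, bbox) tuples for every word in the hOCR."""
    parser = HOCRWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


def plain_text(words):
    """Plain-text view, so code that expects plain OCR text keeps working."""
    return ' '.join(text for text, _ in words)


def find_in_region(words, pattern, region):
    """Words fully matching `pattern` whose bbox lies inside `region`."""
    rx0, ry0, rx1, ry1 = region
    regex = re.compile(pattern)
    return [
        text for text, (x0, y0, x1, y1) in words
        if x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1
        and regex.fullmatch(text)
    ]


# Made-up sample page: sender at top left, invoice number at top right.
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 1000 1400'>
  <span class='ocr_line' title='bbox 50 40 400 70'>
    <span class='ocrx_word' title='bbox 50 40 130 70; x_wconf 96'>Apple</span>
    <span class='ocrx_word' title='bbox 140 40 230 70; x_wconf 94'>Invoice</span>
  </span>
  <span class='ocr_line' title='bbox 700 40 950 70'>
    <span class='ocrx_word' title='bbox 700 40 760 70; x_wconf 91'>No.</span>
    <span class='ocrx_word' title='bbox 770 40 950 70; x_wconf 93'>INV-20161216</span>
  </span>
</div>
"""

if __name__ == '__main__':
    words = parse_hocr(SAMPLE_HOCR)
    print(plain_text(words))
    # Regex search restricted to the top-right corner of the page.
    print(find_in_region(words, r'INV-\d+', (600, 0, 1000, 100)))
```

The plain-text helper is what would keep the rest of the system unaffected by the new format, while find_in_region shows how a regex could be combined with a region of the page. Matching word by word is a simplification; multi-word patterns would need line-level grouping.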

On Friday, December 30, 2016 at 1:48:34 PM UTC-4, Matthias Löblich wrote:
>
> Hi,
> I am also looking for similar features. At the moment I am using a "not
> bulletproof" workaround based on regular expressions
> to identify specific documents, using a Mayan app I wrote:
>
> https://gitlab.com/mayan-edms/document_analyzer
>
> @Roberto: 
> One thing could improve the identification: storing the hOCR data provided
> by Tesseract, and not only the plain text.
> hOCR also includes layout information, so it could be possible to combine
> the regex search with a layout "query" based on the hOCR data.
> What do you think about extending the OCR app model DocumentPageContent
> with a flag indicating whether the content is plain text or hOCR?
> If the content is hOCR, there should be an hOCR parser extracting the plain
> text, so the new format does not impact the other parts of Mayan.
> I would be happy to support the development to extend the OCR app in this
> direction.
>
> br
> Matthias
> PS: Storing the hOCR data could make features like this possible:
> https://github.com/shsdev/hocr-parser-hadoopjob
>
>
> On Friday, December 16, 2016 at 10:36:25 AM UTC+1, [email protected] wrote:
>>
>> Hello,
>> I'm looking for a program that can read a document and extract
>> information from it.
>> For example, I receive a bill from Apple. The program would recognize it
>> because I would have defined that Apple's name and address appear in a
>> given region, along with the places that identify the document as a bill.
>> I would then like to extract, for example, the bill number
>> (which should always be in the same place) and the total price of the bill
>> (whose position differs depending on the number of articles I ordered).
>>
>> Unfortunately, I could not find the technical term for this on the web.
>> What is this called? Is this possible with Mayan EDMS?
>>
>> Thank you in advance for replying, and have a good day,
>>
>> Cheers,
>>
>> Sam
>>
>
