Having a flag to differentiate between plain OCR text and hOCR is a good idea. Now that the default OCR has been updated to use PyOCR (which exposes hOCR), this could be possible in the future.
https://gitlab.com/mayan-edms/mayan-edms/commit/6bfdb053e3abec87aa55c987e5a13a72514ee682

On Friday, December 30, 2016 at 1:48:34 PM UTC-4, Matthias Löblich wrote:
>
> Hi,
> I am also looking for similar features. At the moment I am using a
> workaround that is not bulletproof: regular expressions that identify
> specific documents, in a Mayan app I wrote:
>
> https://gitlab.com/mayan-edms/document_analyzer
>
> @Roberto:
> One thing could improve the identification: storing the hOCR data provided
> by Tesseract and not only the plain text.
> hOCR also includes layout information, so it could be possible to combine
> the regex search with a layout "query" based on the hOCR data.
> What do you think about extending the OCR app model DocumentPageContent
> with a flag indicating whether the content is plain text or hOCR?
> If the content is hOCR, there should be an hOCR parser extracting the plain
> text, so the new format does not impact the other parts of Mayan.
> I would be happy to support the development to extend the OCR app in this
> direction.
>
> br
> Matthias
> PS: Features like this could be possible by storing the hOCR data:
> https://github.com/shsdev/hocr-parser-hadoopjob
>
>
> On Friday, December 16, 2016 at 10:36:25 UTC+1, [email protected] wrote:
>>
>> Hello,
>> I'm looking for a program that can read a document and extract
>> information from it.
>> For example, I receive a bill from Apple (the program would recognize it
>> because I would have defined that Apple and its address appear in a
>> certain region, and also defined the places that indicate, for Apple,
>> that it is a bill), and I would like to extract from it, for example, the
>> bill number (which should always be in the same place) and the total price
>> of the bill (whose position differs depending on the number of articles I
>> ordered).
>>
>> Unfortunately, I couldn't find the technical term to search for on the web.
>> What is this called? Is this possible with Mayan EDMS?
>>
>> Thank you in advance for replying; I wish you a good day,
>>
>> Cheers,
>>
>> Sam
>>
>
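
To make the hOCR suggestion in the quoted message a bit more concrete, here is a minimal sketch of what an hOCR parser could look like: it pulls the words and their bounding boxes out of hOCR markup, rebuilds the plain text, and allows a simple layout "query" (words inside a page region) that can then be combined with a regex. This is not Mayan's API. The names HocrWordParser, parse_hocr, plain_text and words_in_region are hypothetical, the sample markup is invented for illustration, and it assumes Tesseract-style hOCR where each word is a span with class ocrx_word and a title attribute containing "bbox x0 y0 x1 y1".

```python
import re
from html.parser import HTMLParser

# hOCR stores layout in the title attribute, e.g. "bbox 50 40 150 70; x_wconf 96".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs for every element with class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []        # list of (text, (x0, y0, x1, y1) or None)
        self._in_word = False  # True while inside an ocrx_word element
        self._depth = 0        # nesting depth of tags inside the current word
        self._bbox = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if self._in_word:
            # Inline markup inside a word (e.g. <strong>): just track nesting.
            self._depth += 1
            return
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            self._bbox = tuple(int(v) for v in match.groups()) if match else None
            self._in_word = True
            self._depth = 0
            self._chunks = []

    def handle_endtag(self, tag):
        if not self._in_word:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        text = "".join(self._chunks).strip()
        if text:
            self.words.append((text, self._bbox))
        self._in_word = False

    def handle_data(self, data):
        if self._in_word:
            self._chunks.append(data)


def parse_hocr(hocr):
    """Return the list of (text, bbox) word entries found in an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


def plain_text(words):
    """Rebuild plain text so code expecting plain OCR output keeps working."""
    return " ".join(text for text, _ in words)


def words_in_region(words, region):
    """Layout query: words whose bbox lies entirely inside (x0, y0, x1, y1)."""
    rx0, ry0, rx1, ry1 = region
    return [
        text
        for text, bbox in words
        if bbox and bbox[0] >= rx0 and bbox[1] >= ry0
        and bbox[2] <= rx1 and bbox[3] <= ry1
    ]


# Invented sample, shaped like Tesseract hOCR output.
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 1000 1400'>
 <span class='ocr_line' title='bbox 50 40 400 70'>
  <span class='ocrx_word' title='bbox 50 40 150 70; x_wconf 96'>Apple</span>
  <span class='ocrx_word' title='bbox 160 40 260 70; x_wconf 95'>Invoice</span>
 </span>
 <span class='ocr_line' title='bbox 600 40 950 70'>
  <span class='ocrx_word' title='bbox 600 40 700 70; x_wconf 93'>No.</span>
  <span class='ocrx_word' title='bbox 710 40 900 70; x_wconf 91'>INV-12345</span>
 </span>
</div>
"""

if __name__ == "__main__":
    words = parse_hocr(SAMPLE_HOCR)
    print(plain_text(words))
    # Regex search restricted to the top-right region of the page.
    top_right = " ".join(words_in_region(words, (550, 0, 1000, 100)))
    print(re.search(r"INV-\d+", top_right).group(0))
```

With something along these lines, a page stored as hOCR could still feed the existing plain-text search through plain_text(), while an app such as document_analyzer could limit its regular expressions to a region of the page, for example to look for the bill number only where a given supplier prints it.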
