I've been thinking about splitting content extraction into two separate
areas: one for recognized text data (OCR, and barcodes in the future) and
one for embedded or parsed text (text files, PDFs with text). The idea is to
give users a better expectation of the quality of the text based on the area
or tab they access. OCR is expected to have errors; parsed text is expected to
have few or no errors.
It would be a small split in the OCR app and some minor UI changes. Does that
make sense?
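
A rough sketch of the split, just to illustrate the idea. All names here
(TextSource, PageContent, set_text, get_text) are made up for the example and
are not the actual Mayan code or API:

```python
from dataclasses import dataclass, field
from enum import Enum


class TextSource(Enum):
    # Hypothetical labels for the two areas described above.
    RECOGNIZED = "recognized"  # OCR now, barcodes later; errors are expected
    PARSED = "parsed"          # embedded text from text files or PDFs; few or no errors


@dataclass
class PageContent:
    """Hypothetical container that stores each kind of text separately."""
    texts: dict = field(default_factory=dict)

    def set_text(self, source, text):
        # Writing one source never touches the other one.
        self.texts[source] = text

    def get_text(self, source):
        # Each UI area or tab would ask for exactly one source.
        return self.texts.get(source, "")


page = PageContent()
page.set_text(TextSource.PARSED, "Invoice 1234")
page.set_text(TextSource.RECOGNIZED, "lnvoice l234")
```

Keeping the two sources under separate keys also means that re-running OCR
would only replace the recognized text and leave the parsed text alone.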
On Sunday, May 7, 2017 at 3:48:06 PM UTC-4, andre wrote:
>
> Hi Dave,
>
> I wouldn't say that my requirements for OCR are totally aligned with those
> of business users. For me, the DMS is mainly a (chaotic) storage for all the
> digital information I collect in my personal and business life. I am getting
> rid of paper as much as possible, but I am not looking to use any workflows,
> for example to manage the bills I receive. I am using the full text index to
> find everything, and that usually gives quick results when I know only a
> few search terms, like account numbers, keywords ("bill", "food", "account
> statement", ...), names and so on. And it doesn't matter at all if I spend
> two minutes instead of a few seconds searching because I have to try
> a few different things.
>
> So for me, there are three, maybe four types of "information containers"
> relevant:
>
> - digital content I have created, like office docs, emails and stuff (not
> photos, they are managed separately) - no OCR necessary
> - PDFs I receive - no OCR necessary
> - PDFs from scanned paper - always OCRed. I don't aim for 100%
> accuracy (though I would say the results are very close), but sometimes
> there are complex documents which get some "manual attention", for example
> because they are bilingual.
>
> You see, everything is about the full text index content, so I do not care
> much about other metadata. But if I invested some time in better OCR
> results, then of course I wouldn't want to see it wasted by having it
> overwritten - if your question is targeted towards my initial requests
> here. And of course in this case it's relevant to know how this
> information is treated by the DMS.
>
>
>
>
> On Thursday, May 4, 2017 at 00:14:31 UTC+2, Dave S wrote:
>>
>>
>> Hi Andre,
>>
>> I am new here, though have years of experience with supporting and
>> developing a commercial Enterprise Content Management system. I am
>> installing Mayan now, so I apologize if I am missing something that will
>> become obvious upon use, but I am curious about the need for OCR for the
>> majority of documents. Would the inclusion of (manually entered) Metadata
>> appropriate to the Document Type allow you to find the information you are
>> searching for?
>>
>> OCR is a wonderful thing and something that I enjoy working with, though
>> there can be challenges in getting the OCR'ed data (accurately) and then
>> being able to use that information in a meaningful manner. Generally,
>> unless I need to have that information and can consistently assign (some of
>> the discrete) data to the Metadata - and I can afford the processing
>> time/expense - manual indexing or reading barcodes (a whole other
>> discussion! :-) ) meets 90+% of my needs.
>>
>> Perhaps once I start playing my question will answer itself, and I
>> certainly don't mean any offense, but I am interested in how people are
>> using the OCR'ed information (and, relatedly, whether it has been found to
>> be accurate the vast majority - 95+% - of the time).
>>
>> Thanks!
>>
>> dave
>>
>>