Re: [Mayan EDMS: 816] Automatic upload from certain staging folder

Roberto Rosario Wed, 03 Sep 2014 11:48:14 -0700

I like the barcode/qrcode idea very much; it would allow for batch scanning, 
for example several documents placed in a scanner with a document feeder, 
where each document has a printed page with a barcode defining the metadata, 
kind of like FAX cover pages. Regional OCR is a must-have feature and 
usually a defining feature of the commercial offerings. I don't know how 
accurate OCRing a rectangle of text would be, but if there is a need for the 
feature let's do it. We need a way to let users mark/highlight the fields 
they want scanned and entered as metadata. This would require some design 
decisions (do we store the cursor's x and y positions of the square to be 
scanned or the x and y % in relation to the current zoom level) and a rich 
client w/ corresponding API endpoints to talk to the backend.


On Wednesday, August 27, 2014 5:22:32 PM UTC-4, Mathias Behrle wrote:
>
> * Roberto Rosario: " Re: [Mayan EDMS: 761] Automatic upload from certain 
>   staging folder" (Wed, 30 Jul 2014 13:36:51 -0400): 
>
> > This feature was actually started some time ago 
> > (https://github.com/mayan-edms/mayan-edms/blob/master/mayan/apps/sources/models.py#L194) 
> > but is not yet enabled because it depends on some scheduling updates that 
> > have not made it into the master branch. 
> > 
> > As for metadata, I came up with some ideas but none are implemented. One 
> > was to let users set default metadata values as well as document type for 
> > each watch folder. Another idea was when a document is being imported from 
> > a watch folder to look for a file with the same name but with the .metadata 
> > extension. No design decision has been reached yet so any ideas are 
> > welcomed. 
>
> Both possibilities could have their individual use cases, for which they fit 
> best. The most flexible approach is the second. 
>
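To make the second idea (the .metadata file) concrete, here is a minimal sketch. Since no design decision has been reached, everything in it is an assumption: the sidecar is taken to be a JSON object of name/value pairs, it is taken to replace the document's extension (scan.pdf -> scan.metadata), and load_sidecar_metadata is a hypothetical helper, not existing Mayan code.

```python
import json
from pathlib import Path
from typing import Dict


def load_sidecar_metadata(document_path: Path) -> Dict[str, str]:
    """Return metadata from a sidecar file next to the document, if any.

    Assumption: "scan.pdf" is described by "scan.metadata" in the same
    folder, and the sidecar holds a JSON object of name/value pairs.
    """
    sidecar = document_path.with_suffix(".metadata")
    if not sidecar.is_file():
        # No sidecar: the document is imported without metadata.
        return {}
    with sidecar.open(encoding="utf-8") as handle:
        return json.load(handle)
```

A watch folder importer could call this for each new file and fall back to the folder's default values when the result is empty.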
> What I found when evaluating other DMS software: 
>
> - Inclusion of some identifier on the document (could be a barcode, or some 
>   special formatted string, or...). This identifier must not necessarily be 
>   fixed on the document, but could be the first page of a scan or some paper 
>   scanned together with the document. This method applies preferably to 
>   scanned documents. 
>
>
I like the barcode/qrcode idea very much; it would allow for batch scanning, 
for example several documents placed in a scanner with a document feeder, 
where each document has a printed page with a barcode defining the metadata, 
kind of like FAX cover pages. 
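
A rough sketch of how the batch could be split, assuming the barcode on each page has already been decoded by some other step. The function name split_batch and the input shape (one decoded payload or None per page) are made up for illustration; nothing like this exists in the code yet.

```python
from typing import Dict, List, Optional


def split_batch(page_barcodes: List[Optional[str]]) -> List[Dict]:
    """Group scanned pages into documents, starting a new one at each cover page.

    page_barcodes[i] is the decoded barcode payload of page i, or None when
    the page carries no barcode. Decoding itself is out of scope here.
    """
    documents: List[Dict] = []
    for page_number, payload in enumerate(page_barcodes):
        if payload is not None:
            # A cover page opens a new document and is not kept as content.
            documents.append({"metadata": payload, "pages": []})
        elif documents:
            documents[-1]["pages"].append(page_number)
        else:
            # Pages before the first cover page have no metadata yet.
            documents.append({"metadata": None, "pages": [page_number]})
    return documents
```

For a five-page scan with cover pages at positions 0 and 3, this yields two documents holding pages 1-2 and page 4.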
 

> - Rather straightforward is a sort of recognition, where templates can be 
>   defined containing regions formatted in an individual way. E.g. if you have 
>   a supplier with his custom invoice format displaying the invoice number, 
>   date, amount at fixed places, they could be used on such a template and the 
>   software can check, if the document contains such a region. 
>


Regional OCR is a must-have feature and usually a defining feature of the 
commercial offerings. I don't know how accurate OCRing a rectangle of text 
would be, but if there is a need for the feature let's do it. I see some 
requirements: we need a way to let users mark/highlight the fields they 
want scanned and entered as metadata. This would require some design 
decisions (do we store the cursor's x and y positions of the square to be 
scanned or the x and y % in relation to the current zoom level) and a rich 
client w/ corresponding API endpoints to talk to the backend.
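
To illustrate the percentage option, here is a small sketch that stores the rectangle as fractions of the page size, so the same region can be mapped back to pixels at any zoom level or scan resolution. This is only one reading of the "x and y %" idea; the Region class and its method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Region:
    """A rectangle stored as fractions (0.0 to 1.0) of the page width and height."""
    left: float
    top: float
    width: float
    height: float

    @classmethod
    def from_pixels(cls, x: int, y: int, w: int, h: int,
                    page_width: int, page_height: int) -> "Region":
        # Convert what the user drew at the current zoom into fractions.
        return cls(x / page_width, y / page_height,
                   w / page_width, h / page_height)

    def to_pixels(self, page_width: int, page_height: int) -> Tuple[int, int, int, int]:
        # Map the stored fractions onto a page rendered at any size.
        return (round(self.left * page_width), round(self.top * page_height),
                round(self.width * page_width), round(self.height * page_height))
```

A region drawn on an 800x1000 preview can then be applied to the 1600x2000 image that is actually sent to OCR by calling to_pixels(1600, 2000).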
 

>
>   Perhaps this could be used slightly modified but simpler by defining 
>   string patterns, that could be matched on the OCR result. So at last 
>   repeating patterns could be used to extract metadata. 
>
>
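The string pattern idea could look roughly like this: a set of regular expressions per template, run against the OCR text. The patterns, the metadata names and the extract_metadata helper are made-up examples for one imaginary supplier, not an existing feature.

```python
import re
from typing import Dict, Optional, Pattern

# Hypothetical patterns for one supplier's invoices; each pattern must
# define a named group called "value".
PATTERNS: Dict[str, Pattern[str]] = {
    "invoice_number": re.compile(
        r"Invoice\s+No\.?\s*:?\s*(?P<value>[A-Z0-9-]+)", re.IGNORECASE),
    "invoice_date": re.compile(
        r"Date\s*:?\s*(?P<value>\d{4}-\d{2}-\d{2})", re.IGNORECASE),
}


def extract_metadata(ocr_text: str,
                     patterns: Dict[str, Pattern[str]] = PATTERNS
                     ) -> Dict[str, Optional[str]]:
    """Return the first match of each pattern, or None where nothing matched."""
    result: Dict[str, Optional[str]] = {}
    for name, pattern in patterns.items():
        match = pattern.search(ocr_text)
        result[name] = match.group("value") if match else None
    return result
```

Fields that come back as None would be the ones left for manual completion.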
> In any case I would find useful a document queue containing documents 
> already processed (OCR available), but still to be completed with metadata. 
> So to speak an inversion of the current workflow (where metadata are defined 
> first). 
> As already discussed (in https://github.com/mayan-edms/mayan-edms/issues/9) 
> I think it would be best for the manual completion of metadata to have a 
> view of the document together with its OCR data available directly on the 
> metadata form. 
>
>
 

> I am imagining a staging folder, from which the documents are processed 
> immediately. If after the initial processing no metadata are available 
> for the document, it is added to the postprocessing queue. When finally 
> (manually) processed, those documents are removed from the queue. 
>
> There should be some configuration options: 
> - Which metadata are required to be filled for a document to be able to 
>   leave the queue? 
> - Should only documents missing the required metadata be added to the queue 
>   or just all (if postprocessing control for all processed documents is 
>   desired)? 
>
>
I'm still wrapping my head around the post-processing queue: do we create a 
post-processing queue attached to or detached from the watch folder?
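
Either way, the two configuration options could boil down to a check like the one below. This is only a sketch: the function names, the queue_all flag and the idea of passing metadata as a plain dict are assumptions for the sake of discussion.

```python
from typing import Dict, Iterable, List, Optional


def missing_metadata(metadata: Dict[str, Optional[str]],
                     required: Iterable[str]) -> List[str]:
    """Return the required metadata names that are absent or empty."""
    return [name for name in required if not metadata.get(name)]


def needs_post_processing(metadata: Dict[str, Optional[str]],
                          required: Iterable[str],
                          queue_all: bool = False) -> bool:
    """Decide whether a freshly imported document goes to the queue.

    queue_all=True sends every document to the queue for review; otherwise
    only documents missing required metadata are queued.
    """
    return queue_all or bool(missing_metadata(metadata, required))
```

A document would leave the queue once missing_metadata returns an empty list for it.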

 

> So far my brainstorming at this very moment, comments as always very 
> welcome. 
>
>
Thanks a lot Mathias!
 

> > On Jul 30, 2014 12:02 PM, "Joshua Jonah" <[email protected]> wrote: 
> > 
> > > This would be a specific directory only containing a specific type of 
> > > file. 
> > > On Jul 30, 2014 12:00 PM, "Michel Lavoie" <[email protected]> wrote: 
> > > 
> > >> It seems like a good idea, but how would you handle metadata? I thought 
> > >> about using a script to automate uploads for a specific documents but 
> > >> I'm afraid that poorly documented files would render my collection 
> > >> useless in the end. 
>   
>
>

