Re: [Mayan EDMS: 836] Automatic upload from certain staging folder

Roberto Rosario Fri, 05 Sep 2014 23:38:16 -0700


On Wednesday, September 3, 2014 6:48:15 PM UTC-4, Mathias Behrle wrote:
>
> * Roberto Rosario: " Re: [Mayan EDMS: 816] Automatic upload from certain 
>   staging folder" (Wed, 3 Sep 2014 11:47:52 -0700 (PDT)): 
>
> > I like the barcode/qrcode idea very much, would allow for batch scanning,
> > for example several documents placed in a scanner with a document feeder
> > and each document has a printed page with a barcode defining the metadata
> > kind of like FAX cover pages. Regional OCR is a must have feature and
> > usually a defining feature of the commercial offerings, I don't know how
> > accurate OCRing a rectangle of text would be but if there is a need for the
> > feature let's do it. We need a way to let users mark/highlight the fields
> > they want scanned and entered as metadata. This would require some design
> > decisions (do we store the cursor's x and y positions of the square to be
> > scanned or the x and y % in relation to the current zoom level) and a rich
> > client w/ corresponding API endpoints to talk to the backend.
> > 
> > On Wednesday, August 27, 2014 5:22:32 PM UTC-4, Mathias Behrle wrote: 
> > > 
> > > * Roberto Rosario: " Re: [Mayan EDMS: 761] Automatic upload from certain
> > >   staging folder" (Wed, 30 Jul 2014 13:36:51 -0400):
> > >
> > > > This feature was actually started some time ago
> > > > (https://github.com/mayan-edms/mayan-edms/blob/master/mayan/apps/sources/models.py#L194)
> > > > but is not yet enabled because it depends on some scheduling updates
> > > > that have not made it into the master branch.
> > > > 
> > > > As for metadata, I came up with some ideas but none are implemented. One
> > > > was to let users set default metadata values as well as document type for
> > > > each watch folder. Another idea was when a document is being imported from
> > > > a watch folder to look for a file with the same name but with the .metadata
> > > > extension. No design decision has been reached yet so any ideas are
> > > > welcomed.
> > >
> > > Both possibilities could have their individual use cases, for which they
> > > fit best. The most flexible approach is the second.
> > > 
> > > What I found when evaluating other DMS software: 
> > > 
> > > - Inclusion of some identifier on the document (could be a barcode, or some
> > >   special formatted string, or...). This identifier must not necessarily be
> > >   fixed on the document, but could be the first page of a scan or some paper
> > >   scanned together with the document. This method applies preferably to
> > >   scanned documents.
> > > 
> > > 
> > I like the barcode/qrcode idea very much, would allow for batch scanning,
> > for example several documents placed in a scanner with a document feeder
> > and each document has a printed page with a barcode defining the metadata
> > kind of like FAX cover pages.
>
> Yes, the comparison with the Fax cover page hits the mark. 
>
> Question:
> When batch scanning, how to determine the beginning and the end of a batch?
> Will each document require a 'cover page' or can such a cover page be valid
> for several documents? Perhaps the number of documents could be included on
> the cover page, but this would always require a new cover page per batch.
>

I don't think we would need to specify the page count. We can come up with some
base codes that are encoded into a QR code and printed as a cover page. When
Mayan detects the cover page, all documents or pages detected afterwards
inherit whatever metadata, document type or other setting is specified in the
cover page. If another cover page is detected, Mayan knows this is the
beginning of a new document or documents. Example:

* A 'set metadata' cover page with some values encoded: vendor="vendor 1"
* A 'set metadata' cover page with some values encoded: vendor="vendor 2"
* A 'new document' cover page

The physical document paper sandwich would be:

- Set metadata cover, vendor 1
- New document cover page
- Document 1 page 1
- Document 1 page 2
- New document cover page
- Document 2 page 1
- Document 2 page 2
- Set metadata cover, vendor 2
- New document cover page
- Document 3 page 1
- Document 3 page 2

All of this is scanned in one go using a paper feeder, and we have just scanned
and pushed into Mayan 3 multipage documents, 2 of them using the same
metadata and one with different metadata. We can create more 'control
messages' as new cover page types as we need them along the road, and so cover
several user scenarios. We can create an 'End document' cover page if
needed. The cover page is just a blank page with a QR code. Control cover
pages can be physically reused or photocopied. Only cover pages with dynamic
user data, like the set metadata cover page, would need to be printed more
than once if the metadata changes, but if the metadata recurs periodically,
like say vendor names, they can also be reused.
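
To make the idea concrete, here is a rough sketch of how the scanned stack
could be split. Nothing like this exists in Mayan yet: the Page and Document
classes, the 'set_metadata' / 'new_document' control types and the
split_batch() function are all made-up names for illustration, and decoding
the QR code is assumed to have happened already.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Page:
    """One scanned sheet (hypothetical type, not a Mayan model)."""
    image: str                       # stand-in for the scanned image
    control: Optional[dict] = None   # decoded QR payload, None for normal pages


@dataclass
class Document:
    """A document assembled from the batch (hypothetical type)."""
    metadata: dict
    pages: list = field(default_factory=list)


def split_batch(pages):
    """Group a scanned batch into documents using control cover pages."""
    documents = []
    inherited = {}   # metadata set by the most recent 'set_metadata' cover
    current = None   # document currently receiving pages
    for page in pages:
        if page.control is None:
            # Ordinary page: goes into the document opened last.
            if current is None:
                current = Document(metadata=dict(inherited))
                documents.append(current)
            current.pages.append(page.image)
        elif page.control["type"] == "set_metadata":
            # Applies to every document started after this cover page.
            inherited = dict(page.control["values"])
            current = None
        elif page.control["type"] == "new_document":
            current = Document(metadata=dict(inherited))
            documents.append(current)
    # Two covers in a row would leave an empty document behind; drop those.
    return [doc for doc in documents if doc.pages]


def cover(kind, **values):
    """Build a control cover page with an already decoded payload."""
    return Page("cover", {"type": kind, "values": values})


# The paper sandwich from the example above.
batch = [
    cover("set_metadata", vendor="vendor 1"),
    cover("new_document"),
    Page("doc1-p1"), Page("doc1-p2"),
    cover("new_document"),
    Page("doc2-p1"), Page("doc2-p2"),
    cover("set_metadata", vendor="vendor 2"),
    cover("new_document"),
    Page("doc3-p1"), Page("doc3-p2"),
]
documents = split_batch(batch)
for doc in documents:
    print(doc.metadata, doc.pages)
```

An 'End document' cover page would just be one more branch in the loop, which
is why I think control pages scale better than encoding a page count.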
 

> > > - Rather straightforward is a sort of recognition, where templates can be
> > >   defined containing regions formatted in an individual way. E.g. if you have
> > >   a supplier with his custom invoice format displaying the invoice number,
> > >   date, amount at fixed places, they could be used on such a template and the
> > >   software can check, if the document contains such a region.
> > > 
> > 
> > 
> > Regional OCR is a must have feature and usually a defining feature of the
> > commercial offerings, I don't know how accurate OCRing a rectangle of text
> > would be but if there is a need for the feature let's do it. I see some
> > requirements, we need a way to let users mark/highlight the fields they
> > want scanned and entered as metadata. This would require some design
> > decisions (do we store the cursor's x and y positions of the square to be
> > scanned or the x and y % in relation to the current zoom level)
>
> The more agnostic of the zoom level, the better. So I would think x and y in
> relation to X and Y (where X and Y are the dimensions of the whole page).
>
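
A minimal sketch of the page-relative approach, assuming the region is stored
as fractions of the page width and height. The function names to_relative()
and to_pixels() are hypothetical, nothing in Mayan uses them today:

```python
def to_relative(x, y, width, height, page_width, page_height):
    """Convert a pixel rectangle drawn on the displayed page into
    fractions (0..1) of the page, so it no longer depends on zoom."""
    return (x / page_width, y / page_height,
            width / page_width, height / page_height)


def to_pixels(region, page_width, page_height):
    """Map a stored fractional region onto an image of any resolution."""
    rx, ry, rw, rh = region
    return (round(rx * page_width), round(ry * page_height),
            round(rw * page_width), round(rh * page_height))


# Region marked by the user on an 800x1000 pixel preview of the page...
region = to_relative(100, 50, 200, 40, page_width=800, page_height=1000)
# ...applied to the 2400x3000 pixel image that is actually sent to OCR.
ocr_box = to_pixels(region, page_width=2400, page_height=3000)
print(region, ocr_box)
```

The stored values stay valid whatever zoom level the user had when drawing
the rectangle, as long as the page proportions are the same.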
 

> > and a rich client w/ corresponding API endpoints to talk to the backend. 
>
> Do you mean, a separate client is needed for that purpose? 
>

I meant that some interactive JavaScript/jQuery code will be needed on the
template, sorry about the confusing wording :)
 

>   
> > >   Perhaps this could be used slightly modified but simpler by defining 
> > >   string patterns, that could be matched on the OCR result. So at last 
> > >   repeating patterns could be used to extract metadata. 
> > > 
> > > 
> > > In any case I would find useful a document queue containing documents
> > > already processed (OCR available), but still to be completed with
> > > metadata. So to speak an inversion of the current workflow (where metadata
> > > are defined first).
> > > As already discussed in
> > > (https://github.com/mayan-edms/mayan-edms/issues/9) I think it would be
> > > best for the manual completion of metadata to have a view of the document
> > > together with its OCR data available directly on the metadata form.
> > > 
> > > 
> >   
> > 
> > > I am imagining a staging folder, from which the documents are processed
> > > immediately. If after the initial processing no metadata are available
> > > for the document, it is added to the postprocessing queue. When finally
> > > (manually) processed, those documents are removed from the queue.
> > >
> > > There should be some configuration options:
> > > - Which metadata are required to be filled for a document to be able to
> > >   leave the queue?
> > > - Should only documents missing the required metadata be added to the
> > >   queue or just all (if postprocessing control for all processed documents
> > >   is desired)?
> > > 
> > > 
> > I'm still wrapping my head around the post processing queue, do we create a
> > post processing attached or detached from the watch folder?
>
> Hopefully I understand your question. My vision is: 
>
> - process all documents in the watch folder: 
>   i.e. 
>   - do OCR 
>   - include the document in the database 
>   - add the document to the post processing queue 
>   - delete the document from the watch folder 
>
> so that all documents are already available in the system, but should/can be
> post-processed to control/add metadata.
>
> So IIUC the question, the queue is independent from the watch folder.
>
> Probably we should also add an option when uploading via API, if the document
> should be added to this queue.
>
> Nice would be a configuration option to skip inclusion in the post-processing
> queue, if certain metadata are registered (e.g. specific metadata fields
> contain values).
>
>
> -- 
>
>     Mathias Behrle 
>     PGP/GnuPG key available from any keyserver, ID: 0x8405BBF6
>
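
The queue rules above could boil down to something like the following. This
is only a sketch under assumptions: needs_postprocessing(), required_fields
and queue_all are invented names standing in for the configuration options
being discussed, and metadata is taken to be a plain dict.

```python
def needs_postprocessing(metadata, required_fields, queue_all=False):
    """Decide whether a freshly imported document goes into the queue.

    metadata        -- dict of metadata already found for the document
    required_fields -- names that must hold a value to skip/leave the queue
    queue_all       -- if True, every document is queued for manual review
    """
    if queue_all:
        return True
    # Queue the document when any required field is missing or empty.
    return any(not metadata.get(name) for name in required_fields)


required = ["vendor", "invoice_number"]
complete = {"vendor": "vendor 1", "invoice_number": "2014-001"}
partial = {"vendor": "vendor 1"}

print(needs_postprocessing(complete, required))                  # skips the queue
print(needs_postprocessing(partial, required))                   # queued
print(needs_postprocessing(complete, required, queue_all=True))  # queued anyway
```

The same check, run again after a manual edit, would tell whether the
document is allowed to leave the queue.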

