I like the barcode/QR code idea very much; it would allow for batch scanning. For example, several documents could be placed in a scanner with a document feeder, each with a printed page carrying a barcode that defines its metadata, kind of like fax cover pages. Regional OCR is a must-have feature and usually a defining feature of the commercial offerings. I don't know how accurate OCRing a rectangle of text would be, but if there is a need for the feature, let's do it. We need a way to let users mark/highlight the fields they want scanned and entered as metadata. This would require some design decisions (do we store the cursor's x and y positions of the square to be scanned, or the x and y as percentages in relation to the current zoom level?) and a rich client with corresponding API endpoints to talk to the backend.
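To make the coordinate question concrete, here is a rough sketch of the percentage option. It is not existing Mayan code: `NormalizedRegion` and `region_from_selection` are hypothetical names, and it assumes the client reports the selection corners together with the size of the page image as currently displayed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalizedRegion:
    """Rectangle stored as fractions (0.0 to 1.0) of the page width and height."""
    left: float
    top: float
    right: float
    bottom: float

    def to_pixels(self, page_width, page_height):
        """Return (left, top, right, bottom) in pixels for a page of the given size."""
        return (
            round(self.left * page_width),
            round(self.top * page_height),
            round(self.right * page_width),
            round(self.bottom * page_height),
        )


def region_from_selection(x0, y0, x1, y1, view_width, view_height):
    """Convert a cursor selection made on a zoomed view into zoom-independent fractions."""
    # Sort the corners so the user can drag in any direction.
    left, right = sorted((x0, x1))
    top, bottom = sorted((y0, y1))
    return NormalizedRegion(
        left / view_width,
        top / view_height,
        right / view_width,
        bottom / view_height,
    )


# Selection drawn on an 800x1000 preview, later applied to a 1600x2000 render.
region = region_from_selection(100, 50, 300, 150, 800, 1000)
box = region.to_pixels(1600, 2000)
```

Storing fractions means the saved region does not depend on the zoom level at which it was drawn. The pixel box from `to_pixels` could then be handed to whatever cropping step runs before OCR.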
On Wednesday, August 27, 2014 5:22:32 PM UTC-4, Mathias Behrle wrote:
>
> * Roberto Rosario: " Re: [Mayan EDMS: 761] Automatic upload from certain
>   staging folder" (Wed, 30 Jul 2014 13:36:51 -0400):
>
> > This feature was actually started some time ago (
> > https://github.com/mayan-edms/mayan-edms/blob/master/mayan/apps/sources/models.py#L194)
> > but is not yet enabled because it depends on some scheduling update that
> > have not made it into the master branch.
> >
> > As for metadata, I came up with some ideas but none are implemented. One
> > was to let users set default metadata values as well as document type for
> > each watch folder. Another idea was when a document is being imported from
> > a watch folder to look for a file with the same name but with the .metadata
> > extension. No design decision has been reached yet so any ideas are
> > welcomed.
>
> Both possibilities could have their individual use cases, for which they fit
> best. The most flexible approach is the second.
>
> What I found when evaluating other DMS software:
>
> - Inclusion of some identifier on the document (could be a barcode, or some
>   special formatted string, or...). This identifier must not necessarily be
>   fixed on the document, but could be the first page of a scan or some paper
>   scanned together with the document. This method applies preferably to
>   scanned documents.
>

I like the barcode/qrcode idea very much, would allow for batch scanning, for example several documents placed in a scanner with a document feeder and each document has a printed page with a barcode defining the metadata kind of like FAX cover pages.

> - Rather straightforward is a sort of recognition, where templates can be
>   defined containing regions formatted in an individual way. E.g. if you
>   have a supplier with his custom invoice format displaying the invoice
>   number, date, amount at fixed places, they could be used on such a
>   template and the software can check, if the document contains such a
>   region.
>

Regional OCR is a must have feature and usually a defining feature of the commercial offerings, I don't know how accurate OCRing a rectangle of text would be but if there is a need for the feature let's do it. I see some requirements, we need a way to let users mark/highlight the fields they want scanned and entered as metadata. This would require some design decisions (do we store the cursor's x and y positions of the square to be scanned or the x and y % in relation to the current zoom level) and a rich client w/ corresponding API endpoints to talk to the backend.

> Perhaps this could be used slightly modified but simpler by defining
> string patterns, that could be matched on the OCR result. So at last
> repeating patterns could be used to extract metadata.
>
> In any case I would find useful a document queue containing documents
> already processed (OCR available), but still to be completed with metadata.
> So to speak an inversion of the current workflow (where metadata are
> defined first).
> As already discussed in https://github.com/mayan-edms/mayan-edms/issues/9)
> I think it would be best for the manual completion of metadata to have a
> view of the document together with its OCR data available directly on the
> metadata form.
>
> I am imagining a staging folder, from which the documents are processed
> immediately. If after the initial processing no metadata are available
> for the document, it is added to the postprocessing queue. When finally
> (manually) processed, those documents are removed from the queue.
>
> There should be some configuration options:
> - Which metadata are required to be filled for a document to be able to
>   leave the queue?
> - Should only documents missing the required metadata be added to the
>   queue or just all (if postprocessing control for all processed documents
>   is desired)?
>

I'm still wrapping my head around the post processing queue, do we create a post processing attached or detached from the watch folder?

> So far my brainstorming at this very moment, comments as always very
> welcome.
>

Thanks a lot Mathias!

> > On Jul 30, 2014 12:02 PM, "Joshua Jonah" wrote:
> >
> > > This would be a specific directory only containing a specific type of
> > > file.
> > > On Jul 30, 2014 12:00 PM, "Michel Lavoie" wrote:
> > >
> > >> It seems like a good idea, but how would you handle metadata? I thought
> > >> about using a script to automate uploads for a specific documents but
> > >> I'm afraid that poorly documented files would render my collection
> > >> useless in the end.
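To help the discussion of the string-pattern idea and the "required metadata" option for the post-processing queue, here is a minimal sketch of how the two could fit together. None of this exists in Mayan: the field names, the regular expressions, and the functions `extract_metadata` and `needs_postprocessing` are made up for illustration, and the OCR text is assumed to arrive as a plain string.

```python
import re

# Hypothetical patterns for one document type; a real setup would make these
# user-configurable rather than hard-coded.
PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|#)\s*:?\s*(\w[\w-]*)", re.IGNORECASE
    ),
    "invoice_date": re.compile(r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
}

# Hypothetical answer to "which metadata are required to leave the queue?"
REQUIRED = {"invoice_number", "invoice_date"}


def extract_metadata(ocr_text, patterns=PATTERNS):
    """Return the first match of each pattern found in the OCR text."""
    found = {}
    for name, pattern in patterns.items():
        match = pattern.search(ocr_text)
        if match:
            found[name] = match.group(1)
    return found


def needs_postprocessing(metadata, required=REQUIRED):
    """True if any required field is missing or empty, so the document stays queued."""
    filled = {name for name, value in metadata.items() if value}
    return not required.issubset(filled)


complete = extract_metadata(
    "ACME Corp\nInvoice No: INV-2014-0042\nDate: 2014-08-27\nTotal: 99.00"
)
partial = extract_metadata("Invoice # 77\n(date unreadable)")
```

With these inputs, `complete` has both fields and would skip the queue, while `partial` lacks a date and would wait for manual completion. Whether the queue hangs off the watch folder or stands alone is still open; the check itself does not depend on that choice.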
