Re: Document validation involving other documents
On Sun, Jan 24, 2010 at 12:21:25PM -0800, Chris Anderson wrote:
> The problem with this approach is that validation is run during replication as well, so any multi-doc data dependencies become problematic in ad-hoc clusters.

But not every application makes sense as an ad-hoc cluster. In tight-knit clusters the databases trust each other and you want the data to be as coherent as possible, so you'd run replication as a user which has permit-everything rights in validate_doc_update. In these models you're more interested in validating the data once at its point of entry, not at every point of replication.

It would be horrendous to have a document in instance 1 but not in instance 2, just because it was accepted initially according to some set of rules, but failed to replicate because the rules had changed in the meantime. This is especially true if the rules themselves are documents, and hence may be a bit stale. At worst you may accept an update which would be invalid if you had the most up-to-date rules, or reject one which would be valid, but the fact that you *did* accept or reject it should be consistent throughout the cluster.
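The permit-everything replication user described above can be sketched as a validate_doc_update function that short-circuits for a trusted role. This is a minimal illustration, not the poster's actual code; the "replicator" role name and the document fields checked are assumptions for the example:

```javascript
// Sketch of a validate_doc_update that enforces rules at the point of
// entry but waves through a trusted replication user, so intra-cluster
// replication is never blocked by rule changes. The "replicator" role
// name is an assumption for illustration, not a built-in CouchDB role.
function validate_doc_update(newDoc, oldDoc, userCtx) {
  // Trusted replication user: accept everything it pushes.
  if (userCtx.roles.indexOf("replicator") !== -1) {
    return;
  }
  // Point-of-entry rules for ordinary users.
  if (!newDoc.type) {
    throw { forbidden: "every document needs a type field" };
  }
  if (newDoc.type === "inspection" && !newDoc.code) {
    throw { forbidden: "inspection documents need an inspection code" };
  }
}
```

In CouchDB this function would live in a design document; written as plain JavaScript it can also be exercised directly with sample docs and user contexts.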
Document validation involving other documents
I've been trying to eliminate, or at least vastly reduce, the need for middleware in a specific class of applications, but this problem probably cuts that task short. We have incoming documents that need to be validated against data stored in other docs (rate tables, technical abbreviations, inspection codes).

There is a high procedural cost to allowing documents in and validating them after saving (queue processing), but validation functions don't allow bringing in external data (correct me if I'm wrong, please). Without writing middleware or sideware (an external handler), I don't see a way to do this.

What are the arguments against including a native 'lookup' function available to validation functions only? This seems to be a side-effect-free use case to me. Reads are cheap, correct?

If my memory hasn't completely failed, this was a significant problem in Lotus Notes until such a function was added, somewhere around version 3.3 or 4.0.

David Richardson
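The proposed 'lookup' primitive does not exist in CouchDB. As a hypothetical sketch of what David is asking for, the read-only lookup can be injected as a callback, so the shape of such a validation function can be demonstrated against an in-memory stub (all names and document shapes here are illustrative assumptions):

```javascript
// Hypothetical sketch only: CouchDB's validate_doc_update has no native
// 'lookup' function. Here the proposed read-only lookup is injected as a
// callback so the idea can be exercised against an in-memory stub.
function makeValidator(lookup) {
  return function (newDoc, oldDoc, userCtx) {
    if (newDoc.type === "inspection") {
      // Validate the inspection code against a codes document that
      // already lives in the database. Read-only: no side effects.
      var codes = lookup("inspection-codes");
      if (!codes || codes.valid.indexOf(newDoc.code) === -1) {
        throw { forbidden: "unknown inspection code: " + newDoc.code };
      }
    }
  };
}

// Stub standing in for the database: one codes document.
var fakeDb = { "inspection-codes": { valid: ["A1", "B2", "C3"] } };
var validate = makeValidator(function (id) { return fakeDb[id]; });
```

The injected callback also makes the side-effect-free property explicit: the validator can read but has no handle through which to write.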
Re: Document validation involving other documents
That's exactly what creates the high associated procedural cost - the documents are submitted by mobile workers who connect for very short times, sometimes very infrequently. If the document is accepted with invalid entries and processed later, the opportunity to alert the author of the problem(s) during the request/response cycle is lost. This means someone else must fix the mistake (extra work), and a very unwieldy process (usually manual) must be instituted to alert the author (training/feedback is harder and less effective).

I'm now willing to accept extra moving parts living in front of the db - there are other things that are easier to accomplish there too. The cron async process smells like Greenspun's 10th law, extrapolated to transaction managers and message brokers ;-)

David

> On Sun, Jan 24, 2010 at 12:06 PM, David Richardson techni...@enquora.com wrote:
> > I've been trying to eliminate, or at least vastly reduce, the need for middleware in a specific class of applications, but this problem probably cuts that task short. We have incoming documents that need to be validated against data stored in other docs (rate tables, technical abbreviations, inspection codes). There is a high procedural cost to allowing documents in and validating them after saving (queue processing), but validation functions don't allow bringing in external data (correct me if I'm wrong, please). Without writing middleware or sideware (an external handler), I don't see a way to do this. What are the arguments against including a native 'lookup' function available to validation functions only? This seems to be a side-effect-free use case to me. Reads are cheap, correct? If my memory hasn't completely failed, this was a significant problem in Lotus Notes until such a function was added, somewhere around version 3.3 or 4.0.
> >
> > David Richardson
>
> The problem with this approach is that validation is run during replication as well, so any multi-doc data dependencies become problematic in ad-hoc clusters.
>
> The way to do this without middleware is to have a backend asynchronous process (maybe node.js) that consumes _changes and acts on particular updates. So a user saves the doc with a pending state and the _changes handler sees that and initiates validation. (One trick you can do with this is use the _changes heartbeat to generate a cron trigger, so your async process can also do things based on time.)
>
> I'd like to see this async/cron functionality bundled with Couch itself, but there's only so much time in the day and I don't think it's worth delaying any releases for. Patches extremely welcome, but I think this one might require more discussion before anyone would be ready to run off and write it.
>
> Chris
>
> --
> Chris Anderson
> http://jchrisa.net
> http://couch.io
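Chris's pending-state pattern can be sketched as a small handler that consumes rows from _changes and promotes or rejects pending documents. This is a minimal sketch, not a full client: the feed and database are stubbed with in-memory objects, and field names like "status" are assumptions for illustration:

```javascript
// Sketch of the async _changes consumer described above: documents are
// saved with status "pending"; the handler validates them against data
// already in the database and flips them to "valid" or "invalid".
// The changes feed and the database are stubbed with in-memory objects;
// field names like "status" are assumptions for illustration.
var db = {
  "rates": { valid: ["STD", "OT", "DT"] },
  "doc-1": { status: "pending", rateCode: "OT" },
  "doc-2": { status: "pending", rateCode: "XX" }
};

function handleChange(change) {
  var doc = db[change.id];
  if (!doc || doc.status !== "pending") return; // only act on pending docs
  // Multi-document dependency: check the rate code against the rates doc.
  var ok = db["rates"].valid.indexOf(doc.rateCode) !== -1;
  doc.status = ok ? "valid" : "invalid"; // atomic single-doc update
}

// In a real deployment this loop would be a long-poll or continuous
// _changes request; the heartbeat trick mentioned above would let the
// same process fire time-based (cron-like) work between changes.
[{ id: "doc-1" }, { id: "doc-2" }].forEach(handleChange);
```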
Re: Document validation involving other documents
On Sun, Jan 24, 2010 at 2:25 PM, David Richardson techni...@enquora.com wrote:
> That's exactly what creates the high associated procedural cost - the documents are submitted by mobile workers who connect for very short times, sometimes very infrequently. If the document is accepted with invalid entries and processed later, the opportunity to alert the author of the problem(s) during the request/response cycle is lost. This means someone else must fix the mistake (extra work), and a very unwieldy process (usually manual) must be instituted to alert the author (training/feedback is harder and less effective).
>
> I'm now willing to accept extra moving parts living in front of the db - there are other things that are easier to accomplish there too. The cron async process smells like Greenspun's 10th law, extrapolated to transaction managers and message brokers ;-)

The advantage of having an asynchronous document state machine handler is a whole lot of decoupling, as the queue is just a side effect of the database.

> I've been trying to eliminate, or at least vastly reduce, the need for middleware in a specific class of applications, but this problem probably cuts that task short.

I'm wary of the procedural-controller middleware approach because I'm afraid it encourages you to program as though you have transactions. (Eg: change these 4 documents at the same time and hope for the best.)

If you are careful with asynchronous handlers, you can pass the state through the database _changes feed in a way that doesn't give devs a false sense of multi-document transactions. This is why I think an asynchronous node.js _changes handler (or browser ajax handlers) or event-based programming in any language is such a good fit for CouchDB.

If you transmit all of your important events through the database, even real-time apps can scale by using replication. Each doc update is atomic, so there's room to model larger transactions as a series of atomic updates.
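Modeling a larger transaction as a series of atomic single-document updates, as suggested above, can be sketched like this. The document shapes, field names, and status values are illustrative assumptions, not anything from the thread:

```javascript
// Sketch: a "transaction" spanning two documents is modeled as a series
// of atomic single-doc updates, each recording its state. No step
// touches more than one document, so every write stays atomic in
// CouchDB terms. Document shapes and statuses are illustrative.
var order = { _id: "order-7", status: "created", item: "widget", qty: 2 };
var stock = { _id: "stock-widget", onHand: 10, reserved: 0 };

// Step 1: atomically reserve stock (only the stock doc is updated).
function reserve(stockDoc, orderDoc) {
  if (stockDoc.onHand - stockDoc.reserved < orderDoc.qty) return false;
  stockDoc.reserved += orderDoc.qty;
  return true;
}

// Step 2: atomically advance the order's state (only the order doc is
// updated), based on whether the reservation step succeeded.
function confirm(orderDoc, reserved) {
  orderDoc.status = reserved ? "confirmed" : "rejected";
}

confirm(order, reserve(stock, order));
```

An async _changes handler would drive these steps in sequence, so intermediate states are visible in the database rather than hidden inside a transaction.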
Chris

--
Chris Anderson
http://jchrisa.net
http://couch.io
Re: Document validation involving other documents
On Sun, Jan 24, 2010 at 5:28 PM, David Richardson techni...@enquora.com wrote:
> > I'm wary of the procedural-controller middleware approach because I'm afraid it encourages you to program as though you have transactions. (Eg: change these 4 documents at the same time and hope for the best.)
>
> I fear I haven't communicated the use case adequately. We're talking about validating a single document using a formula which depends on data residing in one of several documents already in the database. Think rates, crew lists, technical abbreviations, prior documents. We aren't changing anything on either the incoming document or the data documents used for validation - simply determining whether the state of the incoming document is valid. The validating data *could* be compiled into the validation function, but that function would need to be updated every time the validation data changes. Possible, but inelegant.
>
> > If you are careful with asynchronous handlers, you can pass the state through the database _changes feed in a way that doesn't give devs a false sense of multi-document transactions.
>
> If we don't validate and communicate failures back to the author during the request/response cycle, we may lose the chance to do so for a week or more. It also has financial implications - I've seen documents worth $75,000 in billings that could not be processed for an extra week until things were straightened out. It gets tiresome in a hurry. The documents arrive from clients that are usually not connected to the internet. This is the common state of affairs in the mobile world. By definition an asynchronous handler is outside the request/response cycle, so the communication opportunity is lost. Additionally, the validation process is no longer atomic. If that is the eventual path, I'll hitch this wagon to a proper message broker/transaction manager any day.

The validation wouldn't be atomic to the original write, but the state machine model can, eg, ignore pending docs in views. I do see what you are saying about mobile browsers. One thing would be to use CouchDB validations to handle well-formedness and an async process to handle dependencies with other docs. There are also plans for a security object which might do the trick for you. The security object would be a _local document sent to the validation function along with updates.

> This is exactly the usage scenario WebMachine was designed for, methinks. I've seen sporadic references to webmachine in couchdb, including one last week I thought. Is anything happening there?

I don't think we'd be planning to embed arbitrary webmachine controllers. We really want to avoid any kind of multi-document transaction-like api. We also don't want writes to be blocked by other queries. For these reasons, the async update process makes sense.

For the mobile need-to-respond-now use case, I think something like node.js would be perfect. It would sit in front of the couch and submit the update. Then it would listen to the changes feed until it sees the document move from pending to valid, and respond to the client with OK. Does that make sense? I know it's not CouchDB only, but it does do the trick.

Another option would be to write your own mochiweb or couchdb handler and link it in using the config api. That is simple if you can handle running on a custom-configured couchdb.

Chris

> cheers,
> David
>
> p.s. I hope this is of some interest to other people. I hear there's been a recent small uptick in network activity by mobile clients on unreliable, non-homogeneous networks ;-)

--
Chris Anderson
http://jchrisa.net
http://couch.io
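The front-end pattern sketched above - submit the update, then watch the changes feed until the document leaves the pending state, and only then answer the mobile client - can be illustrated as follows. The feed is simulated with an array of recorded change events, and all field and function names are illustrative assumptions:

```javascript
// Sketch of the node.js front-end described above: accept the mobile
// client's document, save it as "pending", then watch the changes feed
// until the async validator has moved it to "valid" or "invalid", and
// only then answer the client. The feed is simulated with an array of
// recorded change events; all field names are illustrative assumptions.
function respondWhenSettled(docId, changeEvents) {
  for (var i = 0; i < changeEvents.length; i++) {
    var ev = changeEvents[i];
    if (ev.id !== docId) continue; // ignore other documents
    if (ev.doc.status === "valid") return { code: 201, body: "OK" };
    if (ev.doc.status === "invalid") {
      // The failure reaches the author inside the request/response cycle.
      return { code: 403, body: ev.doc.reason || "validation failed" };
    }
    // status "pending": keep listening
  }
  return { code: 202, body: "still pending" }; // gave up waiting
}

// Simulated feed: the validator flips doc-9 from pending to invalid.
var feed = [
  { id: "doc-9", doc: { status: "pending" } },
  { id: "other", doc: { status: "valid" } },
  { id: "doc-9", doc: { status: "invalid", reason: "unknown rate code" } }
];
```

In a real deployment the loop would be an asynchronous continuous _changes request with a timeout, but the decision logic is the same.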