On Thu, Feb 19, 2009 at 6:39 PM, Ben Browning <[email protected]> wrote: > Overall the model sounds very similar to what I was thinking. I just > have a few comments. > >> In this model documents are saved to a leaf node depending on a hash >> of the docid. This means that lookups are easy, and need only to touch >> the leaf node which holds the doc. Redundancy can be provided by >> maintaining R replicas of every leaf node. > > There are several use-cases where a true hash of the docid won't be the > optimal partitioning key. The simple case is where you want to partition > your data by user and in most non-trivial cases you won't be storing > all of a user's data under one document with the user's id as the docid. > A fairly simple solution would be allowing the developer to specify a > javascript > function somewhere (not sure where this should live...) that takes a docid and > spits out a partition key. Then I could just prefix all my doc ids for > a specific user > with that user's id and write the appropriate partition function. > >> >> View queries, on the other hand, must be handled by every node. The >> requests are proxied down the tree to leaf nodes, which respond >> normally. Each proxy node then runs a merge sort algorithm (which can >> sort in constant space proportional to # of input streams) on the view >> results. This can happen recursively if the tree is deep. > > If the developer has control over partition keys as suggested above, it's > entirely possible to have applications where view queries only need data > from one partition. It would be great if we could do something smart here or > have a way for the developer to indicate to Couch that all the data should > be on only one partition. > > These are just nice-to-have features and the described cluster setup could > still be extremely useful without them.
I think they are both sensible optimizations. Damien's described the JS partition function before on IRC, so I think it fits into the model. As far as restricting view queries to just those docs within a particular id range, it might make sense to partition by giving each user their own database, rather than logic on the docid. In the case where you need data in a single db, but still have some queries that can be partitioned, its still a good optimization. Luckily even in the unoptimized case, if a node has no rows to contribute to the final view result than it should have a low impact on total resources needed to generate the result. Chris -- Chris Anderson http://jchris.mfdz.com
