Thanks for the clarification, Mike. I figured that having multiple shard
keys per doc would be a big change, but I was hoping I was wrong ;). I
still think your proposed solution will add a lot of value to CouchDB.
Unfortunately, it isn't the silver bullet that makes it feasible to step
away from db-per-user in many cases, because in the end you need
shards/partitions that correspond to your users' queries; otherwise every
query has to visit each shard, which doesn't scale well for a very large
database with lots of nodes.

In my mind, either the database has to "replicate" the data to the right
shards/partitions or you have to do this manually.
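To make that cost concrete, here is a minimal Python sketch of the
scatter-gather effect (the shard count and hash function are illustrative
assumptions, not CouchDB's actual hashing ring):

```python
from hashlib import md5

NUM_SHARDS = 8  # illustrative cluster size, not a CouchDB default

def shard_for(key: str) -> int:
    # Stable hash of the shard key onto one of the shards.
    return int(md5(key.encode()).hexdigest(), 16) % NUM_SHARDS

def shards_to_visit(shard_key=None):
    # With a shard key the query is routed to one shard; without it,
    # every shard must be visited and the results merged.
    if shard_key is not None:
        return [shard_for(shard_key)]
    return list(range(NUM_SHARDS))

print(len(shards_to_visit("alice")))  # -> 1
print(len(shards_to_visit()))         # -> 8
```

The gap between those two numbers only grows as nodes are added, which is
the scalability concern above.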

On Wed, Feb 7, 2018 at 2:43 AM Mike Rhodes <mrho...@linux.vnet.ibm.com>
wrote:

> Geoff,
>
> Apologies for taking ages to reply.
>
> You can only have a single shard key per document, because the shard key
> directly affects where the document ends up being written to disk and,
> modulo shard replicas, a document is only in one place within the primary
> data shards.
>
> I think what you are thinking about is a major change to global view
> queries (i.e., what couchdb has now) which affects how indexes are sharded.
> Instead of an index shard being essentially the index for a given primary
> data shard, you shard the index based on the keys emitted in the view map
> function. This enables global view queries to be directed to a single
> shard based on the keys involved.
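As a rough illustration of what Mike describes, a toy Python sketch where
index rows are routed by the emitted key instead of by the primary data
shard (the shard count, hashing, and map-function shape are all assumptions
for the example, not how CouchDB views work today):

```python
from collections import defaultdict
from hashlib import md5

NUM_INDEX_SHARDS = 4  # illustrative

def index_shard_for(emitted_key):
    # Route each index row by the key the map function emitted,
    # not by where the source document's primary shard lives.
    return int(md5(repr(emitted_key).encode()).hexdigest(), 16) % NUM_INDEX_SHARDS

index_shards = defaultdict(list)

def index_doc(doc, map_fn):
    for key, value in map_fn(doc):
        index_shards[index_shard_for(key)].append((key, doc["_id"], value))

def query(key):
    # An exact-key global view query now hits a single index shard.
    rows = index_shards[index_shard_for(key)]
    return [r for r in rows if r[0] == key]

index_doc({"_id": "d1", "class": "math"}, lambda d: [(d["class"], 1)])
print(query("math"))  # -> [('math', 'd1', 1)]
```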
>
> This is a massive change, and probably affects a bunch of the assumptions
> built into the way queries are executed by the clustering layer, as well as
> the actual mechanics of indexing primary data on one node but writing
> index data to another node (as the shard key for a document is no longer
> the same as the shard key for the indexed data, which it implicitly is in
> the current clustering model).
>
> This proposal avoids this kettle of fish, but it is a really nice way to
> speed up queries in a simpler global model. I think it ends up being
> something that works well with my proposal but which is pretty orthogonal
> in terms of implementation.
>
> Mike.
>
> > On 23 Jan 2018, at 20:27, Adam Kocoloski <kocol...@apache.org> wrote:
> >
> > The way I understand the proposal, you could satisfy at most one of
> > those requests (probably the `username` one) with a local query. The
> > other one would have to be a global query, but the proposal does allow
> > for a mix of local and global queries against the same dataset.
> >
> > Adam
> >
> >> On Jan 22, 2018, at 9:27 AM, Geoffrey Cox <redge...@gmail.com> wrote:
> >>
> >> Hey Mike,
> >>
> >> I've been thinking more about your proposal above and when it is
> >> combined with the new access-per-db enhancement it should greatly
> >> reduce the need for db-per-user. One thing that I'm left wondering
> >> though is whether there is consideration for different shard keys per
> >> doc. From what I gather in your notes above, each doc would only have
> >> a single shard key and I think implementing this alone will take
> >> significant work. However, if there was a way to have multiple shard
> >> keys per doc then you could avoid having duplicated data.
> >>
> >> For example, assume a database of student work:
> >>
> >>  1. Each doc has a `username` that corresponds with the owner of the
> >>     doc
> >>  2. Each doc has a `classId` that corresponds with the class for
> >>     which the assignment was submitted
> >>
> >> Ideally, you'd be able to issue a query with a shard key specific to
> >> the `username` to get a student's work and yet another query with a
> >> shard key specific to the `classId` to get the work from a teacher's
> >> perspective. Would your proposal allow for something like this?
> >>
> >> If not, I think you'd have to do something like duplicate the data,
> >> e.g. add another doc that has the username of the teacher so that you
> >> could query from the teacher's perspective. This of course could get
> >> pretty messy when you consider more complicated scenarios as you
> >> could easily end up with a lot of duplicated data.
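To sketch the duplication Geoff describes, assuming the proposal's
`<shardkey>:<docid>` ID format (the field names, IDs, and helper here are
made up for the example):

```python
# A doc can carry only one shard key in its ID, so serving the
# teacher's view from a partition means writing the data twice.
def make_doc(shard_key, doc_id, body):
    return {"_id": f"{shard_key}:{doc_id}", **body}

submission = {"username": "alice", "classId": "math101", "grade": "A"}

by_student = make_doc("alice", "hw1", submission)        # student's partition
by_class = make_doc("math101", "alice-hw1", submission)  # duplicate for teacher

print(by_student["_id"])  # -> alice:hw1
print(by_class["_id"])    # -> math101:alice-hw1
```

Every extra access pattern adds another copy to keep in sync, which is
where the messiness comes from.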
> >>
> >> Thanks!
> >>
> >> Geoff
> >>
> >> On Tue, Nov 28, 2017 at 5:35 AM Mike Rhodes <mrho...@linux.vnet.ibm.com>
> >> wrote:
> >>
> >>>
> >>>> On 25 Nov 2017, at 15:45, Adam Kocoloski <kocol...@apache.org> wrote:
> >>>>
> >>>> Yes indeed Jan :) Thanks Mike for writing this up! A couple of
> >>>> comments on the proposal:
> >>>>
> >>>>>    • For databases where this is enabled, every document needs a
> >>>>>    shard key.
> >>>>
> >>>> What would happen if this constraint were relaxed, and documents
> >>>> without a “:” in their ID simply used the full ID as the shard key
> >>>> as is done now?
> >>>
> >>> I think that practically it's not that awful. Documents without
> >>> shard keys end up spread reasonably, albeit uncontrollably, across
> >>> shards.
> >>>
> >>> But I think from a usability perspective, forcing this to be all or
> >>> nothing for a database makes sense. It makes sure that every
> >>> document in the database behaves the same way rather than having a
> >>> bunch of stuff that behaves one way and a bunch of stuff that
> >>> behaves a different way (i.e., you can find some documents via shard
> >>> local queries, whereas others are only visible at a global level).
> >>>
> >>> I think that if people want documents to behave that differently,
> >>> enforcing different databases is helpful. It reinforces the point
> >>> that these databases work well for use-cases where partitioning data
> >>> using the shard key makes sense, which is a different method of data
> >>> modelling than having one huge undifferentiated pool. Perhaps there
> >>> are heretofore unthought-of optimisations that only make sense if we
> >>> can make this assumption too :)
> >>>
> >>>>
> >>>>>    • Query results are restricted to documents with the shard key
> >>>>>    specified. Which makes things harder but leaves the door open
> >>>>>    for future things like shard-splitting without changing result
> >>>>>    sets. And it seems like what one would expect!
> >>>>
> >>>> I agree this is important. It took me a minute to remember the
> >>>> point here, which is that a query specifying a shard key needs to
> >>>> filter out results from different shard keys that happen to be
> >>>> colocated on the same shard.
> >>>>
> >>>> Does the current query functionality still work as it did before
> >>>> in a database without shard keys? That is, can I still issue a
> >>>> query without specifying a shard key and have it collate a response
> >>>> from the full dataset? I think this is worth addressing explicitly.
> >>>> My assumption is that it does, although I’m worried that there may
> >>>> be a problematic interaction if one tried to use the same physical
> >>>> index to satisfy both a “global” query and a query specifying a
> >>>> shard key.
> >>>
> >>> I think this is an interesting question.
> >>>
> >>> To start with, I guess the basic thing is that to efficiently use
> >>> an index you'd imagine that you'd prefix the index's columns with
> >>> the shard key -- at least that's the thing I've been thinking, which
> >>> likely means cleverer options are available :)
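One way to picture that prefixing, as a toy sorted index in Python (purely
a sketch; real view indexes are B-trees with CouchDB's own collation
rules):

```python
import bisect

index = []  # sorted rows of (shard_key, field_value, doc_id)

def insert(shard_key, value, doc_id):
    bisect.insort(index, (shard_key, value, doc_id))

def scan_partition(shard_key):
    # Prefixing keys with the shard key keeps one partition's rows
    # contiguous, so a shard-local query is a single range scan.
    start = bisect.bisect_left(index, (shard_key,))
    rows = []
    for row in index[start:]:
        if row[0] != shard_key:
            break
        rows.append(row)
    return rows

insert("alice", "2018-01-22", "hw1")
insert("bob", "2018-01-20", "hw9")
insert("alice", "2018-01-25", "hw2")
print(scan_partition("alice"))  # only alice's rows, in key order
```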
> >>>
> >>> My first thought is that the naive approach to filtering documents
> >>> not matching a shard key is just that -- a node hosting a replica of
> >>> a shard does a query on an index as normal and then there's some
> >>> extra code that filters based on ID. Not actually super-awful -- we
> >>> don't have to actually read the document itself for example -- but
> >>> for any use-case where there are many shard keys associated with a
> >>> given shard it feels like one can do better. But as long as the node
> >>> querying the index is doing it, it feels pretty fast.
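That naive filter amounts to something like the following (the row shape
is made up for illustration; real view rows carry more fields):

```python
def filter_rows(rows, shard_key):
    # Post-filter an ordinary index scan: keep rows whose doc ID starts
    # with the requested shard key. Only the ID is inspected -- the
    # document body is never read.
    prefix = shard_key + ":"
    return [r for r in rows if r["id"].startswith(prefix)]

rows = [
    {"id": "alice:hw1", "key": "2018-01-22"},
    {"id": "bob:hw9", "key": "2018-01-22"},
]
print(filter_rows(rows, "alice"))  # -> [{'id': 'alice:hw1', 'key': '2018-01-22'}]
```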
> >>>
> >>> I would wonder whether some more generally useful work on Mango
> >>> could help reduce the amount of special case code going on:
> >>>
> >>> - Push index selection down to each shard.
> >>> - Allow Mango to use multiple indexes to satisfy a query (even if
> >>>   this is simply for AND relationships).
> >>>
> >>> Then for any database with the shard key bit set true, the shards
> >>> also create a JSON index based on the shard key, and we can append
> >>> an `AND shardkey=foo` to the users' Mango selector. As our shard
> >>> keys are in the doc ID, I don't think this is any faster at all. It
> >>> would be if the shard key was more complicated, say a field in the
> >>> doc, so we didn't have it to hand all the time. But it would
> >>> certainly make the alteration for the shard local path much more
> >>> contained and have very wide utility beyond this case.
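That selector rewrite could be as small as wrapping the user's selector in
an `$and` (the `shardkey` field name is illustrative; in the proposal the
shard key actually lives in the doc ID):

```python
def scope_selector(selector, shard_key):
    # AND the user's Mango selector with a clause pinning the shard key,
    # so the existing index-selection machinery does the rest.
    return {"$and": [selector, {"shardkey": {"$eq": shard_key}}]}

user_selector = {"type": {"$eq": "submission"}}
print(scope_selector(user_selector, "alice"))
```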
> >>>
> >>> For views, I'm less sure there's anything smart you can do that doesn't
> >>> add tonnes of overhead -- like making two indexes per view, one that's
> >>> prefixed with the shard key and one which is not. This approach has all
> >>> sorts of nasty interactions with things like reverse=true I imagine,
> >>> however.
> >>>
> >>> Mike.
> >>>
> >
>
>
