Thanks for kicking this off, Garren! I have a few questions / thoughts:

1. Will these indexes continue to have associated design documents? I
assume this aspect wouldn't change, but it would be good to be explicit.
Aside from being required for replication, it's useful to be able to query
the /db/_design/<ddoc>/_info endpoint to get the index size etc. Perhaps
this also addresses the issue around how to get the set of available
indexes?

2. Does the ICU sort key have a bounded length? Mostly I'm wondering
whether we can guarantee that the generated keys will fit within the
maximum FDB key length or if there needs to be some thought as to the
failure mode / workaround. As Adam mentioned, it seems fine to store an
encoded key given Mango (currently) always fetches the associated document
/ fields from the primary index to filter on anyway. It might even be
beneficial to have an additional layer of indirection and allow multiple
docs to be associated with each row so that we can maintain compact keys.

3. I don't immediately see how you clear previous values from the index
when a doc is updated, but I could easily be missing something obvious :)

4. Regarding "Index on write" behaviour, is there something in the existing
design (Mango overlaying mrview / lucene) that would prevent this? I can
see some benefit for certain workloads (and headaches for others) but I
don't see that it's necessarily coupled to the Mango design given
background indexing of new/changed indexes needs to be supported anyway.

5. I have a few concerns about the performance given we'll no longer be
able to push down filtering to the data storage tier. It sounds [1] like a
key thing will be the extent to which we can concurrently fetch/filter
non-overlapping key ranges from FDB - likely something that will come up in
relation to views as well. This gets a bit more fun when considering
multi-tenancy, and is perhaps something to stew on whilst getting something
functional together.
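
To make question 2 concrete, here is a minimal sketch of the kind of guard an indexer might need. It assumes FoundationDB's documented limit of 10,000 bytes per key; `toy_sort_key` and `fits_in_fdb_key` are hypothetical names, and `toy_sort_key` is only a stand-in for an ICU sort key (the one property it shares is that its length grows with the input string).

```python
# Assumption: FoundationDB's documented hard limit on key size, in bytes.
MAX_KEY_BYTES = 10_000


def toy_sort_key(text: str) -> bytes:
    """Hypothetical stand-in for an ICU sort key; NOT real ICU collation."""
    return text.casefold().encode("utf-8") + b"\x00"


def fits_in_fdb_key(encoded_key: bytes, subspace_prefix: bytes = b"") -> bool:
    """Return True if prefix + key stays within the assumed key size limit."""
    return len(subspace_prefix) + len(encoded_key) <= MAX_KEY_BYTES


short_ok = fits_in_fdb_key(toy_sort_key("john"), b"idx/")
long_ok = fits_in_fdb_key(toy_sort_key("x" * 20_000), b"idx/")
```

What to do when the check fails (reject the write, truncate, or add a layer of indirection) is exactly the open question above; the sketch only shows where the check would sit.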

Will

[1] https://forums.foundationdb.org/t/secondary-indexing-approaches/792/4


On Thu, 28 Mar 2019 at 17:48, Adam Kocoloski <kocol...@apache.org> wrote:

> Hi Garren, cool, this is a good start.
>
> On the ICU side of things, Russell pointed out that sort keys are a
> one-way trip; i.e., there’s no way to recover the original string from a
> sort key. For the initial pass at Mango I think that’s OK, as we’re reading
> the indexed documents anyway. When we get to views I guess the design will
> need to store the original string in the value so that we can return it as
> the “key” field in the response.
>
> Adam
>
> > On Mar 28, 2019, at 7:01 AM, Garren Smith <gar...@apache.org> wrote:
> >
> > Hi everyone,
> >
> >
> > I want to start a discussion, with the aim of an RFC, around implementing
> > Mango JSON indexes for FoundationDB. Currently Mango indexes are a layer
> > above CouchDB map/reduce indexes, but with FoundationDB we can make them
> > separate indexes in FoundationDB. This gives us the possibility of
> > updating the indexes in the same transaction that a document is saved in.
> > Later we can look at adding specific Mango-like covering indexes.
> >
> >
> > Let's dive into the data model. Currently a user defines an index like
> > this:
> >
> >
> > {
> >   name: ‘view-name’,  (optional; auto-generated if omitted)
> >   index: {
> >     fields: [‘fieldA’, ‘fieldB’]
> >   },
> >   partial_filter_selector: {}  (optional)
> > }
> >
> >
> > For query planning we need to be able to access the list of available
> > indexes. So we would have an index_definitions subspace with the following
> > content:
> >
> >
> > (<fieldname1>, …<rest of fields>) = (<index_name>,
> > <partial_filter_selector>)
> >
> >
> > Otherwise we could just store the index definitions as:
> >
> > (index_name) = ((fields), partial_filter_selector).
> >
> >
> > At this stage, I can’t think of a fancy way of storing the index
> > definitions so that when we need to select an index for a query there
> > would be a fast way to fetch only a subset of the indexes. I think it is
> > best to fetch them all, as we currently do, and process them. However,
> > we can look at caching these index definitions in the application layer
> > and using FoundationDB watches[0] to notify us when a definition has
> > changed so we can update the cached definitions.
> >
> >
> > Then each index definition will have its own dedicated subspace for the
> > actual built index key/values. Keys in this subspace would be the fields
> > defined in the index with the doc id at the end of the tuple, e.g. for an
> > index with fields name and age, it would be:
> >
> >
> > (“john”, 40, “doc-id-1”) = null
> >
> > (“mary”, 30, “doc-id-2”) = null
> >
> >
> > This follows the same key format that the document layer[1] uses for its
> > indexes. One point to make here is that the doc id is kept in the key so
> > that two documents with the same field values do not produce duplicate
> > keys.
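
A minimal sketch of the key layout described above, using a sorted Python list of tuples as a stand-in for an ordered index subspace. Python's tuple ordering is only an approximation of the FoundationDB tuple layer's ordering, and `add_entry` / `doc_ids_for_name` are hypothetical helper names, not part of any proposed API.

```python
import bisect

# Stand-in for one index subspace: sorted keys of (name, age, doc_id).
# The values are all null in the proposal, so only keys are kept here.
index = []


def add_entry(name, age, doc_id):
    """Insert a key while keeping the list sorted, like an ordered KV store."""
    bisect.insort(index, (name, age, doc_id))


def doc_ids_for_name(name):
    """Prefix scan: collect doc ids for every key whose first field is name."""
    start = bisect.bisect_left(index, (name,))
    found = []
    for key in index[start:]:
        if key[0] != name:
            break
        found.append(key[2])
    return found


add_entry("john", 40, "doc-id-1")
add_entry("mary", 30, "doc-id-2")
add_entry("john", 40, "doc-id-3")  # same fields as doc-id-1, distinct key
johns = doc_ids_for_name("john")
```

Because the doc id is the last tuple element, the two "john" documents occupy separate keys and both come back from the prefix scan.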
> >
> >
> > Then in terms of sorting the keys, CouchDB currently uses ICU to sort all
> > secondary indexes. We would need to use ICU to sort the indexes for FDB,
> > but we would have to do it differently. We will not be able to use ICU
> > collation operations directly; instead, we are going to have to look at
> > using ICU’s sort key[2] to generate a sort key ahead of time. At the same
> > time we need to look at creating a binary encoding to capture the way
> > that CouchDB currently sorts objects, arrays and numbers. This would most
> > likely be a sort of key prefix that we add to each key field along with
> > the sort key generated from ICU.
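
One way the type-prefix idea could look, as a rough sketch only. The tag values, `encode_number`, `encode_field` and `toy_sort_key` are all hypothetical; `toy_sort_key` is a placeholder and does not perform ICU collation. The type order (null, false, true, numbers, strings) follows CouchDB's view collation; arrays and objects are left out to keep the example short.

```python
import struct

# Hypothetical one-byte type tags, ordered like CouchDB view collation.
TYPE_NULL, TYPE_FALSE, TYPE_TRUE, TYPE_NUMBER, TYPE_STRING = 1, 2, 3, 4, 5


def encode_number(n):
    """Encode a number as 8 bytes whose byte order matches numeric order."""
    bits = struct.unpack(">Q", struct.pack(">d", float(n)))[0]
    if bits & (1 << 63):
        bits ^= 0xFFFFFFFFFFFFFFFF  # negative: flip every bit
    else:
        bits |= 1 << 63             # non-negative: set the sign bit
    return struct.pack(">Q", bits)


def toy_sort_key(text):
    """Placeholder for an ICU sort key; NOT real ICU collation."""
    return text.casefold().encode("utf-8")


def encode_field(value):
    """Prefix each field with a type tag so types sort before content."""
    if value is None:
        return bytes([TYPE_NULL])
    if value is False:
        return bytes([TYPE_FALSE])
    if value is True:
        return bytes([TYPE_TRUE])
    if isinstance(value, (int, float)):
        return bytes([TYPE_NUMBER]) + encode_number(value)
    if isinstance(value, str):
        return bytes([TYPE_STRING]) + toy_sort_key(value)
    raise TypeError("arrays and objects are not covered by this sketch")


values = ["b", 40, None, True, -1.5, "a", False, 3]
ordered = sorted(values, key=encode_field)
```

Sorting by the encoded bytes alone reproduces the cross-type order, which is the property a byte-ordered store such as FDB would rely on.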
> >
> >
> > In terms of keeping Mango indexes up to date, we should be able to update
> > all existing indexes in the same transaction as a document is
> > updated/created. This means we shouldn’t need a background process to
> > keep Mango indexes updated, though I imagine we are going to have to look
> > at a background process that does update and build new indexes on an
> > existing index. We will have to do some decent performance testing around
> > this to determine the best solution, but looking at the document layer,
> > they seem to recommend updating the indexes in the transaction rather
> > than in a background process.
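
A rough sketch of what updating an index alongside the document write might involve, with plain dicts standing in for the document store and one index subspace. In FoundationDB the body of `save_doc` would run inside a single transaction; here that is only simulated. The names `save_doc`, `index_key` and `INDEX_FIELDS` are hypothetical, and skipping documents that lack an indexed field is an assumption about the desired behaviour.

```python
docs = {}   # stand-in for the primary document store: doc_id -> doc
index = {}  # stand-in for one index subspace: key tuple -> None

INDEX_FIELDS = ("name", "age")  # hypothetical index definition


def index_key(doc_id, doc):
    """Build (field values..., doc_id), or None if a field is missing."""
    if not all(field in doc for field in INDEX_FIELDS):
        return None
    return tuple(doc[field] for field in INDEX_FIELDS) + (doc_id,)


def save_doc(doc_id, new_doc):
    """Write the doc and keep the index in step (one transaction in FDB)."""
    old_doc = docs.get(doc_id)
    if old_doc is not None:
        old_key = index_key(doc_id, old_doc)
        if old_key is not None:
            index.pop(old_key, None)  # clear the entry for the old values
    docs[doc_id] = new_doc
    new_key = index_key(doc_id, new_doc)
    if new_key is not None:
        index[new_key] = None


save_doc("doc-id-1", {"name": "john", "age": 40})
save_doc("doc-id-1", {"name": "john", "age": 41})
```

Reading the previous revision in the same step is what lets the old entry be removed, so the index never holds both the old and new values for a document.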
> >
> >
> > In the future, we could look at using the value space to store covering
> > indexes or materialized views. That way we would not always need to read
> > from the by_id when querying with Mango, which would be a nice
> > performance improvement.
> >
> >
> >
> > Please let me know any thoughts, improvements, suggestions or questions
> > around this.
> >
> >
> >
> > [0] https://apple.github.io/foundationdb/features.html#watches
> >
> > [1] https://github.com/FoundationDB/fdb-document-layer
> >
> > [2] http://userguide.icu-project.org/collation/api#TOC-Sort-Key-Features
>
>
