Re: [DISCUSS] CEP-7 Storage Attached Index

Caleb Rackliffe Tue, 07 Sep 2021 10:40:53 -0700

So this thread stalled almost a year ago. (Wow, time flies when you're
trying to release 4.0.) My synthesis of the conversation to this point is
that while there are some open questions about testing
methodology/"definition of done" and our choice of particular on-disk data
structures, neither of these should be a serious obstacle to moving forward
w/ a vote. Having said that, is there anything left around the CEP that we
feel should prevent it from moving to a vote?


In terms of how we would proceed from the point a vote passes, it seems
like there have been enough concerns around the proposed/necessary breaking
changes to the 2i API that we will start development by introducing
components as incrementally as possible into a long-running feature branch
off trunk. (This work would likely start w/ *CASSANDRA-16092*
<https://issues.apache.org/jira/browse/CASSANDRA-16092>, which we could
resolve as a sub-task of the SAI epic without interfering with other trunk
development likely destined for a 4.x minor, etc.)

On Thu, Sep 24, 2020 at 2:47 AM Jasonstack Zhao Yang <
jasonstack.z...@gmail.com> wrote:

> >> Question is: is this planned as a next step?
> >> If yes, how are we going to mark SAI as experimental until it gets
> >> row offsets? Also, it is likely that index format is going to change
> >> when row offsets are added, so my concern is that we may have to
> >> support two versions of a format for a smooth migration.
>
> The goal is to support row-level index when merging SAI, I will update the
> CEP about it.
>
> >> I think switching to row
> >> offsets also has a huge impact on interaction with SPRC and has some
> >> potential for optimisations.
>
> Can you share more details on the optimizations?
>
>
>
> On Thu, 24 Sep 2020 at 15:20, Oleksandr Petrov
> <oleksandr.pet...@gmail.com> wrote:
>
> > > But for improving overall index read performance, I think improving
> > > base table read perf (because SAI/SASI executes LOTS of
> > > SinglePartitionReadCommand after searching on-disk index) is more
> > > effective than switching from Trie to Prefix BTree.
> >
> > I haven't suggested switching to Prefix B-Tree or any other structure,
> > the question was about rationale and motivation of picking one over the
> > other, which I am curious about for personal reasons/interests that lie
> > outside of Cassandra. Having this listed in CEP could have been helpful
> > for future guidance. It's ok if this question is outside of the CEP
> > scope.
> >
> > I also agree that there are many areas that require improvement around
> > the read/write path and 2i, many of which (even outside of base table
> > format or read perf) can yield positive performance results.
> >
> > > FWIW, I personally look forward to receiving that contribution when
> > > the time is right.
> >
> > I am very excited for this contribution, too, and it looks like very
> > solid work.
> >
> > I have one more question, about "Upon resolving partition keys, rows are
> > loaded using Cassandra’s internal partition read command across SSTables
> > and are post filtered". One of the criticisms of SASI and reasons for
> > marking it as experimental was CASSANDRA-11990. I think switching to row
> > offsets also has a huge impact on interaction with SPRC and has some
> > potential for optimisations. Question is: is this planned as a next step?
> > If yes, how are we going to mark SAI as experimental until it gets
> > row offsets? Also, it is likely that index format is going to change when
> > row offsets are added, so my concern is that we may have to support two
> > versions of a format for a smooth migration.
> >
> >
> >
> > On Thu, Sep 24, 2020 at 6:53 AM Jasonstack Zhao Yang <
> > jasonstack.z...@gmail.com> wrote:
> >
> > > >> I think CEP should be more upfront with "eventually replace
> > > >> it" bit, since it raises the question about what the people who
> > > >> are using other index implementations can expect.
> > >
> > > Will update the CEP to emphasize: SAI will replace other indexes.
> > >
> > > >> Unfortunately, I do not have an
> > > >> implementation sitting around for a direct comparison, but I can
> > > >> imagine situations when B-Trees may perform better because of
> > > >> simpler construction. Maybe we should even consider prototyping a
> > > >> prefix B-Tree to have a more fair comparison.
> > >
> > > As long as prefix BTree supports range/prefix aggregation (which is
> > > used to speed up range/prefix query when matching entire subtree), we
> > > can plug it in and compare. It won't affect the CEP design which
> > > focuses on sharing data across indexes and posting aggregation.
> > >
> > > But for improving overall index read performance, I think improving
> > > base table read perf (because SAI/SASI executes LOTS of
> > > SinglePartitionReadCommand after searching on-disk index) is more
> > > effective than switching from Trie to Prefix BTree.
> > >
> > >
> > >
> > > On Thu, 24 Sep 2020 at 05:33, Benedict Elliott Smith
> > > <bened...@apache.org> wrote:
> > >
> > > > FWIW, I personally look forward to receiving that contribution when
> > > > the time is right.
> > > >
> > > > On 23/09/2020, 18:45, "Josh McKenzie" <jmcken...@apache.org> wrote:
> > > >
> > > >     talking about that would involve some bits of information
> > > >     DataStax might not be ready to share?
> > > >
> > > >     At the risk of derailing, I've been poking and prodding this week
> > > >     at we contributors at DS getting our act together w/a draft CEP
> > > >     for donating the trie-based indices to the ASF project.
> > > >
> > > >     More to come; the intention is certainly to contribute that code.
> > > >     The lack of a destination to merge it into (i.e. no 5.0-dev
> > > >     branch) is removing significant urgency from the process as well
> > > >     (not to open a 3rd Pandora's box), but there's certainly an
> > > >     interrelatedness to the conversations going on.
> > > >
> > > >     ---
> > > >     Josh McKenzie
> > > >
> > > >
> > > >
> > > >     On Wed, Sep 23, 2020 at 12:48 PM, Caleb Rackliffe
> > > >     <calebrackli...@gmail.com> wrote:
> > > >
> > > >     > As long as we can construct the on-disk indexes
> > > >     > efficiently/directly from a Memtable-attached index on flush,
> > > >     > there's room to try other data structures. Most of the
> > > >     > innovation in SAI is around the layout of postings (something
> > > >     > we can expand on if people are interested) and having a
> > > >     > natively row-oriented design that scales w/ multiple indexed
> > > >     > columns on single SSTables. There are some broader implications
> > > >     > of using the trie that reach outside SAI itself, but talking
> > > >     > about that would involve some bits of information DataStax
> > > >     > might not be ready to share?
> > > >     >
> > > >     > On Wed, Sep 23, 2020 at 11:00 AM Jeremiah D Jordan
> > > >     > <jeremiah.jordan@gmail.com> wrote:
> > > >     >
> > > >     > > Short question: looking forward, how are we going to maintain
> > > >     > > three 2i implementations: SASI, SAI, and 2i?
> > > >     >
> > > >     > I think one of the goals stated in the CEP is for SAI to have
> > > >     > parity with 2i such that it could eventually replace it.
> > > >     >
> > > >     > On Sep 23, 2020, at 10:34 AM, Oleksandr Petrov
> > > >     > <oleksandr.pet...@gmail.com> wrote:
> > > >     >
> > > >     > Short question: looking forward, how are we going to maintain
> > > >     > three 2i implementations: SASI, SAI, and 2i?
> > > >     >
> > > >     > Another thing I think this CEP is missing is rationale and
> > > >     > motivation about why trie-based indexes were chosen over, say,
> > > >     > B-Tree. We did have a short discussion about this on Slack, but
> > > >     > both arguments that I've heard (space-saving and keeping a small
> > > >     > subset of nodes in memory) work only for the most primitive
> > > >     > implementation of a B-Tree. Fully-occupied prefix B-Tree can
> > > >     > have similar properties. There's been a lot of research on
> > > >     > B-Trees and optimisations in those. Unfortunately, I do not have
> > > >     > an implementation sitting around for a direct comparison, but I
> > > >     > can imagine situations when B-Trees may perform better because
> > > >     > of simpler construction. Maybe we should even consider
> > > >     > prototyping a prefix B-Tree to have a more fair comparison.
> > > >     >
> > > >     > Thank you,
> > > >     > -- Alex
> > > >     >
> > > >     > On Thu, Sep 10, 2020 at 9:12 AM Jasonstack Zhao Yang
> > > >     > <jasonstack.zhao@gmail.com> wrote:
> > > >     >
> > > >     > Thank you Patrick for hosting Cassandra Contributor Meeting for
> > > >     > CEP-7 SAI.
> > > >     >
> > > >     > The recorded video is available here:
> > > >     >
> > > >     > https://cwiki.apache.org/confluence/display/CASSANDRA/2020-09-01+Apache+Cassandra+Contributor+Meeting
> > > >     >
> > > >     > On Tue, 1 Sep 2020 at 14:34, Jasonstack Zhao Yang
> > > >     > <jasonstack.zhao@gmail.com> wrote:
> > > >     >
> > > >     > Thank you, Charles and Patrick
> > > >     >
> > > >     > On Tue, 1 Sep 2020 at 04:56, Charles Cao <caohair...@gmail.com>
> > > >     > wrote:
> > > >     >
> > > >     > Thank you, Patrick!
> > > >     >
> > > >     > On Mon, Aug 31, 2020 at 12:59 PM Patrick McFadin
> > > >     > <pmcfa...@gmail.com> wrote:
> > > >     >
> > > >     > I just moved it to 8AM for this meeting to better accommodate
> > > >     > APAC. Please see the update here:
> > > >     >
> > > >     > https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
> > > >     >
> > > >     > Patrick
> > > >     >
> > > >     > On Mon, Aug 31, 2020 at 10:04 AM Charles Cao
> > > >     > <caohair...@gmail.com> wrote:
> > > >     >
> > > >     > Patrick,
> > > >     >
> > > >     > 11AM PST is a bad time for the people in the APAC timezone. Can
> > > >     > we move it to 7 or 8AM PST in the morning to accommodate their
> > > >     > needs?
> > > >     >
> > > >     > ~Charles
> > > >     >
> > > >     > On Fri, Aug 28, 2020 at 4:37 PM Patrick McFadin
> > > >     > <pmcfa...@gmail.com> wrote:
> > > >     >
> > > >     > Meeting scheduled.
> > > >     >
> > > >     > https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
> > > >     >
> > > >     > Tuesday September 1st, 11AM PST. I added a basic bullet for the
> > > >     > agenda but if there is more, edit away.
> > > >     >
> > > >     > Patrick
> > > >     >
> > > >     > On Thu, Aug 27, 2020 at 11:31 AM Jasonstack Zhao Yang
> > > >     > <jasonstack.zhao@gmail.com> wrote:
> > > >     >
> > > >     > +1
> > > >     >
> > > >     > On Thu, 27 Aug 2020 at 04:52, Ekaterina Dimitrova
> > > >     > <e.dimitr...@gmail.com> wrote:
> > > >     >
> > > >     > +1
> > > >     >
> > > >     > On Wed, 26 Aug 2020 at 16:48, Caleb Rackliffe
> > > >     > <calebrackli...@gmail.com> wrote:
> > > >     >
> > > >     > +1
> > > >     >
> > > >     > On Wed, Aug 26, 2020, 3:45 PM Patrick McFadin
> > > >     > <pmcfa...@gmail.com> wrote:
> > > >     >
> > > >     > This is related to the discussion Jordan and I had about the
> > > >     > contributor Zoom call. Instead of open mic for any issue, call
> > > >     > it based on a discussion thread or threads for higher bandwidth
> > > >     > discussion.
> > > >     >
> > > >     > I would be happy to schedule one for next week to specifically
> > > >     > discuss CEP-7. I can attach the recorded call to the CEP after.
> > > >     >
> > > >     > +1 or -1?
> > > >     >
> > > >     > Patrick
> > > >     >
> > > >     > On Tue, Aug 25, 2020 at 7:03 AM Joshua McKenzie
> > > >     > <jmcken...@apache.org> wrote:
> > > >     >
> > > >     > > Does community plan to open another discussion or CEP on
> > > >     > > modularization?
> > > >     >
> > > >     > We probably should have a discussion on the ML or monthly
> > > >     > contrib call about it first to see how aligned the interested
> > > >     > contributors are. Could do that through CEP as well but CEP's
> > > >     > (at least thus far sans k8s operator) tend to start with a
> > > >     > strong, deeply thought out point of view being expressed.
> > > >     >
> > > >     > On Tue, Aug 25, 2020 at 3:26 AM Jasonstack Zhao Yang
> > > >     > <jasonstack.z...@gmail.com> wrote:
> > > >     >
> > > >     > > SASI's performance, specifically the search in the B+ tree
> > > >     > > component, depends a lot on the component file's header being
> > > >     > > available in the pagecache. SASI benefits from (needs) nodes
> > > >     > > with lots of RAM. Is SAI bound to this same or similar
> > > >     > > limitation?
> > > >     >
> > > >     > SAI also benefits from larger memory because SAI puts block info
> > > >     > on heap for searching on-disk components and having cross-index
> > > >     > files on page cache improves read performance of different
> > > >     > indexes on the same table.
> > > >     >
> > > >     > > Flushing of SASI can be CPU+IO intensive, to the point of
> > > >     > > saturation, pauses, and crashes on the node. SSDs are a must,
> > > >     > > along with a bit of tuning, just to avoid bringing down your
> > > >     > > cluster. Beyond reducing space requirements, does SAI improve
> > > >     > > on these things? Like SASI how does SAI, in its own way,
> > > >     > > change/narrow the recommendations on node hardware specs?
> > > >     >
> > > >     > SAI won't crash the node during compaction and requires less
> > > >     > CPU/IO.
> > > >     >
> > > >     > * SAI defines global memory limit for compaction instead of
> > > >     > per-index memory limit used by SASI. For example, compactions
> > > >     > are running on 10 tables and each has 10 indexes. SAI will cap
> > > >     > the memory usage with global limit while SASI may use up to
> > > >     > 100 * per-index limit.
> > > >     >
> > > >     > * After flushing in-memory segments to disk, SAI won't merge
> > > >     > on-disk segments while SASI attempts to merge them at the end.
> > > >     > There are pros and cons of not merging segments:
> > > >     > ** Pros: compaction runs faster and requires fewer resources.
> > > >     > ** Cons: small segments reduce compression ratio.
> > > >     >
> > > >     > * SAI on-disk format with row ids compresses better.
> > > >     >
> > > >     > > I understand the desire in keeping out of scope the longer
> > > >     > > term deprecation and migration plan, but… if SASI provides
> > > >     > > functionality that SAI doesn't, like tokenisation and
> > > >     > > DelimiterAnalyzer, yet introduces a body of code ~somewhat
> > > >     > > similar, shouldn't we be roughly sketching out how to reduce
> > > >     > > the maintenance surface area?
> > > >     >
> > > >     > Agreed that we should reduce maintenance area if possible, but
> > > >     > only very limited code base (eg. RangeIterator, QueryPlan) can
> > > >     > be shared. The rest of the code base is quite different because
> > > >     > of on-disk format and cross-index files. The goal of this CEP is
> > > >     > to get community buy-in on SAI's design. Tokenization,
> > > >     > DelimiterAnalyzer should be straightforward to implement on top
> > > >     > of SAI.
> > > >     >
> > > >     > > Can we list what configurations of SASI will become deprecated
> > > >     > > once SAI becomes non-experimental?
> > > >     >
> > > >     > Except for "Like", "Tokenisation", "DelimiterAnalyzer", the rest
> > > >     > of SASI can be replaced by SAI.
> > > >     >
> > > >     > > Given a few bugs are open against 2i and SASI, can we provide
> > > >     > > some overview, or rough indication, of how many of them we
> > > >     > > could "triage away"?
> > > >     >
> > > >     > I believe most of the known bugs in 2i/SASI either have been
> > > >     > addressed in SAI or don't apply to SAI.
> > > >     >
> > > >     > > And, is it time for the project to start introducing new SPI
> > > >     > > implementations as separate sub-modules and jar files that are
> > > >     > > only loaded at runtime based on configuration settings? (sorry
> > > >     > > for the conflation on this one, but maybe it's the right time
> > > >     > > to raise it :shrug:)
> > > >     >
> > > >     > Agreed that modularization is the way to go and will speed up
> > > >     > module development speed.
> > > >     >
> > > >     > Does community plan to open another discussion or CEP on
> > > >     > modularization?
> > > >     >
> > > >     > On Mon, 24 Aug 2020 at 16:43, Mick Semb Wever <m...@apache.org>
> > > >     > wrote:
> > > >     >
> > > >     > Adding to Duy's questions…
> > > >     >
> > > >     > * Hardware specs
> > > >     >
> > > >     > SASI's performance, specifically the search in the B+ tree
> > > >     > component, depends a lot on the component file's header being
> > > >     > available in the pagecache. SASI benefits from (needs) nodes
> > > >     > with lots of RAM. Is SAI bound to this same or similar
> > > >     > limitation?
> > > >     >
> > > >     > Flushing of SASI can be CPU+IO intensive, to the point of
> > > >     > saturation, pauses, and crashes on the node. SSDs are a must,
> > > >     > along with a bit of tuning, just to avoid bringing down your
> > > >     > cluster. Beyond reducing space requirements, does SAI improve on
> > > >     > these things? Like SASI how does SAI, in its own way,
> > > >     > change/narrow the recommendations on node hardware specs?
> > > >     >
> > > >     > * Code Maintenance
> > > >     >
> > > >     > I understand the desire in keeping out of scope the longer term
> > > >     > deprecation and migration plan, but… if SASI provides
> > > >     > functionality that SAI doesn't, like tokenisation and
> > > >     > DelimiterAnalyzer, yet introduces a body of code ~somewhat
> > > >     > similar, shouldn't we be roughly sketching out how to reduce the
> > > >     > maintenance surface area?
> > > >     >
> > > >     > Can we list what configurations of SASI will become deprecated
> > > >     > once SAI becomes non-experimental?
> > > >     >
> > > >     > Given a few bugs are open against 2i and SASI, can we provide
> > > >     > some overview, or rough indication, of how many of them we could
> > > >     > "triage away"?
> > > >     >
> > > >     > And, is it time for the project to start introducing new SPI
> > > >     > implementations as separate sub-modules and jar files that are
> > > >     > only loaded at runtime based on configuration settings? (sorry
> > > >     > for the conflation on this one, but maybe it's the right time to
> > > >     > raise it :shrug:)
> > > >     >
> > > >     > regards,
> > > >     > Mick
> > > >     >
> > > >     > On Tue, 18 Aug 2020 at 13:05, DuyHai Doan <doanduy...@gmail.com>
> > > >     > wrote:
> > > >     >
> > > >     > Thank you Zhao Yang for starting this topic
> > > >     >
> > > >     > After reading the short design doc, I have a few questions
> > > >     >
> > > >     > 1) SASI was pretty inefficient indexing wide partitions because
> > > >     > the index structure only retains the partition token, not the
> > > >     > clustering columns. As per design doc SAI has row id mapping to
> > > >     > partition offset, can we hope that indexing wide partition will
> > > >     > be more efficient with SAI? One detail that worries me is that
> > > >     > in the beginning of the design doc, it is said that the matching
> > > >     > rows are post filtered while scanning the partition. Can you
> > > >     > confirm or deny that SAI is efficient with wide partitions and
> > > >     > provides the partition offsets to the matching rows?
> > > >     >
> > > >     > 2) About space efficiency, one of the biggest drawbacks of SASI
> > > >     > was the huge space required for index structure when using
> > > >     > CONTAINS logic because of the decomposition of text columns into
> > > >     > n-grams. Will SAI suffer from the same issue in future
> > > >     > iterations? I'm anticipating a bit
> > > >     >
> > > >     > 3) If I'm querying using SAI and providing complete partition
> > > >     > key, will it be more efficient than querying without partition
> > > >     > key? In other words, does SAI provide any optimisation when
> > > >     > partition key is specified?
> > > >     >
> > > >     > Regards
> > > >     >
> > > >     > Duy Hai DOAN
> > > >     >
> > > >     > On Tue, 18 Aug 2020 at 11:39, Mick Semb Wever <m...@apache.org>
> > > >     > wrote:
> > > >     >
> > > >     > We are looking forward to the community's feedback and
> > > >     > suggestions.
> > > >     >
> > > >     > What comes immediately to mind is testing requirements. It has
> > > >     > been mentioned already that the project's testability and QA
> > > >     > guidelines are inadequate to successfully introduce new features
> > > >     > and refactorings to the codebase. During the 4.0 beta phase this
> > > >     > was intended to be addressed, i.e. defining more specific QA
> > > >     > guidelines for 4.0-rc. This would be an important step towards
> > > >     > QA guidelines for all changes and CEPs post-4.0.
> > > >     >
> > > >     > Questions from me
> > > >     >
> > > >     > - How will this be tested, how will its QA status and lifecycle
> > > >     > be defined? (per above)
> > > >     >
> > > >     > - With existing C* code needing to be changed, what is the
> > > >     > proposed plan for making those changes ensuring maintained QA,
> > > >     > e.g. are there separate QA cycles planned for altering the SPI
> > > >     > before adding a new SPI implementation?
> > > >     >
> > > >     > - Despite being out of scope, it would be nice to have some idea
> > > >     > from the CEP author of when users might still choose afresh 2i
> > > >     > or SASI over SAI.
> > > >     >
> > > >     > - Who fills the roles involved? Who are the contributors in this
> > > >     > DataStax team? Who is the shepherd? Are there other stakeholders
> > > >     > willing to be involved?
> > > >     >
> > > >     > - Is there a preference to use gdoc instead of the project's
> > > >     > wiki, and why? (the CEP process suggests a wiki page, and
> > > >     > feedback on why another approach is considered better helps
> > > >     > evolve the CEP process itself)
> > > >     >
> > > >     > cheers,
> > > >     > Mick
> > > >     >
> > > >     >
> > > >     > ---------------------------------------------------------------------
> > > >     > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> > > >     > For additional commands, e-mail: dev-h...@cassandra.apache.org
> > > >     >
> > > >     >
> > > >     >
> > > >     > --
> > > >     > alex p
> > > >     >
> > > >     >
> > > >     >
> > > >     >
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> >
> >
> > --
> > alex p
> >
>
