Re: [DISCUSS] CEP-7 Storage Attached Index

Jasonstack Zhao Yang Thu, 24 Sep 2020 00:47:39 -0700

>> Question is: is this planned as a next step?
>> If yes, how are we going to mark SAI as experimental until it gets
>> row offsets? Also, it is likely that index format is going to change when
>> row offsets are added, so my concern is that we may have to support two
>> versions of a format for a smooth migration.


The goal is to support a row-level index by the time SAI is merged. I will
update the CEP to reflect this.

>> I think switching to row
>> offsets also has a huge impact on interaction with SPRC and has some
>> potential for optimisations.

Can you share more details on the optimizations?



On Thu, 24 Sep 2020 at 15:20, Oleksandr Petrov <oleksandr.pet...@gmail.com>
wrote:

> > But for improving overall index read performance, I think improving base
> > table read perf (because SAI/SASI executes LOTS of
> > SinglePartitionReadCommand after searching on-disk index) is more effective
> > than switching from Trie to Prefix BTree.
>
> I haven't suggested switching to Prefix B-Tree or any other structure, the
> question was about rationale and motivation of picking one over the other,
> which I am curious about for personal reasons/interests that lie outside of
> Cassandra. Having this listed in CEP could have been helpful for future
> guidance. It's ok if this question is outside of the CEP scope.
>
> I also agree that there are many areas that require improvement around the
> read/write path and 2i, many of which (even outside of base table format or
> read perf) can yield positive performance results.
>
> > FWIW, I personally look forward to receiving that contribution when the
> > time is right.
>
> I am very excited for this contribution, too, and it looks like very solid
> work.
>
> I have one more question, about "Upon resolving partition keys, rows are
> loaded using Cassandra’s internal partition read command across SSTables
> and are post filtered". One of the criticisms of SASI and reasons for
> marking it as experimental was CASSANDRA-11990. I think switching to row
> offsets also has a huge impact on interaction with SPRC and has some
> potential for optimisations. Question is: is this planned as a next step?
> If yes, how are we going to mark SAI as experimental until it gets
> row offsets? Also, it is likely that index format is going to change when
> row offsets are added, so my concern is that we may have to support two
> versions of a format for a smooth migration.
>
>
>
> On Thu, Sep 24, 2020 at 6:53 AM Jasonstack Zhao Yang <
> jasonstack.z...@gmail.com> wrote:
>
> > >> I think CEP should be more upfront with "eventually replace
> > >> it" bit, since it raises the question about what the people who are
> > >> using other index implementations can expect.
> >
> > Will update the CEP to emphasize: SAI will replace other indexes.
> >
> > >> Unfortunately, I do not have an
> > >> implementation sitting around for a direct comparison, but I can imagine
> > >> situations when B-Trees may perform better because of simpler
> > >> construction.
> > >> Maybe we should even consider prototyping a prefix B-Tree to have a more
> > >> fair comparison.
> >
> > As long as prefix BTree supports range/prefix aggregation (which is used to
> > speed up range/prefix query when matching entire subtree), we can plug it
> > in and compare. It won't affect the CEP design which focuses on sharing
> > data across indexes and posting aggregation.
> >
> > But for improving overall index read performance, I think improving base
> > table read perf (because SAI/SASI executes LOTS of
> > SinglePartitionReadCommand after searching on-disk index) is more
> > effective than switching from Trie to Prefix BTree.
> >
> >
> >
> > On Thu, 24 Sep 2020 at 05:33, Benedict Elliott Smith <bened...@apache.org>
> > wrote:
> >
> > > FWIW, I personally look forward to receiving that contribution when the
> > > time is right.
> > >
> > > On 23/09/2020, 18:45, "Josh McKenzie" <jmcken...@apache.org> wrote:
> > >
> > >     talking about that would involve some bits of information DataStax
> > >     might not be ready to share?
> > >
> > >     At the risk of derailing, I've been poking and prodding this week at
> > >     we contributors at DS getting our act together w/a draft CEP for
> > >     donating the trie-based indices to the ASF project.
> > >
> > >     More to come; the intention is certainly to contribute that code. The
> > >     lack of a destination to merge it into (i.e. no 5.0-dev branch) is
> > >     removing significant urgency from the process as well (not to open a
> > >     3rd Pandora's box), but there's certainly an interrelatedness to the
> > >     conversations going on.
> > >
> > >     ---
> > >     Josh McKenzie
> > >
> > >
> > >     On Wed, Sep 23, 2020 at 12:48 PM, Caleb Rackliffe
> > >     <calebrackli...@gmail.com> wrote:
> > >
> > >     > As long as we can construct the on-disk indexes efficiently/directly
> > >     > from a Memtable-attached index on flush, there's room to try other
> > >     > data structures. Most of the innovation in SAI is around the layout
> > >     > of postings (something we can expand on if people are interested)
> > >     > and having a natively row-oriented design that scales w/ multiple
> > >     > indexed columns on single SSTables. There are some broader
> > >     > implications of using the trie that reach outside SAI itself, but
> > >     > talking about that would involve some bits of information DataStax
> > >     > might not be ready to share?
> > >     >
> > >     > On Wed, Sep 23, 2020 at 11:00 AM Jeremiah D Jordan
> > >     > <jeremiah.jordan@gmail.com> wrote:
> > >     >
> > >     > Short question: looking forward, how are we going to maintain three
> > >     > 2i implementations: SASI, SAI, and 2i?
> > >     >
> > >     > I think one of the goals stated in the CEP is for SAI to have parity
> > >     > with 2i such that it could eventually replace it.
> > >     >
> > >     > On Sep 23, 2020, at 10:34 AM, Oleksandr Petrov
> > >     > <oleksandr.pet...@gmail.com> wrote:
> > >     >
> > >     > Short question: looking forward, how are we going to maintain three
> > >     > 2i implementations: SASI, SAI, and 2i?
> > >     >
> > >     > Another thing I think this CEP is missing is rationale and motivation
> > >     > about why trie-based indexes were chosen over, say, B-Tree. We did
> > >     > have a short discussion about this on Slack, but both arguments that
> > >     > I've heard (space-saving and keeping a small subset of nodes in
> > >     > memory) work only for the most primitive implementation of a B-Tree.
> > >     > Fully-occupied prefix B-Tree can have similar properties. There's
> > >     > been a lot of research on B-Trees and optimisations in those.
> > >     > Unfortunately, I do not have an implementation sitting around for a
> > >     > direct comparison, but I can imagine situations when B-Trees may
> > >     > perform better because of simpler construction.
> > >     >
> > >     > Maybe we should even consider prototyping a prefix B-Tree to have a
> > >     > more fair comparison.
> > >     >
> > >     > Thank you,
> > >     > -- Alex
> > >     >
> > >     > On Thu, Sep 10, 2020 at 9:12 AM Jasonstack Zhao Yang
> > >     > <jasonstack.zhao@gmail.com> wrote:
> > >     >
> > >     > Thank you Patrick for hosting Cassandra Contributor Meeting for CEP-7
> > >     > SAI.
> > >     >
> > >     > The recorded video is available here:
> > >     >
> > >     > https://cwiki.apache.org/confluence/display/CASSANDRA/2020-09-01+Apache+Cassandra+Contributor+Meeting
> > >     >
> > >     > On Tue, 1 Sep 2020 at 14:34, Jasonstack Zhao Yang
> > >     > <jasonstack.zhao@gmail.com> wrote:
> > >     >
> > >     > Thank you, Charles and Patrick
> > >     >
> > >     > On Tue, 1 Sep 2020 at 04:56, Charles Cao <caohair...@gmail.com> wrote:
> > >     >
> > >     > Thank you, Patrick!
> > >     >
> > >     > On Mon, Aug 31, 2020 at 12:59 PM Patrick McFadin <pmcfa...@gmail.com>
> > >     > wrote:
> > >     >
> > >     > I just moved it to 8AM for this meeting to better accommodate APAC.
> > >     > Please see the update here:
> > >     >
> > >     > https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
> > >     >
> > >     > Patrick
> > >     >
> > >     > On Mon, Aug 31, 2020 at 10:04 AM Charles Cao <caohair...@gmail.com>
> > >     > wrote:
> > >     >
> > >     > Patrick,
> > >     >
> > >     > 11AM PST is a bad time for the people in the APAC timezone. Can we
> > >     > move it to 7 or 8AM PST in the morning to accommodate their needs?
> > >     >
> > >     > ~Charles
> > >     >
> > >     > On Fri, Aug 28, 2020 at 4:37 PM Patrick McFadin <pmcfa...@gmail.com>
> > >     > wrote:
> > >     >
> > >     > Meeting scheduled.
> > >     >
> > >     > https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
> > >     >
> > >     > Tuesday September 1st, 11AM PST. I added a basic bullet for the
> > >     > agenda but if there is more, edit away.
> > >     >
> > >     > Patrick
> > >     >
> > >     > On Thu, Aug 27, 2020 at 11:31 AM Jasonstack Zhao Yang
> > >     > <jasonstack.zhao@gmail.com> wrote:
> > >     >
> > >     > +1
> > >     >
> > >     > On Thu, 27 Aug 2020 at 04:52, Ekaterina Dimitrova
> > >     > <e.dimitr...@gmail.com> wrote:
> > >     >
> > >     > +1
> > >     >
> > >     > On Wed, 26 Aug 2020 at 16:48, Caleb Rackliffe
> > >     > <calebrackli...@gmail.com> wrote:
> > >     >
> > >     > +1
> > >     >
> > >     > On Wed, Aug 26, 2020, 3:45 PM Patrick McFadin <pmcfa...@gmail.com>
> > >     > wrote:
> > >     >
> > >     > This is related to the discussion Jordan and I had about the
> > >     > contributor Zoom call. Instead of open mic for any issue, call it
> > >     > based on a discussion thread or threads for higher bandwidth
> > >     > discussion.
> > >     >
> > >     > I would be happy to schedule one for next week to specifically
> > >     > discuss CEP-7. I can attach the recorded call to the CEP after.
> > >     >
> > >     > +1 or -1?
> > >     >
> > >     > Patrick
> > >     >
> > >     > On Tue, Aug 25, 2020 at 7:03 AM Joshua McKenzie
> > >     > <jmcken...@apache.org> wrote:
> > >     >
> > >     > Does community plan to open another discussion or CEP on
> > >     > modularization?
> > >     >
> > >     > We probably should have a discussion on the ML or monthly contrib
> > >     > call about it first to see how aligned the interested contributors
> > >     > are. Could do that through CEP as well but CEP's (at least thus far
> > >     > sans k8s operator) tend to start with a strong, deeply thought out
> > >     > point of view being expressed.
> > >     >
> > >     > On Tue, Aug 25, 2020 at 3:26 AM Jasonstack Zhao Yang
> > >     > <jasonstack.z...@gmail.com> wrote:
> > >     >
> > >     > SASI's performance, specifically the search in the B+ tree component,
> > >     > depends a lot on the component file's header being available in the
> > >     > pagecache. SASI benefits from (needs) nodes with lots of RAM. Is SAI
> > >     > bound to this same or similar limitation?
> > >     >
> > >     > SAI also benefits from larger memory because SAI puts block info on
> > >     > heap for searching on-disk components and having cross-index files on
> > >     > page cache improves read performance of different indexes on the same
> > >     > table.
> > >     >
> > >     > Flushing of SASI can be CPU+IO intensive, to the point of saturation,
> > >     > pauses, and crashes on the node. SSDs are a must, along with a bit of
> > >     > tuning, just to avoid bringing down your cluster. Beyond reducing
> > >     > space requirements, does SAI improve on these things? Like SASI how
> > >     > does SAI, in its own way, change/narrow the recommendations on node
> > >     > hardware specs?
> > >     >
> > >     > SAI won't crash the node during compaction and requires less CPU/IO.
> > >     >
> > >     > * SAI defines global memory limit for compaction instead of per-index
> > >     > memory limit used by SASI.
> > >     >
> > >     > For example, compactions are running on 10 tables and each has 10
> > >     > indexes. SAI will cap the memory usage with global limit while SASI
> > >     > may use up to 100 * per-index limit.
> > >     >
> > >     > * After flushing in-memory segments to disk, SAI won't merge on-disk
> > >     > segments while SASI attempts to merge them at the end.
> > >     >
> > >     > There are pros and cons of not merging segments:
> > >     >
> > >     > ** Pros: compaction runs faster and requires fewer resources.
> > >     >
> > >     > ** Cons: small segments reduce compression ratio.
> > >     >
> > >     > * SAI on-disk format with row ids compresses better.
> > >     >
> > >     > I understand the desire in keeping out of scope the longer term
> > >     > deprecation and migration plan, but… if SASI provides functionality
> > >     > that SAI doesn't, like tokenisation and DelimiterAnalyzer, yet
> > >     > introduces a body of code ~somewhat similar, shouldn't we be roughly
> > >     > sketching out how to reduce the maintenance surface area?
> > >     >
> > >     > Agreed that we should reduce maintenance area if possible, but only
> > >     > very limited code base (eg. RangeIterator, QueryPlan) can be shared.
> > >     > The rest of the code base is quite different because of on-disk
> > >     > format and cross-index files.
> > >     >
> > >     > The goal of this CEP is to get community buy-in on SAI's design.
> > >     > Tokenization, DelimiterAnalyzer should be straightforward to
> > >     > implement on top of SAI.
> > >     >
> > >     > Can we list what configurations of SASI will become deprecated once
> > >     > SAI becomes non-experimental?
> > >     >
> > >     > Except for "Like", "Tokenisation", "DelimiterAnalyzer", the rest of
> > >     > SASI can be replaced by SAI.
> > >     >
> > >     > Given a few bugs are open against 2i and SASI, can we provide some
> > >     > overview, or rough indication, of how many of them we could "triage
> > >     > away"?
> > >     >
> > >     > I believe most of the known bugs in 2i/SASI either have been
> > >     > addressed in SAI or don't apply to SAI.
> > >     >
> > >     > And, is it time for the project to start introducing new SPI
> > >     > implementations as separate sub-modules and jar files that are only
> > >     > loaded at runtime based on configuration settings? (sorry for the
> > >     > conflation on this one, but maybe it's the right time to raise it
> > >     > :shrug:)
> > >     >
> > >     > Agreed that modularization is the way to go and will speed up module
> > >     > development speed.
> > >     >
> > >     > Does community plan to open another discussion or CEP on
> > >     > modularization?
> > >     >
> > >     > On Mon, 24 Aug 2020 at 16:43, Mick Semb Wever <m...@apache.org>
> > >     > wrote:
> > >     >
> > >     > Adding to Duy's questions…
> > >     >
> > >     > * Hardware specs
> > >     >
> > >     > SASI's performance, specifically the search in the B+ tree component,
> > >     > depends a lot on the component file's header being available in the
> > >     > pagecache. SASI benefits from (needs) nodes with lots of RAM. Is SAI
> > >     > bound to this same or similar limitation?
> > >     >
> > >     > Flushing of SASI can be CPU+IO intensive, to the point of saturation,
> > >     > pauses, and crashes on the node. SSDs are a must, along with a bit of
> > >     > tuning, just to avoid bringing down your cluster. Beyond reducing
> > >     > space requirements, does SAI improve on these things? Like SASI how
> > >     > does SAI, in its own way, change/narrow the recommendations on node
> > >     > hardware specs?
> > >     >
> > >     > * Code Maintenance
> > >     >
> > >     > I understand the desire in keeping out of scope the longer term
> > >     > deprecation and migration plan, but… if SASI provides functionality
> > >     > that SAI doesn't, like tokenisation and DelimiterAnalyzer, yet
> > >     > introduces a body of code ~somewhat similar, shouldn't we be roughly
> > >     > sketching out how to reduce the maintenance surface area?
> > >     >
> > >     > Can we list what configurations of SASI will become deprecated once
> > >     > SAI becomes non-experimental?
> > >     >
> > >     > Given a few bugs are open against 2i and SASI, can we provide some
> > >     > overview, or rough indication, of how many of them we could "triage
> > >     > away"?
> > >     >
> > >     > And, is it time for the project to start introducing new SPI
> > >     > implementations as separate sub-modules and jar files that are only
> > >     > loaded at runtime based on configuration settings? (sorry for the
> > >     > conflation on this one, but maybe it's the right time to raise it
> > >     > :shrug:)
> > >     >
> > >     > regards,
> > >     >
> > >     > Mick
> > >     >
> > >     > On Tue, 18 Aug 2020 at 13:05, DuyHai Doan <doanduy...@gmail.com>
> > >     > wrote:
> > >     >
> > >     > Thank you Zhao Yang for starting this topic
> > >     >
> > >     > After reading the short design doc, I have a few questions
> > >     >
> > >     > 1) SASI was pretty inefficient indexing wide partitions because the
> > >     > index structure only retains the partition token, not the clustering
> > >     > columns. As per design doc SAI has row id mapping to partition
> > >     > offset, can we hope that indexing wide partition will be more
> > >     > efficient with SAI? One detail that worries me is that in the
> > >     > beginning of the design doc, it is said that the matching rows are
> > >     > post filtered while scanning the partition. Can you confirm or
> > >     > refute that SAI is efficient with wide partitions and provides the
> > >     > partition offsets to the matching rows?
> > >     >
> > >     > 2) About space efficiency, one of the biggest drawbacks of SASI was
> > >     > the huge space required for index structure when using CONTAINS
> > >     > logic because of the decomposition of text columns into n-grams.
> > >     > Will SAI suffer from the same issue in future iterations? I'm
> > >     > anticipating a bit
> > >     >
> > >     > 3) If I'm querying using SAI and providing complete partition key,
> > >     > will it be more efficient than querying without partition key? In
> > >     > other words, does SAI provide any optimisation when partition key is
> > >     > specified?
> > >     >
> > >     > Regards
> > >     >
> > >     > Duy Hai DOAN
> > >     >
> > >     > On Tue, 18 Aug 2020 at 11:39, Mick Semb Wever <m...@apache.org>
> > >     > wrote:
> > >     >
> > >     > We are looking forward to the community's feedback and suggestions.
> > >     >
> > >     > What comes immediately to mind is testing requirements. It has been
> > >     > mentioned already that the project's testability and QA guidelines
> > >     > are inadequate to successfully introduce new features and
> > >     > refactorings to the codebase. During the 4.0 beta phase this was
> > >     > intended to be addressed, i.e. defining more specific QA guidelines
> > >     > for 4.0-rc. This would be an important step towards QA guidelines
> > >     > for all changes and CEPs post-4.0.
> > >     >
> > >     > Questions from me
> > >     >
> > >     > - How will this be tested, how will its QA status and lifecycle be
> > >     > defined? (per above)
> > >     >
> > >     > - With existing C* code needing to be changed, what is the proposed
> > >     > plan for making those changes ensuring maintained QA, e.g. is there
> > >     > separate QA cycles planned for altering the SPI before adding a new
> > >     > SPI implementation?
> > >     >
> > >     > - Despite being out of scope, it would be nice to have some idea from
> > >     > the CEP author of when users might still choose afresh 2i or SASI
> > >     > over SAI.
> > >     >
> > >     > - Who fills the roles involved? Who are the contributors in this
> > >     > DataStax team? Who is the shepherd? Are there other stakeholders
> > >     > willing to be involved?
> > >     >
> > >     > - Is there a preference to use gdoc instead of the project's wiki,
> > >     > and why? (the CEP process suggests a wiki page, and feedback on why
> > >     > another approach is considered better helps evolve the CEP process
> > >     > itself)
> > >     >
> > >     > cheers,
> > >     >
> > >     > Mick
> > >     >
> > >     >
> > >     > ---------------------------------------------------------------------
> > >     > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> > >     > For additional commands, e-mail: dev-h...@cassandra.apache.org
> > >     >
> > >     >
> > >     > --
> > >     > alex p
> > >     >
> > >     >
> > >
> > >
> > >
> > >
> >
>
>
> --
> alex p
>
