Re: [DISCUSS] CEP-7 Storage Attached Index

Patrick McFadin Wed, 26 Aug 2020 13:45:29 -0700

This is related to the discussion Jordan and I had about the contributor
Zoom call: instead of an open mic for any issue, schedule the call around a
specific discussion thread or threads, for higher-bandwidth discussion.


I would be happy to schedule one for next week to specifically discuss
CEP-7. I can attach the recorded call to the CEP afterward.

+1 or -1?

Patrick

On Tue, Aug 25, 2020 at 7:03 AM Joshua McKenzie <jmcken...@apache.org>
wrote:

> >
> > Does the community plan to open another discussion or CEP on modularization?
>
> We probably should have a discussion on the ML or monthly contrib call
> about it first to see how aligned the interested contributors are. We could
> do that through a CEP as well, but CEPs (at least thus far, sans the k8s
> operator) tend to start with a strong, deeply thought-out point of view
> being expressed.
>
> On Tue, Aug 25, 2020 at 3:26 AM Jasonstack Zhao Yang <
> jasonstack.z...@gmail.com> wrote:
>
> > >>> SASI's performance, specifically the search in the B+ tree component,
> > >>> depends a lot on the component file's header being available in the
> > >>> pagecache. SASI benefits from (needs) nodes with lots of RAM. Is SAI
> > >>> bound to this same or similar limitation?
> >
> > SAI also benefits from more memory: it keeps block info on heap for
> > searching on-disk components, and having cross-index files in the page
> > cache improves read performance of different indexes on the same table.
> >
> >
> > >>> Flushing of SASI can be CPU+IO intensive, to the point of saturation,
> > >>> pauses, and crashes on the node. SSDs are a must, along with a bit of
> > >>> tuning, just to avoid bringing down your cluster. Beyond reducing
> > >>> space requirements, does SAI improve on these things? Like SASI, how
> > >>> does SAI, in its own way, change/narrow the recommendations on node
> > >>> hardware specs?
> >
> > SAI won't crash the node during compaction and requires less CPU/IO.
> >
> > * SAI defines a global memory limit for compaction instead of the
> >   per-index memory limit used by SASI.
> >   For example, if compactions are running on 10 tables and each has 10
> >   indexes, SAI will cap memory usage at the global limit, while SASI may
> >   use up to 100 * the per-index limit.
> >
> > * After flushing in-memory segments to disk, SAI won't merge on-disk
> >   segments, while SASI attempts to merge them at the end.
> >
> >   There are pros and cons of not merging segments:
> >     ** Pros: compaction runs faster and requires fewer resources.
> >     ** Cons: small segments reduce compression ratio.
> >
> > * SAI's on-disk format with row ids compresses better.
> >
> >
> > >>> I understand the desire in keeping out of scope the longer term
> > >>> deprecation and migration plan, but… if SASI provides functionality
> > >>> that SAI doesn't, like tokenisation and DelimiterAnalyzer, yet
> > >>> introduces a body of code ~somewhat similar, shouldn't we be roughly
> > >>> sketching out how to reduce the maintenance surface area?
> >
> > Agreed that we should reduce the maintenance area if possible, but only a
> > very limited part of the code base (e.g. RangeIterator, QueryPlan) can be
> > shared. The rest of the code base is quite different because of the
> > on-disk format and cross-index files.
> >
> > The goal of this CEP is to get community buy-in on SAI's design.
> > Tokenization and DelimiterAnalyzer should be straightforward to implement
> > on top of SAI.
> >
> > >>> Can we list what configurations of SASI will become deprecated once
> > >>> SAI becomes non-experimental?
> >
> > Except for "Like", "Tokenisation", and "DelimiterAnalyzer", the rest of
> > SASI can be replaced by SAI.
> >
> > >>> Given a few bugs are open against 2i and SASI, can we provide some
> > >>> overview, or rough indication, of how many of them we could "triage
> > >>> away"?
> >
> > I believe most of the known bugs in 2i/SASI have either been addressed in
> > SAI or don't apply to SAI.
> >
> > >>> And, is it time for the project to start introducing new SPI
> > >>> implementations as separate sub-modules and jar files that are only
> > >>> loaded at runtime based on configuration settings? (sorry for the
> > >>> conflation on this one, but maybe it's the right time to raise it
> > >>> :shrug:)
> >
> > Agreed that modularization is the way to go and will speed up module
> > development.
> >
> > Does the community plan to open another discussion or CEP on modularization?
> >
> >
> > On Mon, 24 Aug 2020 at 16:43, Mick Semb Wever <m...@apache.org> wrote:
> >
> > > Adding to Duy's questions…
> > >
> > >
> > > * Hardware specs
> > >
> > > SASI's performance, specifically the search in the B+ tree component,
> > > depends a lot on the component file's header being available in the
> > > pagecache. SASI benefits from (needs) nodes with lots of RAM. Is SAI
> > > bound to this same or similar limitation?
> > >
> > > Flushing of SASI can be CPU+IO intensive, to the point of saturation,
> > > pauses, and crashes on the node. SSDs are a must, along with a bit of
> > > tuning, just to avoid bringing down your cluster. Beyond reducing space
> > > requirements, does SAI improve on these things? Like SASI, how does SAI,
> > > in its own way, change/narrow the recommendations on node hardware specs?
> > >
> > >
> > > * Code Maintenance
> > >
> > > I understand the desire in keeping out of scope the longer term
> > > deprecation and migration plan, but… if SASI provides functionality that
> > > SAI doesn't, like tokenisation and DelimiterAnalyzer, yet introduces a
> > > body of code ~somewhat similar, shouldn't we be roughly sketching out how
> > > to reduce the maintenance surface area?
> > >
> > > Can we list what configurations of SASI will become deprecated once SAI
> > > becomes non-experimental?
> > >
> > > Given a few bugs are open against 2i and SASI, can we provide some
> > > overview, or rough indication, of how many of them we could "triage
> > > away"?
> > >
> > > And, is it time for the project to start introducing new SPI
> > > implementations as separate sub-modules and jar files that are only
> > > loaded at runtime based on configuration settings? (sorry for the
> > > conflation on this one, but maybe it's the right time to raise it :shrug:)
> > >
> > > regards,
> > > Mick
> > >
> > >
> > > On Tue, 18 Aug 2020 at 13:05, DuyHai Doan <doanduy...@gmail.com> wrote:
> > >
> > > > Thank you Zhao Yang for starting this topic
> > > >
> > > > After reading the short design doc, I have a few questions
> > > >
> > > > 1) SASI was pretty inefficient at indexing wide partitions because the
> > > > index structure only retains the partition token, not the clustering
> > > > columns. As per the design doc, SAI has a row id mapping to partition
> > > > offset; can we hope that indexing wide partitions will be more
> > > > efficient with SAI? One detail that worries me is that at the
> > > > beginning of the design doc, it is said that the matching rows are
> > > > post-filtered while scanning the partition. Can you confirm or refute
> > > > that SAI is efficient with wide partitions and provides the partition
> > > > offsets to the matching rows?
> > > >
> > > > 2) About space efficiency, one of the biggest drawbacks of SASI was
> > > > the huge space required for the index structure when using CONTAINS
> > > > logic, because of the decomposition of text columns into n-grams. Will
> > > > SAI suffer from the same issue in future iterations? I'm anticipating
> > > > a bit.
> > > >
> > > > 3) If I'm querying using SAI and providing the complete partition key,
> > > > will it be more efficient than querying without the partition key? In
> > > > other words, does SAI provide any optimisation when the partition key
> > > > is specified?
> > > >
> > > > Regards
> > > >
> > > > Duy Hai DOAN
> > > >
> > > > On Tue, 18 Aug 2020 at 11:39, Mick Semb Wever <m...@apache.org> wrote:
> > > >
> > > > > >
> > > > > > We are looking forward to the community's feedback and
> > > > > > suggestions.
> > > > > >
> > > > >
> > > > >
> > > > > What comes immediately to mind is testing requirements. It has been
> > > > > mentioned already that the project's testability and QA guidelines
> > > > > are inadequate to successfully introduce new features and
> > > > > refactorings to the codebase. During the 4.0 beta phase this was
> > > > > intended to be addressed, i.e. defining more specific QA guidelines
> > > > > for 4.0-rc. This would be an important step towards QA guidelines
> > > > > for all changes and CEPs post-4.0.
> > > > >
> > > > > Questions from me
> > > > >  - How will this be tested, how will its QA status and lifecycle be
> > > > >    defined? (per above)
> > > > >  - With existing C* code needing to be changed, what is the proposed
> > > > >    plan for making those changes ensuring maintained QA, e.g. are
> > > > >    there separate QA cycles planned for altering the SPI before
> > > > >    adding a new SPI implementation?
> > > > >  - Despite being out of scope, it would be nice to have some idea
> > > > >    from the CEP author of when users might still choose afresh 2i or
> > > > >    SASI over SAI.
> > > > >  - Who fills the roles involved? Who are the contributors in this
> > > > >    DataStax team? Who is the shepherd? Are there other stakeholders
> > > > >    willing to be involved?
> > > > >  - Is there a preference to use gdoc instead of the project's wiki,
> > > > >    and why? (the CEP process suggests a wiki page, and feedback on
> > > > >    why another approach is considered better helps evolve the CEP
> > > > >    process itself)
> > > > >
> > > > > cheers,
> > > > > Mick
> > > > >
> > > >
> > >
> >
>
