Re: [DISCUSS] CEP-7 Storage Attached Index

Mick Semb Wever Mon, 24 Aug 2020 01:43:42 -0700

Adding to Duy's questions…

* Hardware specs

SASI's performance, specifically the search in the B+ tree component,
depends a lot on the component file's header being available in the
pagecache. SASI benefits from (needs) nodes with lots of RAM. Is SAI bound
to this same or similar limitation?

Flushing of SASI can be CPU+IO intensive, to the point of saturation,
pauses, and crashes on the node. SSDs are a must, along with a bit of
tuning, just to avoid bringing down your cluster. Beyond reducing space
requirements, does SAI improve on these things? Like SASI, how does SAI, in
its own way, change/narrow the recommendations on node hardware specs?

* Code Maintenance

I understand the desire to keep out of scope the longer term deprecation
and migration plan, but… if SASI provides functionality that SAI doesn't,
like tokenisation and DelimiterAnalyzer, yet introduces a body of code
~somewhat similar, shouldn't we be roughly sketching out how to reduce the
maintenance surface area?

Can we list what configurations of SASI will become deprecated once SAI
becomes non-experimental?

Given a few bugs are open against 2i and SASI, can we provide some
overview, or rough indication, of how many of them we could "triage away"?

And, is it time for the project to start introducing new SPI
implementations as separate sub-modules and jar files that are only loaded
at runtime based on configuration settings? (sorry for the conflation on
this one, but maybe it's the right time to raise it :shrug:)

regards,
Mick

On Tue, 18 Aug 2020 at 13:05, DuyHai Doan <[email protected]> wrote:

> Thank you Zhao Yang for starting this topic
>
> After reading the short design doc, I have a few questions
>
> 1) SASI was pretty inefficient at indexing wide partitions because the index
> structure only retains the partition token, not the clustering columns. As
> per the design doc, SAI has a row id mapping to partition offset; can we hope
> that indexing wide partitions will be more efficient with SAI? One detail that
> worries me is that at the beginning of the design doc, it is said that the
> matching rows are post-filtered while scanning the partition. Can you
> confirm or refute that SAI is efficient with wide partitions and provides
> the partition offsets to the matching rows?
>
> 2) About space efficiency, one of the biggest drawbacks of SASI was the huge
> space required for the index structure when using CONTAINS logic because of
> the decomposition of text columns into n-grams. Will SAI suffer from the same
> issue in future iterations? I'm anticipating a bit.
>
> 3) If I'm querying using SAI and providing the complete partition key, will it
> be more efficient than querying without the partition key? In other words,
> does SAI provide any optimisation when the partition key is specified?
>
> Regards
>
> Duy Hai DOAN
>
> On Tue, 18 Aug 2020 at 11:39, Mick Semb Wever <[email protected]> wrote:
>
> > >
> > > We are looking forward to the community's feedback and suggestions.
> > >
> >
> >
> > What comes immediately to mind is testing requirements. It has been
> > mentioned already that the project's testability and QA guidelines are
> > inadequate to successfully introduce new features and refactorings to the
> > codebase. During the 4.0 beta phase this was intended to be addressed, i.e.
> > defining more specific QA guidelines for 4.0-rc. This would be an important
> > step towards QA guidelines for all changes and CEPs post-4.0.
> >
> > Questions from me
> >  - How will this be tested, how will its QA status and lifecycle be
> > defined? (per above)
> >  - With existing C* code needing to be changed, what is the proposed plan
> > for making those changes ensuring maintained QA, e.g. are there separate QA
> > cycles planned for altering the SPI before adding a new SPI implementation?
> >  - Despite being out of scope, it would be nice to have some idea from the
> > CEP author of when users might still choose afresh 2i or SASI over SAI.
> >  - Who fills the roles involved? Who are the contributors in this DataStax
> > team? Who is the shepherd? Are there other stakeholders willing to be
> > involved?
> >  - Is there a preference to use gdoc instead of the project's wiki, and
> > why? (the CEP process suggests a wiki page, and feedback on why another
> > approach is considered better helps evolve the CEP process itself)
> >
> > cheers,
> > Mick
> >
>
