Hi,

Just to keep this thread up to date with development progress, we will be adding row-aware support to SAI in the next few weeks. This is currently going through the final stages of review and testing.

This feature also adds on-disk versioning to SAI. This allows SAI to support multiple on-disk formats during upgrades.

I am mentioning this now because the CEP mentions “Partition Based Iteration” as an initial feature. We will change that to “Row Based Iteration” when the feature is merged.

MikeA

> On 15 Sep 2021, at 19:42, Caleb Rackliffe <[email protected]> wrote:
>
> Hey there,
>
> In the spirit of trying to get as many possible objections to a successful
> vote out of the way, I've added a "Challenges" section to the CEP:
>
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-7%3A+Storage+Attached+Index#CEP7:StorageAttachedIndex-Challenges
>
> Most of you will be familiar with these, but I think we need to be as
> open/candid as possible about the potential risk they pose to SAI's broader
> usability. I've described them from the point of view that they are not
> intractable, but if anyone thinks they are, let's hash that disagreement
> out.
>
> Thanks!
>
> On Thu, Sep 9, 2021 at 11:13 AM Patrick McFadin <[email protected]> wrote:
>
>> +1 on introducing this in an incremental manner and after reading through
>> CASSANDRA-16092 that seems like a perfect place to start. I see that work
>> on that Jira has stopped until direction for CEP-7 has been voted in.
>>
>> I say start the vote and let's get this really valuable developer feature
>> underway.
>>
>> Patrick
>>
>> On Tue, Sep 7, 2021 at 10:40 AM Caleb Rackliffe <[email protected]>
>> wrote:
>>
>>> So this thread stalled almost a year ago. (Wow, time flies when you're
>>> trying to release 4.0.)
>>> My synthesis of the conversation to this point is
>>> that while there are some open questions about testing
>>> methodology/"definition of done" and our choice of particular on-disk data
>>> structures, neither of these should be a serious obstacle to moving forward
>>> w/ a vote. Having said that, is there anything left around the CEP that we
>>> feel should prevent it from moving to a vote?
>>>
>>> In terms of how we would proceed from the point a vote passes, it seems
>>> like there have been enough concerns around the proposed/necessary breaking
>>> changes to the 2i API, that we will start development by introducing
>>> components as incrementally as possible into a long-running feature branch
>>> off trunk. (This work would likely start w/ *CASSANDRA-16092*
>>> <https://issues.apache.org/jira/browse/CASSANDRA-16092>, which we could
>>> resolve as a sub-task of the SAI epic without interfering with other trunk
>>> development likely destined for a 4.x minor, etc.)
>>>
>>> On Thu, Sep 24, 2020 at 2:47 AM Jasonstack Zhao Yang <[email protected]>
>>> wrote:
>>>
>>>>>> Question is: is this planned as a next step?
>>>>>> If yes, how are we going to mark SAI as experimental until it gets
>>>>>> row offsets? Also, it is likely that index format is going to change
>>>>>> when row offsets are added, so my concern is that we may have to
>>>>>> support two versions of a format for a smooth migration.
>>>>
>>>> The goal is to support row-level index when merging SAI, I will update
>>>> the CEP about it.
>>>>
>>>>>> I think switching to row
>>>>>> offsets also has a huge impact on interaction with SPRC and has some
>>>>>> potential for optimisations.
>>>>
>>>> Can you share more details on the optimizations?
>>>>
>>>> On Thu, 24 Sep 2020 at 15:20, Oleksandr Petrov <[email protected]>
>>>> wrote:
>>>>
>>>>>> But for improving overall index read performance, I think improving
>>>>>> base table read perf (because SAI/SASI executes LOTS of
>>>>>> SinglePartitionReadCommand after searching on-disk index) is more
>>>>>> effective than switching from Trie to Prefix BTree.
>>>>>
>>>>> I haven't suggested switching to Prefix B-Tree or any other structure,
>>>>> the question was about rationale and motivation of picking one over the
>>>>> other, which I am curious about for personal reasons/interests that lie
>>>>> outside of Cassandra. Having this listed in CEP could have been helpful
>>>>> for future guidance. It's ok if this question is outside of the CEP scope.
>>>>>
>>>>> I also agree that there are many areas that require improvement around
>>>>> the read/write path and 2i, many of which (even outside of base table
>>>>> format or read perf) can yield positive performance results.
>>>>>
>>>>>> FWIW, I personally look forward to receiving that contribution when
>>>>>> the time is right.
>>>>>
>>>>> I am very excited for this contribution, too, and it looks like very
>>>>> solid work.
>>>>>
>>>>> I have one more question, about "Upon resolving partition keys, rows are
>>>>> loaded using Cassandra’s internal partition read command across SSTables
>>>>> and are post filtered". One of the criticisms of SASI and reasons for
>>>>> marking it as experimental was CASSANDRA-11990. I think switching to row
>>>>> offsets also has a huge impact on interaction with SPRC and has some
>>>>> potential for optimisations. Question is: is this planned as a next step?
>>>>> If yes, how are we going to mark SAI as experimental until it gets
>>>>> row offsets?
>>>>> Also, it is likely that index format is going to change when
>>>>> row offsets are added, so my concern is that we may have to support two
>>>>> versions of a format for a smooth migration.
>>>>>
>>>>> On Thu, Sep 24, 2020 at 6:53 AM Jasonstack Zhao Yang <[email protected]>
>>>>> wrote:
>>>>>
>>>>>>>> I think CEP should be more upfront with "eventually replace
>>>>>>>> it" bit, since it raises the question about what the people who are
>>>>>>>> using other index implementations can expect.
>>>>>>
>>>>>> Will update the CEP to emphasize: SAI will replace other indexes.
>>>>>>
>>>>>>>> Unfortunately, I do not have an
>>>>>>>> implementation sitting around for a direct comparison, but I can
>>>>>>>> imagine situations when B-Trees may perform better because of simpler
>>>>>>>> construction. Maybe we should even consider prototyping a prefix
>>>>>>>> B-Tree to have a more fair comparison.
>>>>>>
>>>>>> As long as prefix BTree supports range/prefix aggregation (which is used
>>>>>> to speed up range/prefix query when matching entire subtree), we can
>>>>>> plug it in and compare. It won't affect the CEP design which focuses on
>>>>>> sharing data across indexes and posting aggregation.
>>>>>>
>>>>>> But for improving overall index read performance, I think improving base
>>>>>> table read perf (because SAI/SASI executes LOTS of
>>>>>> SinglePartitionReadCommand after searching on-disk index) is more
>>>>>> effective than switching from Trie to Prefix BTree.
>>>>>>
>>>>>> On Thu, 24 Sep 2020 at 05:33, Benedict Elliott Smith <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> FWIW, I personally look forward to receiving that contribution when the
>>>>>>> time is right.
>>>>>>>
>>>>>>> On 23/09/2020, 18:45, "Josh McKenzie" <[email protected]> wrote:
>>>>>>>
>>>>>>> talking about that would involve some bits of information DataStax
>>>>>>> might not be ready to share?
>>>>>>>
>>>>>>> At the risk of derailing, I've been poking and prodding this week at
>>>>>>> we contributors at DS getting our act together w/a draft CEP for
>>>>>>> donating the trie-based indices to the ASF project.
>>>>>>>
>>>>>>> More to come; the intention is certainly to contribute that code. The
>>>>>>> lack of a destination to merge it into (i.e. no 5.0-dev branch) is
>>>>>>> removing significant urgency from the process as well (not to open a
>>>>>>> 3rd Pandora's box), but there's certainly an interrelatedness to the
>>>>>>> conversations going on.
>>>>>>>
>>>>>>> ---
>>>>>>> Josh McKenzie
>>>>>>>
>>>>>>> On Wed, Sep 23, 2020 at 12:48 PM, Caleb Rackliffe <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> As long as we can construct the on-disk indexes efficiently/directly
>>>>>>>> from a Memtable-attached index on flush, there's room to try other
>>>>>>>> data structures.
>>>>>>>> Most of the innovation in SAI is around the layout of postings
>>>>>>>> (something we can expand on if people are interested) and having a
>>>>>>>> natively row-oriented design that scales w/ multiple indexed columns
>>>>>>>> on single SSTables. There are some broader implications of using the
>>>>>>>> trie that reach outside SAI itself, but talking about that would
>>>>>>>> involve some bits of information DataStax might not be ready to share?
>>>>>>>>
>>>>>>>> On Wed, Sep 23, 2020 at 11:00 AM Jeremiah D Jordan
>>>>>>>> <jeremiah.jordan@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Short question: looking forward, how are we going to maintain three
>>>>>>>> 2i implementations: SASI, SAI, and 2i?
>>>>>>>>
>>>>>>>> I think one of the goals stated in the CEP is for SAI to have parity
>>>>>>>> with 2i such that it could eventually replace it.
>>>>>>>>
>>>>>>>> On Sep 23, 2020, at 10:34 AM, Oleksandr Petrov <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Short question: looking forward, how are we going to maintain three
>>>>>>>> 2i implementations: SASI, SAI, and 2i?
>>>>>>>>
>>>>>>>> Another thing I think this CEP is missing is rationale and motivation
>>>>>>>> about why trie-based indexes were chosen over, say, B-Tree. We did
>>>>>>>> have a short discussion about this on Slack, but both arguments that
>>>>>>>> I've heard (space-saving and keeping a small subset of nodes in
>>>>>>>> memory) work only for the most primitive implementation of a B-Tree.
>>>>>>>> Fully-occupied prefix B-Tree can have similar properties. There's
>>>>>>>> been a lot of research on B-Trees and optimisations in those.
>>>>>>>> Unfortunately, I do not have an implementation sitting around for a
>>>>>>>> direct comparison, but I can imagine situations when B-Trees may
>>>>>>>> perform better because of simpler construction.
>>>>>>>>
>>>>>>>> Maybe we should even consider prototyping a prefix B-Tree to have a
>>>>>>>> more fair comparison.
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> -- Alex
>>>>>>>>
>>>>>>>> On Thu, Sep 10, 2020 at 9:12 AM Jasonstack Zhao Yang
>>>>>>>> <jasonstack.zhao@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Thank you Patrick for hosting Cassandra Contributor Meeting for CEP-7
>>>>>>>> SAI.
>>>>>>>>
>>>>>>>> The recorded video is available here:
>>>>>>>>
>>>>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-09-01+Apache+Cassandra+Contributor+Meeting
>>>>>>>>
>>>>>>>> On Tue, 1 Sep 2020 at 14:34, Jasonstack Zhao Yang
>>>>>>>> <jasonstack.zhao@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Thank you, Charles and Patrick
>>>>>>>>
>>>>>>>> On Tue, 1 Sep 2020 at 04:56, Charles Cao <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Thank you, Patrick!
>>>>>>>>
>>>>>>>> On Mon, Aug 31, 2020 at 12:59 PM Patrick McFadin <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> I just moved it to 8AM for this meeting to better accommodate APAC.
>>>>>>>> Please see the update here:
>>>>>>>>
>>>>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
>>>>>>>>
>>>>>>>> Patrick
>>>>>>>>
>>>>>>>> On Mon, Aug 31, 2020 at 10:04 AM Charles Cao <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Patrick,
>>>>>>>>
>>>>>>>> 11AM PST is a bad time for the people in the APAC timezone. Can we
>>>>>>>> move it to 7 or 8AM PST in the morning to accommodate their needs?
>>>>>>>>
>>>>>>>> ~Charles
>>>>>>>>
>>>>>>>> On Fri, Aug 28, 2020 at 4:37 PM Patrick McFadin <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Meeting scheduled.
>>>>>>>>
>>>>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
>>>>>>>>
>>>>>>>> Tuesday September 1st, 11AM PST. I added a basic bullet for the
>>>>>>>> agenda but if there is more, edit away.
>>>>>>>>
>>>>>>>> Patrick
>>>>>>>>
>>>>>>>> On Thu, Aug 27, 2020 at 11:31 AM Jasonstack Zhao Yang
>>>>>>>> <jasonstack.zhao@gmail.com> wrote:
>>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> On Thu, 27 Aug 2020 at 04:52, Ekaterina Dimitrova <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> On Wed, 26 Aug 2020 at 16:48, Caleb Rackliffe <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> On Wed, Aug 26, 2020, 3:45 PM Patrick McFadin <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> This is related to the discussion Jordan and I had about the
>>>>>>>> contributor Zoom call. Instead of open mic for any issue, call it
>>>>>>>> based on a discussion thread or threads for higher bandwidth
>>>>>>>> discussion.
>>>>>>>>
>>>>>>>> I would be happy to schedule one for next week to specifically
>>>>>>>> discuss CEP-7. I can attach the recorded call to the CEP after.
>>>>>>>>
>>>>>>>> +1 or -1?
>>>>>>>>
>>>>>>>> Patrick
>>>>>>>>
>>>>>>>> On Tue, Aug 25, 2020 at 7:03 AM Joshua McKenzie <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Does community plan to open another discussion or CEP on
>>>>>>>> modularization?
>>>>>>>>
>>>>>>>> We probably should have a discussion on the ML or monthly contrib
>>>>>>>> call about it first to see how aligned the interested contributors
>>>>>>>> are. Could do that through CEP as well but CEP's (at least thus far
>>>>>>>> sans k8s operator) tend to start with a strong, deeply thought out
>>>>>>>> point of view being expressed.
>>>>>>>>
>>>>>>>> On Tue, Aug 25, 2020 at 3:26 AM Jasonstack Zhao Yang
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> SASI's performance, specifically the search in the B+ tree component,
>>>>>>>> depends a lot on the component file's header being available in the
>>>>>>>> pagecache. SASI benefits from (needs) nodes with lots of RAM. Is SAI
>>>>>>>> bound to this same or similar limitation?
>>>>>>>>
>>>>>>>> SAI also benefits from larger memory because SAI puts block info on
>>>>>>>> heap for searching on-disk components and having cross-index files on
>>>>>>>> page cache improves read performance of different indexes on the same
>>>>>>>> table.
>>>>>>>>
>>>>>>>> Flushing of SASI can be CPU+IO intensive, to the point of saturation,
>>>>>>>> pauses, and crashes on the node.
>>>>>>>> SSDs are a must, along with a bit of tuning, just to avoid bringing
>>>>>>>> down your cluster. Beyond reducing space requirements, does SAI
>>>>>>>> improve on these things? Like SASI how does SAI, in its own way,
>>>>>>>> change/narrow the recommendations on node hardware specs?
>>>>>>>>
>>>>>>>> SAI won't crash the node during compaction and requires less CPU/IO.
>>>>>>>>
>>>>>>>> * SAI defines global memory limit for compaction instead of per-index
>>>>>>>> memory limit used by SASI.
>>>>>>>>
>>>>>>>> For example, compactions are running on 10 tables and each has 10
>>>>>>>> indexes. SAI will cap the memory usage with global limit while SASI
>>>>>>>> may use up to 100 * per-index limit.
>>>>>>>>
>>>>>>>> * After flushing in-memory segments to disk, SAI won't merge on-disk
>>>>>>>> segments while SASI attempts to merge them at the end.
>>>>>>>>
>>>>>>>> There are pros and cons of not merging segments:
>>>>>>>>
>>>>>>>> ** Pros: compaction runs faster and requires fewer resources.
>>>>>>>>
>>>>>>>> ** Cons: small segments reduce compression ratio.
>>>>>>>>
>>>>>>>> * SAI on-disk format with row ids compresses better.
>>>>>>>>
>>>>>>>> I understand the desire in keeping out of scope the longer term
>>>>>>>> deprecation and migration plan, but… if SASI provides functionality
>>>>>>>> that SAI doesn't, like tokenisation and DelimiterAnalyzer, yet
>>>>>>>> introduces a body of code ~somewhat similar, shouldn't we be roughly
>>>>>>>> sketching out how to reduce the maintenance surface area?
>>>>>>>>
>>>>>>>> Agreed that we should reduce maintenance area if possible, but only
>>>>>>>> very limited code base (e.g. RangeIterator, QueryPlan) can be shared.
>>>>>>>> The rest of the code base is quite different because of on-disk
>>>>>>>> format and cross-index files.
>>>>>>>>
>>>>>>>> The goal of this CEP is to get community buy-in on SAI's design.
>>>>>>>> Tokenization, DelimiterAnalyzer should be straightforward to
>>>>>>>> implement on top of SAI.
>>>>>>>>
>>>>>>>> Can we list what configurations of SASI will become deprecated once
>>>>>>>> SAI becomes non-experimental?
>>>>>>>>
>>>>>>>> Except for "Like", "Tokenisation", "DelimiterAnalyzer", the rest of
>>>>>>>> SASI can be replaced by SAI.
>>>>>>>>
>>>>>>>> Given a few bugs are open against 2i and SASI, can we provide some
>>>>>>>> overview, or rough indication, of how many of them we could "triage
>>>>>>>> away"?
>>>>>>>>
>>>>>>>> I believe most of the known bugs in 2i/SASI either have been
>>>>>>>> addressed in SAI or don't apply to SAI.
>>>>>>>>
>>>>>>>> And, is it time for the project to start introducing new SPI
>>>>>>>> implementations as separate sub-modules and jar files that are only
>>>>>>>> loaded at runtime based on configuration settings? (sorry for the
>>>>>>>> conflation on this one, but maybe it's the right time to raise it
>>>>>>>> :shrug:)
>>>>>>>>
>>>>>>>> Agreed that modularization is the way to go and will speed up module
>>>>>>>> development speed.
>>>>>>>>
>>>>>>>> Does community plan to open another discussion or CEP on
>>>>>>>> modularization?
>>>>>>>>
>>>>>>>> On Mon, 24 Aug 2020 at 16:43, Mick Semb Wever <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Adding to Duy's questions…
>>>>>>>>
>>>>>>>> * Hardware specs
>>>>>>>>
>>>>>>>> SASI's performance, specifically the search in the B+ tree component,
>>>>>>>> depends a lot on the component file's header being available in the
>>>>>>>> pagecache. SASI benefits from (needs) nodes with lots of RAM. Is SAI
>>>>>>>> bound to this same or similar limitation?
>>>>>>>>
>>>>>>>> Flushing of SASI can be CPU+IO intensive, to the point of saturation,
>>>>>>>> pauses, and crashes on the node. SSDs are a must, along with a bit of
>>>>>>>> tuning, just to avoid bringing down your cluster. Beyond reducing
>>>>>>>> space requirements, does SAI improve on these things? Like SASI how
>>>>>>>> does SAI, in its own way, change/narrow the recommendations on node
>>>>>>>> hardware specs?
>>>>>>>>
>>>>>>>> * Code Maintenance
>>>>>>>>
>>>>>>>> I understand the desire in keeping out of scope the longer term
>>>>>>>> deprecation and migration plan, but… if SASI provides functionality
>>>>>>>> that SAI doesn't, like tokenisation and DelimiterAnalyzer, yet
>>>>>>>> introduces a body of code ~somewhat similar, shouldn't we be roughly
>>>>>>>> sketching out how to reduce the maintenance surface area?
>>>>>>>>
>>>>>>>> Can we list what configurations of SASI will become deprecated once
>>>>>>>> SAI becomes non-experimental?
>>>>>>>>
>>>>>>>> Given a few bugs are open against 2i and SASI, can we provide some
>>>>>>>> overview, or rough indication, of how many of them we could "triage
>>>>>>>> away"?
>>>>>>>>
>>>>>>>> And, is it time for the project to start introducing new SPI
>>>>>>>> implementations as separate sub-modules and jar files that are only
>>>>>>>> loaded at runtime based on configuration settings? (sorry for the
>>>>>>>> conflation on this one, but maybe it's the right time to raise it
>>>>>>>> :shrug:)
>>>>>>>>
>>>>>>>> regards,
>>>>>>>> Mick
>>>>>>>>
>>>>>>>> On Tue, 18 Aug 2020 at 13:05, DuyHai Doan <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Thank you Zhao Yang for starting this topic
>>>>>>>>
>>>>>>>> After reading the short design doc, I have a few questions
>>>>>>>>
>>>>>>>> 1) SASI was pretty inefficient indexing wide partitions because the
>>>>>>>> index structure only retains the partition token, not the clustering
>>>>>>>> columns. As per design doc SAI has row id mapping to partition
>>>>>>>> offset, can we hope that indexing wide partition will be more
>>>>>>>> efficient with SAI?
>>>>>>>> One detail that worries me is that in the beginning of the design
>>>>>>>> doc, it is said that the matching rows are post filtered while
>>>>>>>> scanning the partition. Can you confirm or refute that SAI is
>>>>>>>> efficient with wide partitions and provides the partition offsets to
>>>>>>>> the matching rows?
>>>>>>>>
>>>>>>>> 2) About space efficiency, one of the biggest drawbacks of SASI was
>>>>>>>> the huge space required for index structure when using CONTAINS logic
>>>>>>>> because of the decomposition of text columns into n-grams. Will SAI
>>>>>>>> suffer from the same issue in future iterations? I'm anticipating a
>>>>>>>> bit
>>>>>>>>
>>>>>>>> 3) If I'm querying using SAI and providing complete partition key,
>>>>>>>> will it be more efficient than querying without partition key? In
>>>>>>>> other words, does SAI provide any optimisation when partition key is
>>>>>>>> specified?
>>>>>>>>
>>>>>>>> Regards
>>>>>>>>
>>>>>>>> Duy Hai DOAN
>>>>>>>>
>>>>>>>> On Tue, 18 Aug 2020 at 11:39, Mick Semb Wever <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> We are looking forward to the community's feedback and suggestions.
>>>>>>>>
>>>>>>>> What comes immediately to mind is testing requirements. It has been
>>>>>>>> mentioned already that the project's testability and QA guidelines
>>>>>>>> are inadequate to successfully introduce new features and
>>>>>>>> refactorings to the codebase. During the 4.0 beta phase this was
>>>>>>>> intended to be addressed, i.e. defining more specific QA guidelines
>>>>>>>> for 4.0-rc. This would be an important step towards QA guidelines for
>>>>>>>> all changes and CEPs post-4.0.
>>>>>>>>
>>>>>>>> Questions from me
>>>>>>>>
>>>>>>>> - How will this be tested, how will its QA status and lifecycle be
>>>>>>>> defined? (per above)
>>>>>>>>
>>>>>>>> - With existing C* code needing to be changed, what is the proposed
>>>>>>>> plan for making those changes ensuring maintained QA, e.g. are there
>>>>>>>> separate QA cycles planned for altering the SPI before adding a new
>>>>>>>> SPI implementation?
>>>>>>>>
>>>>>>>> - Despite being out of scope, it would be nice to have some idea from
>>>>>>>> the CEP author of when users might still choose afresh 2i or SASI
>>>>>>>> over SAI,
>>>>>>>>
>>>>>>>> - Who fills the roles involved? Who are the contributors in this
>>>>>>>> DataStax team? Who is the shepherd? Are there other stakeholders
>>>>>>>> willing to be involved?
>>>>>>>>
>>>>>>>> - Is there a preference to use gdoc instead of the project's wiki,
>>>>>>>> and why? (the CEP process suggests a wiki page, and feedback on why
>>>>>>>> another approach is considered better helps evolve the CEP process
>>>>>>>> itself)
>>>>>>>>
>>>>>>>> cheers,
>>>>>>>> Mick
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>>> For additional commands, e-mail: [email protected]
>>>>>>>>
>>>>>>>> --
>>>>>>>> alex p
>>>>> --
>>>>> alex p
