Given that the distributed search problem is an issue with our secondary indexes in general, and not with any particular implementation, I don’t see a reason to hold up a vote on CEP-7 for it.
-Jeremiah

> On Feb 2, 2022, at 10:01 AM, Henrik Ingo <henrik.i...@datastax.com> wrote:
>
> So this is an area I've thought about and in fact the overall dynamics are the same as for MongoDB secondary indexes in a sharded cluster. The TL;DR is that the benefits far outweigh the limitations:
>
> * There's a large area of queries where you have the partition key but not the full Primary Key. SAI (now with row awareness) is an efficient solution for such queries.
> * A special case of the above is that you have a partition key (or keys) but want to sort by something other than the clustering key. However, note that the current version of SAI doesn't actually support sorting.
> * Your cluster has at most 10-20 nodes and the share of queries that lack a partition key is at most 5% - 10%.
> * Even for very large clusters, a low frequency of queries without partition key is fine.
>
> If all of the above was obvious and the discussion was only about what Guardrails we may want to set to warn or stop the use, then apologies... I would suggest the guardrail could be that if the share of non-pk queries *on each node* is above 33% guardrails should warn, and if it's above 66% it should fail the non-pk queries.
>
> I blogged about the math behind scalability of secondary indexes a year ago:
> https://web.archive.org/web/20210814021809/https://www.openlife.cc/blogs/2020/november/scalability-model-cassandra
>
> henrik
>
> On Wed, Feb 2, 2022 at 3:59 PM Joshua McKenzie <jmcken...@apache.org> wrote:
> To me the outstanding thing worth tackling is the Challenges section Caleb added in the CEP. Specifically:
> "The only "easy" way around these two challenges is to focus our efforts on queries that are restricted to either partitions or small token ranges.
> These queries behave well locally even on LCS (given levels contain token-disjoint SSTables, and assuming a low number of unleveled SSTables), avoid fan-out and all of its secondary pitfalls, and allow us to make queries at varying CLs with reasonable performance. Attempting to fix the local problems around compaction strategy could mean either restricted strategy usage or partially abandoning SSTable-attachment. Attempting to fix distributed read path problems by pushing the design towards IR systems like ES could compromise our ability to use higher read CLs."
>
> This is probably something we could integrate with Guardrails out of the gate to discourage suboptimal use, right? Or at least allude to in the CEP so it's something on our radar.
>
> One of the big downfalls of Materialized Views (aside from the orphaned data and inconsistency pains) was the lack of limits on creation of them (either number or structure / data amount) with serious un-inspectable implications on disk usage and performance. The more we can learn from those missteps the better.
>
> On Wed, Feb 2, 2022 at 8:24 AM Mike Adamson <madam...@datastax.com> wrote:
> Hi,
>
> I’d like to restart this thread.
>
> We merged the row-aware branch to the SAI codebase just before Christmas and have subsequently updated the CEP to reflect these changes.
>
> I would like to move the discussion forward as to how we move this CEP towards a vote.
>
> MikeA
>
>> On 16 Sep 2021, at 19:49, DuyHai Doan <doanduy...@gmail.com> wrote:
>>
>> Good news Mike that row-based indexing will be available, this was a major gap in SASI at the time!
>>
>> On Thu, 16 Sep 2021 at 15:38, Mike Adamson <madam...@datastax.com> wrote:
>>
>>> Hi,
>>>
>>> Just to keep this thread up to date with development progress, we will be adding row-aware support to SAI in the next few weeks.
>>> This is currently going through the final stages of review and testing.
>>>
>>> This feature also adds on-disk versioning to SAI. This allows SAI to support multiple on-disk formats during upgrades.
>>>
>>> I am mentioning this now because the CEP mentions “Partition Based Iteration” as an initial feature. We will change that to “Row Based Iteration” when the feature is merged.
>>>
>>> MikeA
>>>
>>>> On 15 Sep 2021, at 19:42, Caleb Rackliffe <calebrackli...@gmail.com> wrote:
>>>>
>>>> Hey there,
>>>>
>>>> In the spirit of trying to get as many possible objections to a successful vote out of the way, I've added a "Challenges" section to the CEP:
>>>>
>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-7%3A+Storage+Attached+Index#CEP7:StorageAttachedIndex-Challenges
>>>>
>>>> Most of you will be familiar with these, but I think we need to be as open/candid as possible about the potential risk they pose to SAI's broader usability. I've described them from the point of view that they are not intractable, but if anyone thinks they are, let's hash that disagreement out.
>>>>
>>>> Thanks!
>>>>
>>>> On Thu, Sep 9, 2021 at 11:13 AM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>
>>>>> +1 on introducing this in an incremental manner and after reading through CASSANDRA-16092 that seems like a perfect place to start.
>>>>> I see that work on that Jira has stopped until direction for CEP-7 has been voted in.
>>>>>
>>>>> I say start the vote and let's get this really valuable developer feature underway.
>>>>>
>>>>> Patrick
>>>>>
>>>>> On Tue, Sep 7, 2021 at 10:40 AM Caleb Rackliffe <calebrackli...@gmail.com> wrote:
>>>>>
>>>>>> So this thread stalled almost a year ago. (Wow, time flies when you're trying to release 4.0.) My synthesis of the conversation to this point is that while there are some open questions about testing methodology/"definition of done" and our choice of particular on-disk data structures, neither of these should be a serious obstacle to moving forward w/ a vote. Having said that, is there anything left around the CEP that we feel should prevent it from moving to a vote?
>>>>>>
>>>>>> In terms of how we would proceed from the point a vote passes, it seems like there have been enough concerns around the proposed/necessary breaking changes to the 2i API, that we will start development by introducing components as incrementally as possible into a long-running feature branch off trunk. (This work would likely start w/ *CASSANDRA-16092* <https://issues.apache.org/jira/browse/CASSANDRA-16092>, which we could resolve as a sub-task of the SAI epic without interfering with other trunk development likely destined for a 4.x minor, etc.)
>>>>>>
>>>>>> On Thu, Sep 24, 2020 at 2:47 AM Jasonstack Zhao Yang <jasonstack.z...@gmail.com> wrote:
>>>>>>
>>>>>>>>> Question is: is this planned as a next step?
>>>>>>>>> If yes, how are we going to mark SAI as experimental until it gets row offsets?
>>>>>>>>> Also, it is likely that index format is going to change when row offsets are added, so my concern is that we may have to support two versions of a format for a smooth migration.
>>>>>>>
>>>>>>> The goal is to support row-level index when merging SAI, I will update the CEP about it.
>>>>>>>
>>>>>>>>> I think switching to row offsets also has a huge impact on interaction with SPRC and has some potential for optimisations.
>>>>>>>
>>>>>>> Can you share more details on the optimizations?
>>>>>>>
>>>>>>> On Thu, 24 Sep 2020 at 15:20, Oleksandr Petrov <oleksandr.pet...@gmail.com> wrote:
>>>>>>>
>>>>>>>>> But for improving overall index read performance, I think improving base table read perf (because SAI/SASI executes LOTS of SinglePartitionReadCommand after searching on-disk index) is more effective than switching from Trie to Prefix BTree.
>>>>>>>>
>>>>>>>> I haven't suggested switching to Prefix B-Tree or any other structure, the question was about rationale and motivation of picking one over the other, which I am curious about for personal reasons/interests that lie outside of Cassandra. Having this listed in CEP could have been helpful for future guidance. It's ok if this question is outside of the CEP scope.
>>>>>>>>
>>>>>>>> I also agree that there are many areas that require improvement around the read/write path and 2i, many of which (even outside of base table format or read perf) can yield positive performance results.
>>>>>>>>
>>>>>>>>> FWIW, I personally look forward to receiving that contribution when the time is right.
>>>>>>>>
>>>>>>>> I am very excited for this contribution, too, and it looks like very solid work.
>>>>>>>>
>>>>>>>> I have one more question, about "Upon resolving partition keys, rows are loaded using Cassandra’s internal partition read command across SSTables and are post filtered". One of the criticisms of SASI and reasons for marking it as experimental was CASSANDRA-11990. I think switching to row offsets also has a huge impact on interaction with SPRC and has some potential for optimisations. Question is: is this planned as a next step? If yes, how are we going to mark SAI as experimental until it gets row offsets? Also, it is likely that index format is going to change when row offsets are added, so my concern is that we may have to support two versions of a format for a smooth migration.
>>>>>>>>
>>>>>>>> On Thu, Sep 24, 2020 at 6:53 AM Jasonstack Zhao Yang <jasonstack.z...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>>>> I think CEP should be more upfront with "eventually replace it" bit, since it raises the question about what the people who are using other index implementations can expect.
>>>>>>>>>
>>>>>>>>> Will update the CEP to emphasize: SAI will replace other indexes.
>>>>>>>>>
>>>>>>>>>>> Unfortunately, I do not have an implementation sitting around for a direct comparison, but I can imagine situations when B-Trees may perform better because of simpler construction. Maybe we should even consider prototyping a prefix B-Tree to have a more fair comparison.
>>>>>>>>>
>>>>>>>>> As long as prefix BTree supports range/prefix aggregation (which is used to speed up range/prefix query when matching entire subtree), we can plug it in and compare. It won't affect the CEP design which focuses on sharing data across indexes and posting aggregation.
>>>>>>>>>
>>>>>>>>> But for improving overall index read performance, I think improving base table read perf (because SAI/SASI executes LOTS of SinglePartitionReadCommand after searching on-disk index) is more effective than switching from Trie to Prefix BTree.
>>>>>>>>>
>>>>>>>>> On Thu, 24 Sep 2020 at 05:33, Benedict Elliott Smith <bened...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> FWIW, I personally look forward to receiving that contribution when the time is right.
>>>>>>>>>>
>>>>>>>>>> On 23/09/2020, 18:45, "Josh McKenzie" <jmcken...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>     talking about that would involve some bits of information DataStax might not be ready to share?
>>>>>>>>>>
>>>>>>>>>>     At the risk of derailing, I've been poking and prodding this week at we contributors at DS getting our act together w/a draft CEP for donating the trie-based indices to the ASF project.
>>>>>>>>>>
>>>>>>>>>>     More to come; the intention is certainly to contribute that code. The lack of a destination to merge it into (i.e. no 5.0-dev branch) is removing significant urgency from the process as well (not to open a 3rd Pandora's box), but there's certainly an interrelatedness to the conversations going on.
>>>>>>>>>>
>>>>>>>>>>     ---
>>>>>>>>>>     Josh McKenzie
>>>>>>>>>>
>>>>>>>>>>     On Wed, Sep 23, 2020 at 12:48 PM, Caleb Rackliffe <calebrackli...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> As long as we can construct the on-disk indexes efficiently/directly from a Memtable-attached index on flush, there's room to try other data structures. Most of the innovation in SAI is around the layout of postings (something we can expand on if people are interested) and having a natively row-oriented design that scales w/ multiple indexed columns on single SSTables.
>>>>>>>>>>> There are some broader implications of using the trie that reach outside SAI itself, but talking about that would involve some bits of information DataStax might not be ready to share?
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Sep 23, 2020 at 11:00 AM Jeremiah D Jordan <jeremiah.jordan@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Short question: looking forward, how are we going to maintain three 2i implementations: SASI, SAI, and 2i?
>>>>>>>>>>>
>>>>>>>>>>> I think one of the goals stated in the CEP is for SAI to have parity with 2i such that it could eventually replace it.
>>>>>>>>>>>
>>>>>>>>>>> On Sep 23, 2020, at 10:34 AM, Oleksandr Petrov <oleksandr.pet...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Short question: looking forward, how are we going to maintain three 2i implementations: SASI, SAI, and 2i?
>>>>>>>>>>>
>>>>>>>>>>> Another thing I think this CEP is missing is rationale and motivation about why trie-based indexes were chosen over, say, B-Tree. We did have a short discussion about this on Slack, but both arguments that I've heard (space-saving and keeping a small subset of nodes in memory) work only for the most primitive implementation of a B-Tree. Fully-occupied prefix B-Tree can have similar properties. There's been a lot of research on B-Trees and optimisations in those.
>>>>>>>>>>> Unfortunately, I do not have an implementation sitting around for a direct comparison, but I can imagine situations when B-Trees may perform better because of simpler construction.
>>>>>>>>>>>
>>>>>>>>>>> Maybe we should even consider prototyping a prefix B-Tree to have a more fair comparison.
>>>>>>>>>>>
>>>>>>>>>>> Thank you,
>>>>>>>>>>> -- Alex
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Sep 10, 2020 at 9:12 AM Jasonstack Zhao Yang <jasonstack.zhao@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Thank you Patrick for hosting Cassandra Contributor Meeting for CEP-7 SAI.
>>>>>>>>>>>
>>>>>>>>>>> The recorded video is available here:
>>>>>>>>>>>
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-09-01+Apache+Cassandra+Contributor+Meeting
>>>>>>>>>>>
>>>>>>>>>>> On Tue, 1 Sep 2020 at 14:34, Jasonstack Zhao Yang <jasonstack.zhao@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Thank you, Charles and Patrick
>>>>>>>>>>>
>>>>>>>>>>> On Tue, 1 Sep 2020 at 04:56, Charles Cao <caohair...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Thank you, Patrick!
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 31, 2020 at 12:59 PM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I just moved it to 8AM for this meeting to better accommodate APAC.
>>>>>>>>>>> Please see the update here:
>>>>>>>>>>>
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
>>>>>>>>>>>
>>>>>>>>>>> Patrick
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 31, 2020 at 10:04 AM Charles Cao <caohair...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Patrick,
>>>>>>>>>>>
>>>>>>>>>>> 11AM PST is a bad time for the people in the APAC timezone. Can we move it to 7 or 8AM PST in the morning to accommodate their needs?
>>>>>>>>>>>
>>>>>>>>>>> ~Charles
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Aug 28, 2020 at 4:37 PM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Meeting scheduled.
>>>>>>>>>>>
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
>>>>>>>>>>>
>>>>>>>>>>> Tuesday September 1st, 11AM PST. I added a basic bullet for the agenda but if there is more, edit away.
>>>>>>>>>>>
>>>>>>>>>>> Patrick
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Aug 27, 2020 at 11:31 AM Jasonstack Zhao Yang <jasonstack.zhao@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> +1
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 27 Aug 2020 at 04:52, Ekaterina Dimitrova <e.dimitr...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> +1
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 26 Aug 2020 at 16:48, Caleb Rackliffe <calebrackli...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> +1
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Aug 26, 2020, 3:45 PM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> This is related to the discussion Jordan and I had about the contributor Zoom call. Instead of open mic for any issue, call it based on a discussion thread or threads for higher bandwidth discussion.
>>>>>>>>>>>
>>>>>>>>>>> I would be happy to schedule one for next week to specifically discuss CEP-7. I can attach the recorded call to the CEP after.
>>>>>>>>>>>
>>>>>>>>>>> +1 or -1?
>>>>>>>>>>>
>>>>>>>>>>> Patrick
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Aug 25, 2020 at 7:03 AM Joshua McKenzie <jmcken...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Does community plan to open another discussion or CEP on modularization?
>>>>>>>>>>>
>>>>>>>>>>> We probably should have a discussion on the ML or monthly contrib call about it first to see how aligned the interested contributors are. Could do that through CEP as well but CEP's (at least thus far sans k8s operator) tend to start with a strong, deeply thought out point of view being expressed.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Aug 25, 2020 at 3:26 AM Jasonstack Zhao Yang <jasonstack.z...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> SASI's performance, specifically the search in the B+ tree component, depends a lot on the component file's header being available in the pagecache. SASI benefits from (needs) nodes with lots of RAM. Is SAI bound to this same or similar limitation?
>>>>>>>>>>>
>>>>>>>>>>> SAI also benefits from larger memory because SAI puts block info on heap for searching on-disk components and having cross-index files on page cache improves read performance of different indexes on the same table.
>>>>>>>>>>>
>>>>>>>>>>> Flushing of SASI can be CPU+IO intensive, to the point of saturation, pauses, and crashes on the node. SSDs are a must, along with a bit of tuning, just to avoid bringing down your cluster. Beyond reducing space requirements, does SAI improve on these things? Like SASI how does SAI, in its own way, change/narrow the recommendations on node hardware specs?
>>>>>>>>>>>
>>>>>>>>>>> SAI won't crash the node during compaction and requires less CPU/IO.
>>>>>>>>>>>
>>>>>>>>>>> * SAI defines global memory limit for compaction instead of per-index memory limit used by SASI.
>>>>>>>>>>> For example, compactions are running on 10 tables and each has 10 indexes. SAI will cap the memory usage with global limit while SASI may use up to 100 * per-index limit.
>>>>>>>>>>> * After flushing in-memory segments to disk, SAI won't merge on-disk segments while SASI attempts to merge them at the end.
>>>>>>>>>>> There are pros and cons of not merging segments:
>>>>>>>>>>> ** Pros: compaction runs faster and requires fewer resources.
>>>>>>>>>>> ** Cons: small segments reduce compression ratio.
>>>>>>>>>>> * SAI on-disk format with row ids compresses better.
>>>>>>>>>>>
>>>>>>>>>>> I understand the desire in keeping out of scope the longer term deprecation and migration plan, but… if SASI provides functionality that SAI doesn't, like tokenisation and DelimiterAnalyzer, yet introduces a body of code ~somewhat similar, shouldn't we be roughly sketching out how to reduce the maintenance surface area?
>>>>>>>>>>>
>>>>>>>>>>> Agreed that we should reduce maintenance area if possible, but only very limited code base (eg. RangeIterator, QueryPlan) can be shared. The rest of the code base is quite different because of on-disk format and cross-index files.
>>>>>>>>>>> The goal of this CEP is to get community buy-in on SAI's design.
>>>>>>>>>>> Tokenization, DelimiterAnalyzer should be straightforward to implement on top of SAI.
>>>>>>>>>>>
>>>>>>>>>>> Can we list what configurations of SASI will become deprecated once SAI becomes non-experimental?
>>>>>>>>>>>
>>>>>>>>>>> Except for "Like", "Tokenisation", "DelimiterAnalyzer", the rest of SASI can be replaced by SAI.
>>>>>>>>>>>
>>>>>>>>>>> Given a few bugs are open against 2i and SASI, can we provide some overview, or rough indication, of how many of them we could "triage away"?
>>>>>>>>>>>
>>>>>>>>>>> I believe most of the known bugs in 2i/SASI either have been addressed in SAI or don't apply to SAI.
>>>>>>>>>>>
>>>>>>>>>>> And, is it time for the project to start introducing new SPI implementations as separate sub-modules and jar files that are only loaded at runtime based on configuration settings?
>>>>>>>>>>> (sorry for the conflation on this one, but maybe it's the right time to raise it :shrug:)
>>>>>>>>>>>
>>>>>>>>>>> Agreed that modularization is the way to go and will speed up module development speed.
>>>>>>>>>>>
>>>>>>>>>>> Does community plan to open another discussion or CEP on modularization?
>>>>>>>>>>>
>>>>>>>>>>> On Mon, 24 Aug 2020 at 16:43, Mick Semb Wever <m...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Adding to Duy's questions…
>>>>>>>>>>>
>>>>>>>>>>> * Hardware specs
>>>>>>>>>>>
>>>>>>>>>>> SASI's performance, specifically the search in the B+ tree component, depends a lot on the component file's header being available in the pagecache. SASI benefits from (needs) nodes with lots of RAM. Is SAI bound to this same or similar limitation?
>>>>>>>>>>>
>>>>>>>>>>> Flushing of SASI can be CPU+IO intensive, to the point of saturation, pauses, and crashes on the node. SSDs are a must, along with a bit of tuning, just to avoid bringing down your cluster.
>>>>>>>>>>> Beyond reducing space requirements, does SAI improve on these
>>>>>>>>>>> things? As with SASI, how does SAI, in its own way, change or
>>>>>>>>>>> narrow the recommendations on node hardware specs?
>>>>>>>>>>>
>>>>>>>>>>> * Code Maintenance
>>>>>>>>>>>
>>>>>>>>>>> I understand the desire to keep the longer-term deprecation and
>>>>>>>>>>> migration plan out of scope, but… if SASI provides functionality
>>>>>>>>>>> that SAI doesn't, like tokenisation and DelimiterAnalyzer, yet
>>>>>>>>>>> introduces a body of code ~somewhat similar, shouldn't we be
>>>>>>>>>>> roughly sketching out how to reduce the maintenance surface
>>>>>>>>>>> area?
>>>>>>>>>>>
>>>>>>>>>>> Can we list what configurations of SASI will become deprecated
>>>>>>>>>>> once SAI becomes non-experimental?
>>>>>>>>>>>
>>>>>>>>>>> Given a few bugs are open against 2i and SASI, can we provide
>>>>>>>>>>> some overview, or rough indication, of how many of them we could
>>>>>>>>>>> "triage away"?
>>>>>>>>>>> And, is it time for the project to start introducing new SPI
>>>>>>>>>>> implementations as separate sub-modules and jar files that are
>>>>>>>>>>> only loaded at runtime based on configuration settings? (sorry
>>>>>>>>>>> for the conflation on this one, but maybe it's the right time to
>>>>>>>>>>> raise it :shrug:)
>>>>>>>>>>>
>>>>>>>>>>> regards,
>>>>>>>>>>> Mick
>>>>>>>>>>>
>>>>>>>>>>> On Tue, 18 Aug 2020 at 13:05, DuyHai Doan <doanduy...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Thank you Zhao Yang for starting this topic
>>>>>>>>>>>
>>>>>>>>>>> After reading the short design doc, I have a few questions
>>>>>>>>>>>
>>>>>>>>>>> 1) SASI was pretty inefficient indexing wide partitions because
>>>>>>>>>>> the index structure only retains the partition token, not the
>>>>>>>>>>> clustering columns. As per the design doc, SAI has a row id
>>>>>>>>>>> mapping to partition offset; can we hope that indexing wide
>>>>>>>>>>> partitions will be more efficient with SAI?
>>>>>>>>>>> One detail that worries me is that in the beginning of the
>>>>>>>>>>> design doc, it is said that the matching rows are post-filtered
>>>>>>>>>>> while scanning the partition. Can you confirm or refute that SAI
>>>>>>>>>>> is efficient with wide partitions and provides the partition
>>>>>>>>>>> offsets to the matching rows?
>>>>>>>>>>>
>>>>>>>>>>> 2) About space efficiency, one of the biggest drawbacks of SASI
>>>>>>>>>>> was the huge space required for the index structure when using
>>>>>>>>>>> CONTAINS logic because of the decomposition of text columns into
>>>>>>>>>>> n-grams. Will SAI suffer from the same issue in future
>>>>>>>>>>> iterations? I'm anticipating a bit
>>>>>>>>>>>
>>>>>>>>>>> 3) If I'm querying using SAI and providing the complete
>>>>>>>>>>> partition key, will it be more efficient than querying without
>>>>>>>>>>> the partition key? In other words, does SAI provide any
>>>>>>>>>>> optimisation when the partition key is specified?
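To make the space concern in question 2 concrete, here is a minimal Python sketch of how decomposing a text value into n-grams multiplies the number of terms an index has to store. It is only an illustration of the general technique: the function names `ngrams`, `all_ngrams` and `index_footprint` are hypothetical, the byte accounting is deliberately naive, and none of this is taken from the SASI or SAI code.

```python
def ngrams(text, n):
    """Return the distinct character n-grams of `text` for a fixed n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def all_ngrams(text, min_n=1):
    """Return every distinct substring of `text` with length >= min_n.

    Hypothetical worst case: a CONTAINS-style index that stores every
    n-gram so that any substring can be looked up directly.
    """
    terms = set()
    for n in range(min_n, len(text) + 1):
        terms |= ngrams(text, n)
    return terms


def index_footprint(values):
    """Return (raw_bytes, term_bytes) for a list of text values.

    Naive accounting (an assumption for illustration): each value
    contributes the UTF-8 size of all of its distinct n-grams, with no
    compression and no sharing of terms between values.
    """
    raw = sum(len(v.encode("utf-8")) for v in values)
    indexed = sum(
        len(t.encode("utf-8")) for v in values for t in all_ngrams(v)
    )
    return raw, indexed


if __name__ == "__main__":
    raw, indexed = index_footprint(["abcd"])
    print(len(all_ngrams("abcd")), raw, indexed)
```

For the single value "abcd" this gives 10 distinct terms and 20 bytes of terms against 4 bytes of raw text. The term count grows roughly quadratically with the length of the value, which is the kind of blow-up the question is asking about.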
>>>>>>>>>>> Regards
>>>>>>>>>>>
>>>>>>>>>>> Duy Hai DOAN
>>>>>>>>>>>
>>>>>>>>>>> On Tue, 18 Aug 2020 at 11:39, Mick Semb Wever <m...@apache.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> We are looking forward to the community's feedback and
>>>>>>>>>>> suggestions.
>>>>>>>>>>>
>>>>>>>>>>> What comes immediately to mind is testing requirements. It has
>>>>>>>>>>> been mentioned already that the project's testability and QA
>>>>>>>>>>> guidelines are inadequate to successfully introduce new features
>>>>>>>>>>> and refactorings to the codebase. During the 4.0 beta phase this
>>>>>>>>>>> was intended to be addressed, i.e. defining more specific QA
>>>>>>>>>>> guidelines for 4.0-rc. This would be an important step towards
>>>>>>>>>>> QA guidelines for all changes and CEPs post-4.0.
>>>>>>>>>>>
>>>>>>>>>>> Questions from me
>>>>>>>>>>>
>>>>>>>>>>> - How will this be tested, how will its QA status and lifecycle
>>>>>>>>>>> be defined? (per above)
>>>>>>>>>>>
>>>>>>>>>>> - With existing C* code needing to be changed, what is the
>>>>>>>>>>> proposed plan for making those changes while maintaining QA,
>>>>>>>>>>> e.g. are there separate QA cycles planned for altering the SPI
>>>>>>>>>>> before adding a new SPI implementation?
>>>>>>>>>>>
>>>>>>>>>>> - Despite being out of scope, it would be nice to have some idea
>>>>>>>>>>> from the CEP author of when users might still choose afresh 2i
>>>>>>>>>>> or SASI over SAI,
>>>>>>>>>>>
>>>>>>>>>>> - Who fills the roles involved? Who are the contributors in this
>>>>>>>>>>> DataStax team? Who is the shepherd? Are there other stakeholders
>>>>>>>>>>> willing to be involved?
>>>>>>>>>>>
>>>>>>>>>>> - Is there a preference to use gdoc instead of the project's
>>>>>>>>>>> wiki, and why? (the CEP process suggests a wiki page, and
>>>>>>>>>>> feedback on why another approach is considered better helps
>>>>>>>>>>> evolve the CEP process itself)
>>>>>>>>>>>
>>>>>>>>>>> cheers,
>>>>>>>>>>> Mick
>>>>>>>>>>>
>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>>>>>>>>>>> For additional commands, e-mail: dev-h...@cassandra.apache.org
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> alex p
>
> --
> Henrik Ingo
> +358 40 569 7354 <tel:358405697354>
> <https://www.datastax.com/> <https://twitter.com/DataStaxEng>
> <https://www.youtube.com/channel/UCqA6zOSMpQ55vvguq4Y0jAg>
> <https://www.linkedin.com/in/heingo/>