Re: [DISCUSS] CEP-7 Storage Attached Index

Jeremiah D Jordan Wed, 02 Feb 2022 08:18:19 -0800

Given that the distributed search part is an issue with our secondary indexes 
in general, and not with any particular implementation, I don’t see a reason 
to hold up a vote on CEP-7 for it.


-Jeremiah

> On Feb 2, 2022, at 10:01 AM, Henrik Ingo <henrik.i...@datastax.com> wrote:
> 
> So this is an area I've thought about and in fact the overall dynamics are 
> the same as for MongoDB secondary indexes in a sharded cluster. The TL:DR; is 
> that the benefits far outweigh the limitations:
> 
> * There's a large area of queries where you have the partition key but not 
> the full Primary Key. SAI (now with row awareness) is an efficient solution 
> for such queries.
> * A special case of the above is that you have a partition key (or keys) 
> but want to sort by something other than the clustering key. However, note 
> that the current version of SAI doesn't actually support sorting.
> * Your cluster has at most 10-20 nodes and the share of queries that lack a 
> partition key is at most 5% - 10%.
> * Even for very large clusters, a low frequency of queries without partition 
> key is fine.
> 
> If all of the above was obvious and the discussion was only about what 
> Guardrails we may want to set to warn or stop the use, then apologies... I 
> would suggest the guardrail could be that if the share of non-pk queries *on 
> each node* is above 33%, guardrails should warn, and if it's above 66%, they 
> should fail the non-pk queries.
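
For concreteness, the thresholds Henrik suggests could be sketched roughly as below. This is only an illustration, not an existing Cassandra Guardrails API: the names `non_pk_guardrail`, `Verdict`, `WARN_SHARE` and `FAIL_SHARE` are made up here, and it assumes each node already counts its own partition-key and non-partition-key queries over some window.

```python
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    WARN = "warn"
    FAIL = "fail"


# Thresholds taken from the suggestion in this thread (33% warn, 66% fail).
WARN_SHARE = 0.33
FAIL_SHARE = 0.66


def non_pk_guardrail(non_pk_queries: int, total_queries: int) -> Verdict:
    """Classify one node's share of queries that lack a partition key.

    Hypothetical helper: the two counters are assumed to be per-node
    counts collected over the same time window.
    """
    if total_queries <= 0:
        # No traffic observed on this node, so there is nothing to judge.
        return Verdict.ALLOW
    share = non_pk_queries / total_queries
    if share > FAIL_SHARE:
        return Verdict.FAIL
    if share > WARN_SHARE:
        return Verdict.WARN
    return Verdict.ALLOW
```

With these thresholds, a node seeing 40 non-pk queries out of 100 would warn, and one seeing 70 out of 100 would start failing the non-pk queries.
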
> 
> I blogged about the math behind scalability of secondary indexes a year ago: 
> https://web.archive.org/web/20210814021809/https://www.openlife.cc/blogs/2020/november/scalability-model-cassandra
> 
> henrik
> 
> On Wed, Feb 2, 2022 at 3:59 PM Joshua McKenzie <jmcken...@apache.org> wrote:
> To me the outstanding thing worth tackling is the Challenges section Caleb 
> added in the CEP. Specifically:
> "The only "easy" way around these two challenges is to focus our efforts on 
> queries that are restricted to either partitions or small token ranges. These 
> queries behave well locally even on LCS (given levels contain token-disjoint 
> SSTables, and assuming a low number of unleveled SSTables), avoid fan-out and 
> all of its secondary pitfalls, and allow us to make queries at varying CLs 
> with reasonable performance. Attempting to fix the local problems around 
> compaction strategy could mean either restricted strategy usage or partially 
> abandoning SSTable-attachment. Attempting to fix distributed read path 
> problems by pushing the design towards IR systems like ES could compromise 
> our ability to use higher read CLs."
> 
> This is probably something we could integrate with Guardrails out of the gate 
> to discourage suboptimal use, right? Or at least allude to in the CEP so it's 
> something on our radar.
> 
> One of the big downfalls of Materialized Views (aside from the orphaned data 
> and inconsistency pains) was the lack of limits on creation of them (either 
> number or structure / data amount) with serious un-inspectable implications 
> on disk usage and performance. The more we can learn from those missteps the 
> better.
> 
> On Wed, Feb 2, 2022 at 8:24 AM Mike Adamson <madam...@datastax.com> wrote:
> Hi,
> 
> I’d like to restart this thread.
> 
> We merged the row-aware branch to the SAI codebase just before Christmas and 
> have subsequently updated the CEP to reflect these changes.
> 
> I would like to move the discussion forward as to how we move this CEP 
> towards a vote.
> 
> MikeA
> 
>> On 16 Sep 2021, at 19:49, DuyHai Doan <doanduy...@gmail.com> wrote:
>> 
>> Good news, Mike, that row-based indexing will be available; this was a major
>> gap in SASI at the time!
>> 
>> On Thu, 16 Sep 2021 at 15:38, Mike Adamson <madam...@datastax.com> wrote:
>> 
>>> Hi,
>>> 
>>> Just to keep this thread up to date with development progress, we will be
>>> adding row-aware support to SAI in the next few weeks. This is currently
>>> going through the final stages of review and testing.
>>> 
>>> This feature also adds on-disk versioning to SAI. This allows SAI to
>>> support multiple on-disk formats during upgrades.
>>> 
>>> I am mentioning this now because the CEP mentions “Partition Based
>>> Iteration” as an initial feature. We will change that to “Row Based
>>> Iteration” when the feature is merged.
>>> 
>>> MikeA
>>> 
>>>> On 15 Sep 2021, at 19:42, Caleb Rackliffe <calebrackli...@gmail.com> wrote:
>>>> 
>>>> Hey there,
>>>> 
>>>> In the spirit of trying to get as many possible objections to a successful
>>>> vote out of the way, I've added a "Challenges" section to the CEP:
>>>> 
>>>> 
>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-7%3A+Storage+Attached+Index#CEP7:StorageAttachedIndex-Challenges
>>>> 
>>>> 
>>>> Most of you will be familiar with these, but I think we need to be as
>>>> open/candid as possible about the potential risk they pose to SAI's broader
>>>> usability. I've described them from the point of view that they are not
>>>> intractable, but if anyone thinks they are, let's hash that disagreement
>>>> out.
>>>> 
>>>> Thanks!
>>>> 
>>>> On Thu, Sep 9, 2021 at 11:13 AM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>> 
>>>>> +1 on introducing this in an incremental manner and after reading through
>>>>> CASSANDRA-16092 that seems like a perfect place to start. I see that work
>>>>> on that Jira has stopped until direction for CEP-7 has been voted in.
>>>>> 
>>>>> I say start the vote and let's get this really valuable developer feature
>>>>> underway.
>>>>> 
>>>>> Patrick
>>>>> 
>>>>> On Tue, Sep 7, 2021 at 10:40 AM Caleb Rackliffe <calebrackli...@gmail.com> wrote:
>>>>> 
>>>>>> So this thread stalled almost a year ago. (Wow, time flies when you're
>>>>>> trying to release 4.0.) My synthesis of the conversation to this point is
>>>>>> that while there are some open questions about testing
>>>>>> methodology/"definition of done" and our choice of particular on-disk data
>>>>>> structures, neither of these should be a serious obstacle to moving forward
>>>>>> w/ a vote. Having said that, is there anything left around the CEP that we
>>>>>> feel should prevent it from moving to a vote?
>>>>>> 
>>>>>> In terms of how we would proceed from the point a vote passes, it seems
>>>>>> like there have been enough concerns around the proposed/necessary breaking
>>>>>> changes to the 2i API, that we will start development by introducing
>>>>>> components as incrementally as possible into a long-running feature branch
>>>>>> off trunk. (This work would likely start w/ *CASSANDRA-16092*
>>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-16092>, which we could
>>>>>> resolve as a sub-task of the SAI epic without interfering with other trunk
>>>>>> development likely destined for a 4.x minor, etc.)
>>>>>> 
>>>>>> On Thu, Sep 24, 2020 at 2:47 AM Jasonstack Zhao Yang <jasonstack.z...@gmail.com> wrote:
>>>>>> 
>>>>>>>>> Question is: is this planned as a next step?
>>>>>>>>> If yes, how are we going to mark SAI as experimental until it gets
>>>>>>>>> row offsets? Also, it is likely that index format is going to change when
>>>>>>>>> row offsets are added, so my concern is that we may have to support two
>>>>>>>>> versions of a format for a smooth migration.
>>>>>>> 
>>>>>>> The goal is to support row-level indexing when merging SAI; I will update
>>>>>>> the CEP accordingly.
>>>>>>> 
>>>>>>>>> I think switching to row
>>>>>>>>> offsets also has a huge impact on interaction with SPRC and has some
>>>>>>>>> potential for optimisations.
>>>>>>> 
>>>>>>> Can you share more details on the optimizations?
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, 24 Sep 2020 at 15:20, Oleksandr Petrov <oleksandr.pet...@gmail.com> wrote:
>>>>>>> 
>>>>>>>>> But for improving overall index read performance, I think improving base
>>>>>>>>> table read perf (because SAI/SASI executes LOTS of
>>>>>>>>> SinglePartitionReadCommand after searching on-disk index) is more effective
>>>>>>>>> than switching from Trie to Prefix BTree.
>>>>>>>> 
>>>>>>>> I haven't suggested switching to Prefix B-Tree or any other structure, the
>>>>>>>> question was about rationale and motivation of picking one over the other,
>>>>>>>> which I am curious about for personal reasons/interests that lie outside of
>>>>>>>> Cassandra. Having this listed in CEP could have been helpful for future
>>>>>>>> guidance. It's ok if this question is outside of the CEP scope.
>>>>>>>> 
>>>>>>>> I also agree that there are many areas that require improvement around the
>>>>>>>> read/write path and 2i, many of which (even outside of base table format or
>>>>>>>> read perf) can yield positive performance results.
>>>>>>>> 
>>>>>>>>> FWIW, I personally look forward to receiving that contribution when the
>>>>>>>>> time is right.
>>>>>>>> 
>>>>>>>> I am very excited for this contribution, too, and it looks like very solid
>>>>>>>> work.
>>>>>>>> 
>>>>>>>> I have one more question, about "Upon resolving partition keys, rows are
>>>>>>>> loaded using Cassandra’s internal partition read command across SSTables
>>>>>>>> and are post filtered". One of the criticisms of SASI and reasons for
>>>>>>>> marking it as experimental was CASSANDRA-11990. I think switching to row
>>>>>>>> offsets also has a huge impact on interaction with SPRC and has some
>>>>>>>> potential for optimisations. Question is: is this planned as a next step?
>>>>>>>> If yes, how are we going to mark SAI as experimental until it gets
>>>>>>>> row offsets? Also, it is likely that index format is going to change when
>>>>>>>> row offsets are added, so my concern is that we may have to support two
>>>>>>>> versions of a format for a smooth migration.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Thu, Sep 24, 2020 at 6:53 AM Jasonstack Zhao Yang <jasonstack.z...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>>>> I think CEP should be more upfront with "eventually replace
>>>>>>>>>>> it" bit, since it raises the question about what the people who are
>>>>>>>>>>> using other index implementations can expect.
>>>>>>>>> 
>>>>>>>>> Will update the CEP to emphasize: SAI will replace other indexes.
>>>>>>>>> 
>>>>>>>>>>> Unfortunately, I do not have an
>>>>>>>>>>> implementation sitting around for a direct comparison, but I can imagine
>>>>>>>>>>> situations when B-Trees may perform better because of simpler construction.
>>>>>>>>>>> Maybe we should even consider prototyping a prefix B-Tree to have a more
>>>>>>>>>>> fair comparison.
>>>>>>>>> 
>>>>>>>>> As long as prefix BTree supports range/prefix aggregation (which is used to
>>>>>>>>> speed up range/prefix query when matching entire subtree), we can plug it in
>>>>>>>>> and compare. It won't affect the CEP design which focuses on sharing data
>>>>>>>>> across indexes and posting aggregation.
>>>>>>>>> 
>>>>>>>>> But for improving overall index read performance, I think improving base
>>>>>>>>> table read perf (because SAI/SASI executes LOTS of SinglePartitionReadCommand
>>>>>>>>> after searching on-disk index) is more effective than switching from Trie to
>>>>>>>>> Prefix BTree.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Thu, 24 Sep 2020 at 05:33, Benedict Elliott Smith <bened...@apache.org> wrote:
>>>>>>>>> 
>>>>>>>>>> FWIW, I personally look forward to receiving that contribution when the
>>>>>>>>>> time is right.
>>>>>>>>>> 
>>>>>>>>>> On 23/09/2020, 18:45, "Josh McKenzie" <jmcken...@apache.org> wrote:
>>>>>>>>>> 
>>>>>>>>>>   talking about that would involve some bits of information DataStax
>>>>>>>>>>   might not be ready to share?
>>>>>>>>>> 
>>>>>>>>>>   At the risk of derailing, I've been poking and prodding this week at
>>>>>>>>>>   us contributors at DS getting our act together w/a draft CEP for
>>>>>>>>>>   donating the trie-based indices to the ASF project.
>>>>>>>>>> 
>>>>>>>>>>   More to come; the intention is certainly to contribute that code.
>>>>>>>>>>   The lack of a destination to merge it into (i.e. no 5.0-dev branch)
>>>>>>>>>>   is removing significant urgency from the process as well (not to open
>>>>>>>>>>   a 3rd Pandora's box), but there's certainly an interrelatedness to
>>>>>>>>>>   the conversations going on.
>>>>>>>>>> 
>>>>>>>>>>   ---
>>>>>>>>>>   Josh McKenzie
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>   On Wed, Sep 23, 2020 at 12:48 PM, Caleb Rackliffe <calebrackli...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> As long as we can construct the on-disk indexes efficiently/directly
>>>>>>>>>>> from a Memtable-attached index on flush, there's room to try other
>>>>>>>>>>> data structures. Most of the innovation in SAI is around the layout
>>>>>>>>>>> of postings (something we can expand on if people are interested) and
>>>>>>>>>>> having a natively row-oriented design that scales w/ multiple indexed
>>>>>>>>>>> columns on single SSTables. There are some broader implications of
>>>>>>>>>>> using the trie that reach outside SAI itself, but talking about that
>>>>>>>>>>> would involve some bits of information DataStax might not be ready to
>>>>>>>>>>> share?
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Sep 23, 2020 at 11:00 AM Jeremiah D Jordan <jeremiah.jordan@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Short question: looking forward, how are we going to maintain three
>>>>>>>>>>> 2i implementations: SASI, SAI, and 2i?
>>>>>>>>>>> 
>>>>>>>>>>> I think one of the goals stated in the CEP is for SAI to have parity
>>>>>>>>>>> with 2i such that it could eventually replace it.
>>>>>>>>>>> 
>>>>>>>>>>> On Sep 23, 2020, at 10:34 AM, Oleksandr Petrov <oleksandr.pet...@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Short question: looking forward, how are we going to maintain three
>>>>>>>>>>> 2i implementations: SASI, SAI, and 2i?
>>>>>>>>>>> 
>>>>>>>>>>> Another thing I think this CEP is missing is rationale and motivation
>>>>>>>>>>> about why trie-based indexes were chosen over, say, B-Tree. We did
>>>>>>>>>>> have a short discussion about this on Slack, but both arguments that
>>>>>>>>>>> I've heard (space-saving and keeping a small subset of nodes in
>>>>>>>>>>> memory) work only for the most primitive implementation of a B-Tree.
>>>>>>>>>>> Fully-occupied prefix B-Tree can have similar properties. There's
>>>>>>>>>>> been a lot of research on B-Trees and optimisations in those.
>>>>>>>>>>> Unfortunately, I do not have an implementation sitting around for a
>>>>>>>>>>> direct comparison, but I can imagine situations when B-Trees may
>>>>>>>>>>> perform better because of simpler construction. Maybe we should even
>>>>>>>>>>> consider prototyping a prefix B-Tree to have a more fair comparison.
>>>>>>>>>>> 
>>>>>>>>>>> Thank you,
>>>>>>>>>>> -- Alex
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Sep 10, 2020 at 9:12 AM Jasonstack Zhao Yang <jasonstack.zhao@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Thank you Patrick for hosting Cassandra Contributor Meeting for CEP-7
>>>>>>>>>>> SAI.
>>>>>>>>>>> 
>>>>>>>>>>> The recorded video is available here:
>>>>>>>>>>> 
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-09-01+Apache+Cassandra+Contributor+Meeting
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, 1 Sep 2020 at 14:34, Jasonstack Zhao Yang <jasonstack.zhao@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Thank you, Charles and Patrick
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, 1 Sep 2020 at 04:56, Charles Cao <caohair...@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Thank you, Patrick!
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, Aug 31, 2020 at 12:59 PM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> I just moved it to 8AM for this meeting to better accommodate APAC.
>>>>>>>>>>> Please see the update here:
>>>>>>>>>>> 
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
>>>>>>>>>>> 
>>>>>>>>>>> Patrick
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, Aug 31, 2020 at 10:04 AM Charles Cao <caohair...@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Patrick,
>>>>>>>>>>> 
>>>>>>>>>>> 11AM PST is a bad time for the people in the APAC timezone. Can we
>>>>>>>>>>> move it to 7 or 8AM PST in the morning to accommodate their needs?
>>>>>>>>>>> 
>>>>>>>>>>> ~Charles
>>>>>>>>>>> 
>>>>>>>>>>> On Fri, Aug 28, 2020 at 4:37 PM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Meeting scheduled.
>>>>>>>>>>> 
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
>>>>>>>>>>> 
>>>>>>>>>>> Tuesday September 1st, 11AM PST. I added a basic bullet for the
>>>>>>>>>>> agenda but if there is more, edit away.
>>>>>>>>>>> 
>>>>>>>>>>> Patrick
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Aug 27, 2020 at 11:31 AM Jasonstack Zhao Yang <jasonstack.zhao@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> +1
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, 27 Aug 2020 at 04:52, Ekaterina Dimitrova <e.dimitr...@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> +1
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, 26 Aug 2020 at 16:48, Caleb Rackliffe <calebrackli...@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> +1
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Aug 26, 2020, 3:45 PM Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> This is related to the discussion Jordan and I had about the
>>>>>>>>>>> contributor Zoom call. Instead of open mic for any issue, call it
>>>>>>>>>>> based on a discussion thread or threads for higher bandwidth
>>>>>>>>>>> discussion.
>>>>>>>>>>> 
>>>>>>>>>>> I would be happy to schedule one for next week to specifically
>>>>>>>>>>> discuss CEP-7. I can attach the recorded call to the CEP after.
>>>>>>>>>>> 
>>>>>>>>>>> +1 or -1?
>>>>>>>>>>> 
>>>>>>>>>>> Patrick
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Aug 25, 2020 at 7:03 AM Joshua McKenzie <jmcken...@apache.org> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Does community plan to open another discussion or CEP on
>>>>>>>>>>> modularization?
>>>>>>>>>>> 
>>>>>>>>>>> We probably should have a discussion on the ML or monthly contrib
>>>>>>>>>>> call about it first to see how aligned the interested contributors
>>>>>>>>>>> are. Could do that through CEP as well but CEPs (at least thus far
>>>>>>>>>>> sans k8s operator) tend to start with a strong, deeply thought out
>>>>>>>>>>> point of view being expressed.
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Aug 25, 2020 at 3:26 AM Jasonstack Zhao Yang <jasonstack.z...@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> SASI's performance, specifically the search in the B+ tree
>>>>>>>>>>> component, depends a lot on the component file's header being
>>>>>>>>>>> available in the pagecache. SASI benefits from (needs) nodes with
>>>>>>>>>>> lots of RAM. Is SAI bound to this same or similar limitation?
>>>>>>>>>>> 
>>>>>>>>>>> SAI also benefits from larger memory because SAI puts block info on
>>>>>>>>>>> heap for searching on-disk components and having cross-index files
>>>>>>>>>>> on page cache improves read performance of different indexes on the
>>>>>>>>>>> same table.
>>>>>>>>>>> 
>>>>>>>>>>> Flushing of SASI can be CPU+IO intensive, to the point of
>>>>>>>>>>> saturation, pauses, and crashes on the node. SSDs are a must, along
>>>>>>>>>>> with a bit of tuning, just to avoid bringing down your cluster.
>>>>>>>>>>> Beyond reducing space requirements, does SAI improve on these
>>>>>>>>>>> things? Like SASI how does SAI, in its own way, change/narrow the
>>>>>>>>>>> recommendations on node hardware specs?
>>>>>>>>>>> 
>>>>>>>>>>> SAI won't crash the node during compaction and requires less CPU/IO.
>>>>>>>>>>> 
>>>>>>>>>>> * SAI defines global memory limit for compaction instead of
>>>>>>>>>>> per-index memory limit used by SASI.
>>>>>>>>>>> 
>>>>>>>>>>> For example, compactions are running on 10 tables and each has 10
>>>>>>>>>>> indexes. SAI will cap the memory usage with global limit while SASI
>>>>>>>>>>> may use up to 100 * per-index limit.
>>>>>>>>>>> 
>>>>>>>>>>> * After flushing in-memory segments to disk, SAI won't merge on-disk
>>>>>>>>>>> segments while SASI attempts to merge them at the end.
>>>>>>>>>>> 
>>>>>>>>>>> There are pros and cons of not merging segments:
>>>>>>>>>>> 
>>>>>>>>>>> ** Pros: compaction runs faster and requires fewer resources.
>>>>>>>>>>> 
>>>>>>>>>>> ** Cons: small segments reduce compression ratio.
>>>>>>>>>>> 
>>>>>>>>>>> * SAI on-disk format with row ids compresses better.
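
To put rough numbers on the memory-limit example above (10 tables with 10 indexes each), here is a small sketch of the arithmetic. The limit values (64 MB per index, 1024 MB global) are hypothetical placeholders, not SASI or SAI defaults, and the function name is made up for illustration.

```python
def worst_case_compaction_memory_mb(tables, indexes_per_table,
                                    per_index_limit_mb, global_limit_mb):
    """Return (per_index_model, global_model) worst-case memory in MB."""
    concurrent_indexes = tables * indexes_per_table
    # Per-index limit, as described for SASI: every index being compacted
    # may use up to its own budget, so the worst case grows with the count.
    per_index_model = concurrent_indexes * per_index_limit_mb
    # Global limit, as described for SAI: one cap shared by all indexes.
    global_model = global_limit_mb
    return per_index_model, global_model


# Hypothetical limits: 64 MB per index, 1024 MB global.
per_index_mb, global_mb = worst_case_compaction_memory_mb(10, 10, 64, 1024)
print(per_index_mb, global_mb)  # 6400 1024
```

The point of the comparison is only that the per-index worst case scales with the number of indexes under compaction, while the global cap does not.
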
>>>>>>>>>>> 
>>>>>>>>>>> I understand the desire in keeping out of scope the longer term
>>>>>>>>>>> deprecation and migration plan, but… if SASI provides functionality
>>>>>>>>>>> that SAI doesn't, like tokenisation and DelimiterAnalyzer, yet
>>>>>>>>>>> introduces a body of code ~somewhat similar, shouldn't we be roughly
>>>>>>>>>>> sketching out how to reduce the maintenance surface area?
>>>>>>>>>>> 
>>>>>>>>>>> Agreed that we should reduce maintenance area if possible, but only
>>>>>>>>>>> very limited code base (eg. RangeIterator, QueryPlan) can be shared.
>>>>>>>>>>> The rest of the code base is quite different because of on-disk
>>>>>>>>>>> format and cross-index files.
>>>>>>>>>>> 
>>>>>>>>>>> The goal of this CEP is to get community buy-in on SAI's design.
>>>>>>>>>>> Tokenization, DelimiterAnalyzer should be straightforward to
>>>>>>>>>>> implement on top of SAI.
>>>>>>>>>>> 
>>>>>>>>>>> Can we list what configurations of SASI will become deprecated once
>>>>>>>>>>> SAI becomes non-experimental?
>>>>>>>>>>> 
>>>>>>>>>>> Except for "Like", "Tokenisation", "DelimiterAnalyzer", the rest of
>>>>>>>>>>> SASI can be replaced by SAI.
>>>>>>>>>>> 
>>>>>>>>>>> Given a few bugs are open against 2i and SASI, can we provide some
>>>>>>>>>>> overview, or rough indication, of how many of them we could "triage
>>>>>>>>>>> away"?
>>>>>>>>>>> 
>>>>>>>>>>> I believe most of the known bugs in 2i/SASI either have been
>>>>>>>>>>> addressed in SAI or don't apply to SAI.
>>>>>>>>>>> 
>>>>>>>>>>> And, is it time for the project to start introducing new SPI
>>>>>>>>>>> implementations as separate sub-modules and jar files that are only
>>>>>>>>>>> loaded at runtime based on configuration settings? (sorry for the
>>>>>>>>>>> conflation on this one, but maybe it's the right time to raise it
>>>>>>>>>>> :shrug:)
>>>>>>>>>>> 
>>>>>>>>>>> Agreed that modularization is the way to go and will speed up module
>>>>>>>>>>> development speed.
>>>>>>>>>>> 
>>>>>>>>>>> Does community plan to open another discussion or CEP on
>>>>>>>>>>> modularization?
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, 24 Aug 2020 at 16:43, Mick Semb Wever <m...@apache.org> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Adding to Duy's questions…
>>>>>>>>>>> 
>>>>>>>>>>> * Hardware specs
>>>>>>>>>>> 
>>>>>>>>>>> SASI's performance, specifically the search in the B+ tree
>>>>>>>>>>> component, depends a lot on the component file's header being
>>>>>>>>>>> available in the pagecache. SASI benefits from (needs) nodes with
>>>>>>>>>>> lots of RAM. Is SAI bound to this same or similar limitation?
>>>>>>>>>>> 
>>>>>>>>>>> Flushing of SASI can be CPU+IO intensive, to the
>>>>>>>>>>> 
>>>>>>>>>>> point of
>>>>>>>>>>> 
>>>>>>>>>>> saturation,
>>>>>>>>>>> 
>>>>>>>>>>> pauses, and crashes on the node. SSDs are a must,
>>>>>>>>>>> 
>>>>>>>>>>> along
>>>>>>>>>>> 
>>>>>>>>>>> with a
>>>>>>>>>>> 
>>>>>>>>>>> bit
>>>>>>>>>>> 
>>>>>>>>>>> of
>>>>>>>>>>> 
>>>>>>>>>>> tuning, just to avoid bringing down your cluster.
>>>>>>>>>>> 
>>>>>>>>>>> Beyond
>>>>>>>>>>> 
>>>>>>>>>>> reducing
>>>>>>>>>>> 
>>>>>>>>>>> space
>>>>>>>>>>> 
>>>>>>>>>>> requirements, does SAI improve on these things? Like
>>>>>>>>>>> 
>>>>>>>>>>> SASI
>>>>>>>>>>> 
>>>>>>>>>>> how
>>>>>>>>>>> 
>>>>>>>>>>> does
>>>>>>>>>>> 
>>>>>>>>>>> SAI,
>>>>>>>>>>> 
>>>>>>>>>>> in
>>>>>>>>>>> 
>>>>>>>>>>> its own way, change/narrow the recommendations on
>>>>>>>>>>> 
>>>>>>>>>>> node
>>>>>>>>>>> 
>>>>>>>>>>> hardware
>>>>>>>>>>> 
>>>>>>>>>>> specs?
>>>>>>>>>>> 
>>>>>>>>>>> * Code Maintenance
>>>>>>>>>>> 
>>>>>>>>>>> I understand the desire in keeping out of scope the longer term
>>>>>>>>>>> deprecation and migration plan, but… if SASI provides
>>>>>>>>>>> functionality that SAI doesn't, like tokenisation and
>>>>>>>>>>> DelimiterAnalyzer, yet introduces a body of code ~somewhat
>>>>>>>>>>> similar, shouldn't we be roughly sketching out how to reduce
>>>>>>>>>>> the maintenance surface area?
>>>>>>>>>>> 
>>>>>>>>>>> Can we list what configurations of SASI will become deprecated
>>>>>>>>>>> once SAI becomes non-experimental?
>>>>>>>>>>> 
>>>>>>>>>>> Given a few bugs are open against 2i and SASI, can we provide
>>>>>>>>>>> some overview, or rough indication, of how many of them we
>>>>>>>>>>> could "triage away"?
>>>>>>>>>>> 
>>>>>>>>>>> And, is it time for the project to start introducing new SPI
>>>>>>>>>>> implementations as separate sub-modules and jar files that are
>>>>>>>>>>> only loaded at runtime based on configuration settings? (sorry
>>>>>>>>>>> for the conflation on this one, but maybe it's the right time
>>>>>>>>>>> to raise it :shrug:)
>>>>>>>>>>> 
>>>>>>>>>>> regards,
>>>>>>>>>>> 
>>>>>>>>>>> Mick
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, 18 Aug 2020 at 13:05, DuyHai Doan <doanduy...@gmail.com <mailto:doanduy...@gmail.com>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Thank you Zhao Yang for starting this topic
>>>>>>>>>>> 
>>>>>>>>>>> After reading the short design doc, I have a few questions
>>>>>>>>>>> 
>>>>>>>>>>> 1) SASI was pretty inefficient indexing wide partitions because
>>>>>>>>>>> the index structure only retains the partition token, not the
>>>>>>>>>>> clustering columns. As per the design doc, SAI has row id
>>>>>>>>>>> mapping to partition offset; can we hope that indexing wide
>>>>>>>>>>> partitions will be more efficient with SAI? One detail that
>>>>>>>>>>> worries me is that at the beginning of the design doc, it is
>>>>>>>>>>> said that the matching rows are post-filtered while scanning
>>>>>>>>>>> the partition. Can you confirm or deny that SAI is efficient
>>>>>>>>>>> with wide partitions and provides the partition offsets to the
>>>>>>>>>>> matching rows?
>>>>>>>>>>> 
>>>>>>>>>>> 2) About space efficiency, one of the biggest drawbacks of SASI
>>>>>>>>>>> was the huge space required for the index structure when using
>>>>>>>>>>> CONTAINS logic because of the decomposition of text columns
>>>>>>>>>>> into n-grams. Will SAI suffer from the same issue in future
>>>>>>>>>>> iterations? I'm anticipating a bit
>>>>>>>>>>> 
>>>>>>>>>>> 3) If I'm querying using SAI and providing the complete
>>>>>>>>>>> partition key, will it be more efficient than querying without
>>>>>>>>>>> the partition key? In other words, does SAI provide any
>>>>>>>>>>> optimisation when the partition key is specified?
>>>>>>>>>>> 
>>>>>>>>>>> Regards
>>>>>>>>>>> 
>>>>>>>>>>> Duy Hai DOAN
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, 18 Aug 2020 at 11:39, Mick Semb Wever <m...@apache.org <mailto:m...@apache.org>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> We are looking forward to the community's feedback and
>>>>>>>>>>> suggestions.
>>>>>>>>>>> 
>>>>>>>>>>> What comes immediately to mind is testing requirements. It has
>>>>>>>>>>> been mentioned already that the project's testability and QA
>>>>>>>>>>> guidelines are inadequate to successfully introduce new
>>>>>>>>>>> features and refactorings to the codebase. During the 4.0 beta
>>>>>>>>>>> phase this was intended to be addressed, i.e. defining more
>>>>>>>>>>> specific QA guidelines for 4.0-rc. This would be an important
>>>>>>>>>>> step towards QA guidelines for all changes and CEPs post-4.0.
>>>>>>>>>>> 
>>>>>>>>>>> Questions from me
>>>>>>>>>>> 
>>>>>>>>>>> - How will this be tested, how will its QA status and lifecycle
>>>>>>>>>>>   be defined? (per above)
>>>>>>>>>>> 
>>>>>>>>>>> - With existing C* code needing to be changed, what is the
>>>>>>>>>>>   proposed plan for making those changes ensuring maintained
>>>>>>>>>>>   QA, e.g. is there separate QA cycles planned for altering the
>>>>>>>>>>>   SPI before adding a new SPI implementation?
>>>>>>>>>>> 
>>>>>>>>>>> - Despite being out of scope, it would be nice to have some
>>>>>>>>>>>   idea from the CEP author of when users might still choose
>>>>>>>>>>>   afresh 2i or SASI over SAI,
>>>>>>>>>>> 
>>>>>>>>>>> - Who fills the roles involved? Who are the contributors in
>>>>>>>>>>>   this DataStax team? Who is the shepherd? Are there other
>>>>>>>>>>>   stakeholders willing to be involved?
>>>>>>>>>>> 
>>>>>>>>>>> - Is there a preference to use gdoc instead of the project's
>>>>>>>>>>>   wiki, and why? (the CEP process suggests a wiki page, and
>>>>>>>>>>>   feedback on why another approach is considered better helps
>>>>>>>>>>>   evolve the CEP process itself)
>>>>>>>>>>> 
>>>>>>>>>>> cheers,
>>>>>>>>>>> 
>>>>>>>>>>> Mick
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org <mailto:dev-unsubscr...@cassandra.apache.org>
>>>>>>>>>>> For additional commands, e-mail: dev-h...@cassandra.apache.org <mailto:dev-h...@cassandra.apache.org>
>>>>>>>>>>> --
>>>>>>>>>>> alex p
>>>>>>>> --
>>>>>>>> alex p
> 
> 
> 
> -- 
> Henrik Ingo
