Re: [DISCUSS] CEP-7 Storage Attached Index

2022-02-16 Thread Caleb Rackliffe
Thanks, Mike.

Are there any other concerns we should address before we move to a vote?

On Wed, Feb 16, 2022 at 5:25 AM Mike Adamson  wrote:

> I have updated the CEP to reflect the recent discussions.
>
> OR support has moved out of version 1 support. Index versioning and
> virtual table support are now covered in the Addenda.
>
> MikeA
>
> On 14 Feb 2022, at 15:35, Caleb Rackliffe 
> wrote:
>
> Agreed there’s no reason to pull it out. I was just wondering what state
> it was in, given I didn’t see it mentioned in the CEP.
>
> On Feb 14, 2022, at 8:12 AM, Mike Adamson  wrote:
>
> > We don't need a whole "codec framework" for V1, but we're still
> embedding some versioning information in the column index on-disk
> structures, right?
>
> I’m not sure why we would want to pull the versioning code only to have to
> put it back in as soon as we need to change the on-disk format. We also
> need to consider whether the legacy format used by DSE is supported in OSS.
> I’m not sure of the policy on this although I strongly suspect that the
> answer is that it won’t be supported. Either way, it would seem to be a lot
> of work to pull the versioning code out at this point since it formed part
> of a major refactor of the SAI framework and plumbing.
>
> MikeA
>
> On 11 Feb 2022, at 18:47, Caleb Rackliffe 
> wrote:
>
> Just finished reading the latest version of the CEP. Here are my thoughts:
>
> - We've already talked about OR queries, so I won't rehash that, but
> tokenization support seems like it might be another one of those places
> where we can cut scope if we want to get V1 out the door. It shouldn't be
> that hard to detangle from the rest of the code.
> - We mention the JMX metric ecosystem in the CEP, but not the related
> virtual tables. This isn't a big issue, and doesn't mean we need to change
> the CEP, but it might be helpful for those not familiar with the existing
> prototype to know they exist :)
> - It's probably below the line for CEP discussion, but the text and
> numeric index formats will probably change over time. We don't need a whole
> "codec framework" for V1, but we're still embedding some versioning
> information in the column index on-disk structures, right?
>
> To offset my obvious partiality around this CEP, I've already made an
> effort to raise some of the issues that may come up to challenge us from a
> macro perspective. It seems like the prevailing opinion here is that they
> are either surmountable or simply basic conceptual difficulties w/
> distributed secondary indexing.
>
> tl;dr I'm +1 on bringing this to a vote and starting to put together all
> the pieces for CASSANDRA-16052
> <https://issues.apache.org/jira/browse/CASSANDRA-16052> :)
>
> On Thu, Feb 10, 2022 at 11:26 AM Mike Adamson 
> wrote:
>
>> > I'd be interested to hear from Mike/Jason on the OR support topic, of
>> course.
>>
>> The support for OR within SAI is fairly minimal and will not work without
>> the non-SAI changes needed. Since the non-SAI OR changes are extensive it
>> would be better to bring those in under their own CEP.
>>
>> I’d leave the decision of whether to put the rest of SAI behind an
>> experimental flag to others. My preference would be to not do so because
>> the non-OR implementation has been tested and used on production for over a
>> year now.
>>
>> MikeA
>>
>> On 9 Feb 2022, at 13:06, bened...@apache.org wrote:
>>
>> > Is there some mechanism such as experimental flags, which would allow
>> the SAI-only OR support to be merged into trunk
>>
>> FWIW, I’m OK with this merging to trunk, either hidden behind a CI-only
>> flag or exposed to the user via some experimental flag (and a suitable
>> NEWS.txt). We’ve discussed the need to periodically merge feature branches
>> with trunk before they are complete. If the work is logically complete for
>> SAI, and we’re only pending work to make OR consistent between SAI and
>> non-SAI queries, I think that more than meets this criterion.
>>
>>
>>
>> *From: *Henrik Ingo 
>> *Date: *Monday, 7 February 2022 at 12:03
>> *To: *dev@cassandra.apache.org 
>> *Subject: *Re: [DISCUSS] CEP-7 Storage Attached Index
>> Thanks Benjamin for reviewing and raising this.
>>
>> While I don't speak for the CEP authors, just some thoughts from me:
>>
>> On Mon, Feb 7, 2022 at 11:18 AM Benjamin Lerer  wrote:
>>
>> I would like to raise 2 points regarding the current CEP proposal:
>>
>> 1. There is mention of some target versions and of the removal of SASI
>>
>> At this point, we have not agreed on any v

Re: [DISCUSS] CEP-7 Storage Attached Index

2022-02-16 Thread Mike Adamson
I have updated the CEP to reflect the recent discussions.

OR support has moved out of version 1 support. Index versioning and virtual 
table support are now covered in the Addenda.

MikeA

> On 14 Feb 2022, at 15:35, Caleb Rackliffe  wrote:
> 
> Agreed there’s no reason to pull it out. I was just wondering what state it 
> was in, given I didn’t see it mentioned in the CEP.
> 
>> On Feb 14, 2022, at 8:12 AM, Mike Adamson  wrote:
>> 
>> > We don't need a whole "codec framework" for V1, but we're still embedding 
>> some versioning information in the column index on-disk structures, right?
>> 
>> I’m not sure why we would want to pull the versioning code only to have to 
>> put it back in as soon as we need to change the on-disk format. We also need 
>> to consider whether the legacy format used by DSE is supported in OSS. I’m 
>> not sure of the policy on this although I strongly suspect that the answer 
>> is that it won’t be supported. Either way, it would seem to be a lot of work 
>> to pull the versioning code out at this point since it formed part of a 
>> major refactor of the SAI framework and plumbing.
>> 
>> MikeA
>> 
>>> On 11 Feb 2022, at 18:47, Caleb Rackliffe <calebrackli...@gmail.com> wrote:
>>> 
>>> Just finished reading the latest version of the CEP. Here are my thoughts:
>>> 
>>> - We've already talked about OR queries, so I won't rehash that, but 
>>> tokenization support seems like it might be another one of those places 
>>> where we can cut scope if we want to get V1 out the door. It shouldn't be 
>>> that hard to detangle from the rest of the code.
>>> - We mention the JMX metric ecosystem in the CEP, but not the related 
>>> virtual tables. This isn't a big issue, and doesn't mean we need to change 
>>> the CEP, but it might be helpful for those not familiar with the existing 
>>> prototype to know they exist :)
>>> - It's probably below the line for CEP discussion, but the text and numeric 
>>> index formats will probably change over time. We don't need a whole "codec 
>>> framework" for V1, but we're still embedding some versioning information in 
>>> the column index on-disk structures, right?
>>> 
>>> To offset my obvious partiality around this CEP, I've already made an 
>>> effort to raise some of the issues that may come up to challenge us from a 
>>> macro perspective. It seems like the prevailing opinion here is that they 
>>> are either surmountable or simply basic conceptual difficulties w/ 
>>> distributed secondary indexing.
>>> 
>>> tl;dr I'm +1 on bringing this to a vote and starting to put together all 
>>> the pieces for CASSANDRA-16052 
>>> <https://issues.apache.org/jira/browse/CASSANDRA-16052> :)
>>> 
>>> On Thu, Feb 10, 2022 at 11:26 AM Mike Adamson <madam...@datastax.com> wrote:
>>> > I'd be interested to hear from Mike/Jason on the OR support topic, of 
>>> > course.
>>> 
>>> The support for OR within SAI is fairly minimal and will not work without 
>>> the non-SAI changes needed. Since the non-SAI OR changes are extensive it 
>>> would be better to bring those in under their own CEP. 
>>> 
>>> I’d leave the decision of whether to put the rest of SAI behind an 
>>> experimental flag to others. My preference would be to not do so because 
>>> the non-OR implementation has been tested and used on production for over a 
>>> year now.
>>> 
>>> MikeA
>>> 
>>>> On 9 Feb 2022, at 13:06, bened...@apache.org wrote:
>>>> 
>>>> > Is there some mechanism such as experimental flags, which would allow 
>>>> > the SAI-only OR support to be merged into trunk
>>>>  
>>>> FWIW, I’m OK with this merging to trunk, either hidden behind a CI-only 
>>>> flag or exposed to the user via some experimental flag (and a suitable 
>>>> NEWS.txt). We’ve discussed the need to periodically merge feature branches 
>>>> with trunk before they are complete. If the work is logically complete for 
>>>> SAI, and we’re only pending work to make OR consistent between SAI and 
>>>> non-SAI queries, I think that more than meets this criterion.
>>>>  
>>>>  
>>>> From: Henrik Ingo <henrik.i...@datastax.com>
>>>> Date: Monday, 7 February 2022 at 12:03
>>>> To: dev@cas

Re: [DISCUSS] CEP-7 Storage Attached Index

2022-02-14 Thread Henrik Ingo
On Fri, Feb 11, 2022 at 8:47 PM Caleb Rackliffe 
wrote:

> Just finished reading the latest version of the CEP. Here are my thoughts:
>
> - We've already talked about OR queries, so I won't rehash that, but
> tokenization support seems like it might be another one of those places
> where we can cut scope if we want to get V1 out the door. It shouldn't be
> that hard to detangle from the rest of the code.
>

The tokenization support is already implemented. It's available in our
public fork, but at least the last time I was involved, there wasn't really
any public documentation. Lucene comes with dozens of tokenizers, so the
documentation effort will be significant.

So the situation is similar to OR: The community may want to break out a
separate CEP to debate the user facing syntax. Alternatively, this can
simply happen as part of the PR that could be submitted as soon as CEP-7 is
approved.



> - We mention the JMX metric ecosystem in the CEP, but not the related
> virtual tables. This isn't a big issue, and doesn't mean we need to change
> the CEP, but it might be helpful for those not familiar with the existing
> prototype to know they exist :)
>

Thanks for the callout. Maybe they should indeed be mentioned together.


> - It's probably below the line for CEP discussion, but the text and
> numeric index formats will probably change over time. We don't need a whole
> "codec framework" for V1, but we're still embedding some versioning
> information in the column index on-disk structures, right?
>
>
On the contrary, this is a very valid question. As you know, SAI has been GA
for over a year in both our DSE and Astra products, and what CEP-7 describes
for inclusion in Cassandra is known to the SAI team as V2. (But to be clear,
it's named V1 in the CEP and in the context of Cassandra!) So the code does
contain facilities to support multiple generations of index formats. If an
sstable of the older version is encountered, the relevant code is used to
read its index files. Upon compaction, the newer version is written. There
also needs to be some kind of global check so that new features become
available only once all sstables cluster-wide are of the required version.
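
To make the three behaviours above concrete (a version-specific reader, 
compaction writing the newer format, and a cluster-wide gate), here is a 
minimal Java sketch. It is not taken from the SAI code base: the class 
IndexVersionSketch, the Version names OLDER/NEWER, and the methods readerFor, 
versionWrittenByCompaction and featureAvailable are all hypothetical, and the 
cluster state is assumed to be available as a simple map of node name to the 
index versions found on that node.

```java
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from SAI.
class IndexVersionSketch {
    // Format generations, declared oldest first so compareTo() orders by age.
    enum Version { OLDER, NEWER }

    interface IndexReader {
        String describe();
    }

    // Choose the reading code based on the version recorded with the index files.
    static IndexReader readerFor(Version onDisk) {
        switch (onDisk) {
            case OLDER: return () -> "reader for OLDER format";
            case NEWER: return () -> "reader for NEWER format";
            default: throw new IllegalArgumentException("Unknown version: " + onDisk);
        }
    }

    // Compaction rewrites index files in the newest known format.
    static Version versionWrittenByCompaction() {
        return Version.NEWER;
    }

    // A feature is enabled only when every index on every node is at least
    // at the required version (the "global check" mentioned above).
    static boolean featureAvailable(Map<String, List<Version>> versionsByNode,
                                    Version required) {
        return versionsByNode.values().stream()
                .flatMap(List::stream)
                .allMatch(v -> v.compareTo(required) >= 0);
    }

    public static void main(String[] args) {
        Map<String, List<Version>> cluster = Map.of(
                "node1", List.of(Version.NEWER, Version.OLDER),
                "node2", List.of(Version.NEWER));
        System.out.println(readerFor(Version.OLDER).describe());
        System.out.println(featureAvailable(cluster, Version.NEWER));
    }
}
```

In this sketch a single older sstable anywhere in the cluster keeps the gate 
closed until compaction has rewritten it.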


> To offset my obvious partiality around this CEP, I've already made an
> effort to raise some of the issues that may come up to challenge us from a
> macro perspective. It seems like the prevailing opinion here is that they
> are either surmountable or simply basic conceptual difficulties w/
> distributed secondary indexing.
>
>
This might be a good moment to say that we really appreciate your
investment and support in this CEP!

henrik


Re: [DISCUSS] CEP-7 Storage Attached Index

2022-02-14 Thread Caleb Rackliffe
Agreed there’s no reason to pull it out. I was just wondering what state it was 
in, given I didn’t see it mentioned in the CEP.

> On Feb 14, 2022, at 8:12 AM, Mike Adamson  wrote:
> 
> > We don't need a whole "codec framework" for V1, but we're still embedding 
> some versioning information in the column index on-disk structures, right?
> 
> I’m not sure why we would want to pull the versioning code only to have to 
> put it back in as soon as we need to change the on-disk format. We also need 
> to consider whether the legacy format used by DSE is supported in OSS. I’m 
> not sure of the policy on this although I strongly suspect that the answer is 
> that it won’t be supported. Either way, it would seem to be a lot of work to 
> pull the versioning code out at this point since it formed part of a major 
> refactor of the SAI framework and plumbing.
> 
> MikeA
> 
>> On 11 Feb 2022, at 18:47, Caleb Rackliffe  wrote:
>> 
>> Just finished reading the latest version of the CEP. Here are my thoughts:
>> 
>> - We've already talked about OR queries, so I won't rehash that, but 
>> tokenization support seems like it might be another one of those places 
>> where we can cut scope if we want to get V1 out the door. It shouldn't be 
>> that hard to detangle from the rest of the code.
>> - We mention the JMX metric ecosystem in the CEP, but not the related 
>> virtual tables. This isn't a big issue, and doesn't mean we need to change 
>> the CEP, but it might be helpful for those not familiar with the existing 
>> prototype to know they exist :)
>> - It's probably below the line for CEP discussion, but the text and numeric 
>> index formats will probably change over time. We don't need a whole "codec 
>> framework" for V1, but we're still embedding some versioning information in 
>> the column index on-disk structures, right?
>> 
>> To offset my obvious partiality around this CEP, I've already made an effort 
>> to raise some of the issues that may come up to challenge us from a macro 
>> perspective. It seems like the prevailing opinion here is that they are 
>> either surmountable or simply basic conceptual difficulties w/ distributed 
>> secondary indexing.
>> 
>> tl;dr I'm +1 on bringing this to a vote and starting to put together all the 
>> pieces for CASSANDRA-16052 :)
>> 
>>> On Thu, Feb 10, 2022 at 11:26 AM Mike Adamson  wrote:
>>> > I'd be interested to hear from Mike/Jason on the OR support topic, of 
>>> > course.
>>> 
>>> The support for OR within SAI is fairly minimal and will not work without 
>>> the non-SAI changes needed. Since the non-SAI OR changes are extensive it 
>>> would be better to bring those in under their own CEP. 
>>> 
>>> I’d leave the decision of whether to put the rest of SAI behind an 
>>> experimental flag to others. My preference would be to not do so because 
>>> the non-OR implementation has been tested and used on production for over a 
>>> year now.
>>> 
>>> MikeA
>>> 
>>>> On 9 Feb 2022, at 13:06, bened...@apache.org wrote:
>>>> 
>>>> > Is there some mechanism such as experimental flags, which would allow 
>>>> > the SAI-only OR support to be merged into trunk
>>>>  
>>>> FWIW, I’m OK with this merging to trunk, either hidden behind a CI-only 
>>>> flag or exposed to the user via some experimental flag (and a suitable 
>>>> NEWS.txt). We’ve discussed the need to periodically merge feature branches 
>>>> with trunk before they are complete. If the work is logically complete for 
>>>> SAI, and we’re only pending work to make OR consistent between SAI and 
>>>> non-SAI queries, I think that more than meets this criterion.
>>>>  
>>>>  
>>>> From: Henrik Ingo 
>>>> Date: Monday, 7 February 2022 at 12:03
>>>> To: dev@cassandra.apache.org 
>>>> Subject: Re: [DISCUSS] CEP-7 Storage Attached Index
>>>> 
>>>> Thanks Benjamin for reviewing and raising this.
>>>>  
>>>> While I don't speak for the CEP authors, just some thoughts from me:
>>>>  
>>>> On Mon, Feb 7, 2022 at 11:18 AM Benjamin Lerer  wrote:
>>>> I would like to raise 2 points regarding the current CEP proposal:
>>>>  
>>>> 1. There is mention of some target versions and of the removal of SASI 
>>>>  
>>>> At this point, we have not agreed on any version numbers and I do not feel 
>>>> that removing SASI should be

Re: [DISCUSS] CEP-7 Storage Attached Index

2022-02-14 Thread Mike Adamson
> We don't need a whole "codec framework" for V1, but we're still embedding 
> some versioning information in the column index on-disk structures, right?

I’m not sure why we would want to pull the versioning code only to have to put 
it back in as soon as we need to change the on-disk format. We also need to 
consider whether the legacy format used by DSE is supported in OSS. I’m not 
sure of the policy on this although I strongly suspect that the answer is that 
it won’t be supported. Either way, it would seem to be a lot of work to pull 
the versioning code out at this point since it formed part of a major refactor 
of the SAI framework and plumbing.

MikeA

> On 11 Feb 2022, at 18:47, Caleb Rackliffe  wrote:
> 
> Just finished reading the latest version of the CEP. Here are my thoughts:
> 
> - We've already talked about OR queries, so I won't rehash that, but 
> tokenization support seems like it might be another one of those places where 
> we can cut scope if we want to get V1 out the door. It shouldn't be that hard 
> to detangle from the rest of the code.
> - We mention the JMX metric ecosystem in the CEP, but not the related virtual 
> tables. This isn't a big issue, and doesn't mean we need to change the CEP, 
> but it might be helpful for those not familiar with the existing prototype to 
> know they exist :)
> - It's probably below the line for CEP discussion, but the text and numeric 
> index formats will probably change over time. We don't need a whole "codec 
> framework" for V1, but we're still embedding some versioning information in 
> the column index on-disk structures, right?
> 
> To offset my obvious partiality around this CEP, I've already made an effort 
> to raise some of the issues that may come up to challenge us from a macro 
> perspective. It seems like the prevailing opinion here is that they are 
> either surmountable or simply basic conceptual difficulties w/ distributed 
> secondary indexing.
> 
> tl;dr I'm +1 on bringing this to a vote and starting to put together all the 
> pieces for CASSANDRA-16052 
> <https://issues.apache.org/jira/browse/CASSANDRA-16052> :)
> 
> On Thu, Feb 10, 2022 at 11:26 AM Mike Adamson <madam...@datastax.com> wrote:
> > I'd be interested to hear from Mike/Jason on the OR support topic, of 
> > course.
> 
> The support for OR within SAI is fairly minimal and will not work without the 
> non-SAI changes needed. Since the non-SAI OR changes are extensive it would 
> be better to bring those in under their own CEP. 
> 
> I’d leave the decision of whether to put the rest of SAI behind an 
> experimental flag to others. My preference would be to not do so because the 
> non-OR implementation has been tested and used on production for over a year 
> now.
> 
> MikeA
> 
>> On 9 Feb 2022, at 13:06, bened...@apache.org wrote:
>> 
>> > Is there some mechanism such as experimental flags, which would allow the 
>> > SAI-only OR support to be merged into trunk
>>  
>> FWIW, I’m OK with this merging to trunk, either hidden behind a CI-only flag 
>> or exposed to the user via some experimental flag (and a suitable NEWS.txt). 
>> We’ve discussed the need to periodically merge feature branches with trunk 
>> before they are complete. If the work is logically complete for SAI, and 
>> we’re only pending work to make OR consistent between SAI and non-SAI 
>> queries, I think that more than meets this criterion.
>>  
>>  
>> From: Henrik Ingo <henrik.i...@datastax.com>
>> Date: Monday, 7 February 2022 at 12:03
>> To: dev@cassandra.apache.org
>> Subject: Re: [DISCUSS] CEP-7 Storage Attached Index
>> 
>> Thanks Benjamin for reviewing and raising this.
>>  
>> While I don't speak for the CEP authors, just some thoughts from me:
>>  
>> On Mon, Feb 7, 2022 at 11:18 AM Benjamin Lerer <ble...@apache.org> wrote:
>> I would like to raise 2 points regarding the current CEP proposal:
>>  
>> 1. There is mention of some target versions and of the removal of SASI 
>>  
>> At this point, we have not agreed on any version numbers and I do not feel 
>> that removing SASI should be part of the proposal for now.
>> It seems to me that we should see first the adoption surrounding SAI before 
>> talking about deprecating other solutions.
>>  
>>  
>> This seems rather uncontroversial. I think the CEP template and previous 
>> CEPs invite the discussion on whether the new feature will or may replace 
>> a

Re: [DISCUSS] CEP-7 Storage Attached Index

2022-02-11 Thread Caleb Rackliffe
Just finished reading the latest version of the CEP. Here are my thoughts:

- We've already talked about OR queries, so I won't rehash that, but
tokenization support seems like it might be another one of those places
where we can cut scope if we want to get V1 out the door. It shouldn't be
that hard to detangle from the rest of the code.
- We mention the JMX metric ecosystem in the CEP, but not the related
virtual tables. This isn't a big issue, and doesn't mean we need to change
the CEP, but it might be helpful for those not familiar with the existing
prototype to know they exist :)
- It's probably below the line for CEP discussion, but the text and numeric
index formats will probably change over time. We don't need a whole "codec
framework" for V1, but we're still embedding some versioning information in
the column index on-disk structures, right?

To offset my obvious partiality around this CEP, I've already made an
effort to raise some of the issues that may come up to challenge us from a
macro perspective. It seems like the prevailing opinion here is that they
are either surmountable or simply basic conceptual difficulties w/
distributed secondary indexing.

tl;dr I'm +1 on bringing this to a vote and starting to put together all
the pieces for CASSANDRA-16052
<https://issues.apache.org/jira/browse/CASSANDRA-16052> :)

On Thu, Feb 10, 2022 at 11:26 AM Mike Adamson  wrote:

> > I'd be interested to hear from Mike/Jason on the OR support topic, of
> course.
>
> The support for OR within SAI is fairly minimal and will not work without
> the non-SAI changes needed. Since the non-SAI OR changes are extensive it
> would be better to bring those in under their own CEP.
>
> I’d leave the decision of whether to put the rest of SAI behind an
> experimental flag to others. My preference would be to not do so because
> the non-OR implementation has been tested and used on production for over a
> year now.
>
> MikeA
>
> On 9 Feb 2022, at 13:06, bened...@apache.org wrote:
>
> > Is there some mechanism such as experimental flags, which would allow
> the SAI-only OR support to be merged into trunk
>
> FWIW, I’m OK with this merging to trunk, either hidden behind a CI-only
> flag or exposed to the user via some experimental flag (and a suitable
> NEWS.txt). We’ve discussed the need to periodically merge feature branches
> with trunk before they are complete. If the work is logically complete for
> SAI, and we’re only pending work to make OR consistent between SAI and
> non-SAI queries, I think that more than meets this criterion.
>
>
>
> *From: *Henrik Ingo 
> *Date: *Monday, 7 February 2022 at 12:03
> *To: *dev@cassandra.apache.org 
> *Subject: *Re: [DISCUSS] CEP-7 Storage Attached Index
> Thanks Benjamin for reviewing and raising this.
>
> While I don't speak for the CEP authors, just some thoughts from me:
>
> On Mon, Feb 7, 2022 at 11:18 AM Benjamin Lerer  wrote:
>
> I would like to raise 2 points regarding the current CEP proposal:
>
> 1. There is mention of some target versions and of the removal of SASI
>
> At this point, we have not agreed on any version numbers and I do not feel
> that removing SASI should be part of the proposal for now.
> It seems to me that we should see first the adoption surrounding SAI
> before talking about deprecating other solutions.
>
>
>
> This seems rather uncontroversial. I think the CEP template and previous
> CEPs invite the discussion on whether the new feature will or may replace
> an existing feature. But at the same time that's of course out of scope for
> the work at hand. I have no opinion one way or the other myself.
>
>
>
> 2. OR queries
>
> It is unclear to me if the proposal is about adding OR support only for
> SAI index or for other types of queries too.
> In the past, we had the nasty habit of shipping only partially implemented
> CQL features, which resulted in a bad user experience.
> Some examples are:
> * LIKE restrictions, which were introduced for the needs of SASI and were
> never supported for other types of queries
> * IS NOT NULL restrictions for MATERIALIZED VIEWS that are not supported
> elsewhere
> * != operator only supported for conditional inserts or updates
> And there are unfortunately many more.
>
> We are currently slowly trying to fix those issues and make CQL a more
> mature language. As a consequence, I would like us to change our way of
> doing things. If we introduce support for OR, it should also cover all the
> other types of queries and be fully tested.
> I also believe that it is a feature that due to its complexity fully
> deserves its own CEP.
>
>
>
> The current code that would be submitted for review after the CEP is
> adopted, contains OR support beyond just

Re: [DISCUSS] CEP-7 Storage Attached Index

2022-02-10 Thread Mike Adamson
> I'd be interested to hear from Mike/Jason on the OR support topic, of course.

The support for OR within SAI is fairly minimal and will not work without the 
non-SAI changes needed. Since the non-SAI OR changes are extensive it would be 
better to bring those in under their own CEP. 

I’d leave the decision of whether to put the rest of SAI behind an experimental 
flag to others. My preference would be to not do so because the non-OR 
implementation has been tested and used on production for over a year now.

MikeA

> On 9 Feb 2022, at 13:06, bened...@apache.org wrote:
> 
> > Is there some mechanism such as experimental flags, which would allow the 
> > SAI-only OR support to be merged into trunk
>  
> FWIW, I’m OK with this merging to trunk, either hidden behind a CI-only flag 
> or exposed to the user via some experimental flag (and a suitable NEWS.txt). 
> We’ve discussed the need to periodically merge feature branches with trunk 
> before they are complete. If the work is logically complete for SAI, and 
> we’re only pending work to make OR consistent between SAI and non-SAI 
> queries, I think that more than meets this criterion.
>  
>  
> From: Henrik Ingo <henrik.i...@datastax.com>
> Date: Monday, 7 February 2022 at 12:03
> To: dev@cassandra.apache.org
> Subject: Re: [DISCUSS] CEP-7 Storage Attached Index
> 
> Thanks Benjamin for reviewing and raising this.
>  
> While I don't speak for the CEP authors, just some thoughts from me:
>  
> On Mon, Feb 7, 2022 at 11:18 AM Benjamin Lerer <ble...@apache.org> wrote:
> I would like to raise 2 points regarding the current CEP proposal:
>  
> 1. There is mention of some target versions and of the removal of SASI 
>  
> At this point, we have not agreed on any version numbers and I do not feel 
> that removing SASI should be part of the proposal for now.
> It seems to me that we should see first the adoption surrounding SAI before 
> talking about deprecating other solutions.
>  
>  
> This seems rather uncontroversial. I think the CEP template and previous CEPs 
> invite the discussion on whether the new feature will or may replace an 
> existing feature. But at the same time that's of course out of scope for the 
> work at hand. I have no opinion one way or the other myself.
>  
>  
> 2. OR queries
>  
> It is unclear to me if the proposal is about adding OR support only for SAI 
> index or for other types of queries too.
> In the past, we had the nasty habit of shipping only partially implemented 
> CQL features, which resulted in a bad user experience.
> Some examples are:
> * LIKE restrictions, which were introduced for the needs of SASI and were 
> never supported for other types of queries
> * IS NOT NULL restrictions for MATERIALIZED VIEWS that are not supported 
> elsewhere
> * != operator only supported for conditional inserts or updates
> And there are unfortunately many more.
>  
> We are currently slowly trying to fix those issues and make CQL a more mature 
> language. As a consequence, I would like us to change our way of doing 
> things. If we introduce support for OR, it should also cover all the other 
> types of queries and be fully tested.
> I also believe that it is a feature that due to its complexity fully deserves 
> its own CEP.
>  
>  
> The current code that would be submitted for review after the CEP is adopted 
> contains OR support beyond just SAI indexes. An initial implementation 
> targeted only queries where all columns in a WHERE clause using OR were 
> backed by an SAI index. This has since been extended to also support ALLOW 
> FILTERING mode as well as OR with clustering key columns. The current 
> implementation is by no means perfect as general-purpose OR support; the 
> focus all along was on implementing OR support in SAI. I'll leave it to 
> others to enumerate exactly the limitations of the current implementation.
>  
> Seeing that Benedict also supports your point of view, I would steer the 
> conversation more toward a project management perspective:
> * How can we advance CEP-7 so that the bulk of the SAI code can still be 
> added to Cassandra, so that users can benefit from this new index type, 
> albeit without OR?
> * This is also an important question because this is a large block of code 
> that will inevitably diverge if it's not in trunk. Also, merging it to 
> trunk will allow future enhancements, including the OR syntax btw, to 
> happen against trunk (aka upstream first).
> * Since OR support nevertheless is a feature of SAI, it needs to be at least 
> unit tested, but ideally even would

Re: [DISCUSS] CEP-7 Storage Attached Index

2022-02-09 Thread bened...@apache.org
> Is there some mechanism such as experimental flags, which would allow the 
> SAI-only OR support to be merged into trunk

FWIW, I’m OK with this merging to trunk, either hidden behind a CI-only flag or 
exposed to the user via some experimental flag (and a suitable NEWS.txt). We’ve 
discussed the need to periodically merge feature branches with trunk before 
they are complete. If the work is logically complete for SAI, and we’re only 
pending work to make OR consistent between SAI and non-SAI queries, I think 
that more than meets this criterion.


From: Henrik Ingo 
Date: Monday, 7 February 2022 at 12:03
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] CEP-7 Storage Attached Index
Thanks Benjamin for reviewing and raising this.

While I don't speak for the CEP authors, just some thoughts from me:

On Mon, Feb 7, 2022 at 11:18 AM Benjamin Lerer <ble...@apache.org> wrote:
I would like to raise 2 points regarding the current CEP proposal:

1. There is mention of some target versions and of the removal of SASI

At this point, we have not agreed on any version numbers and I do not feel that 
removing SASI should be part of the proposal for now.
It seems to me that we should see first the adoption surrounding SAI before 
talking about deprecating other solutions.


This seems rather uncontroversial. I think the CEP template and previous CEPs 
invite the discussion on whether the new feature will or may replace an 
existing feature. But at the same time that's of course out of scope for the 
work at hand. I have no opinion one way or the other myself.


2. OR queries

It is unclear to me if the proposal is about adding OR support only for SAI 
index or for other types of queries too.
In the past, we had the nasty habit of shipping only partially implemented 
CQL features, which resulted in a bad user experience.
Some examples are:
* LIKE restrictions, which were introduced for the needs of SASI and were 
never supported for other types of queries
* IS NOT NULL restrictions for MATERIALIZED VIEWS that are not supported 
elsewhere
* != operator only supported for conditional inserts or updates
And there are unfortunately many more.

We are currently slowly trying to fix those issues and make CQL a more mature 
language. As a consequence, I would like us to change our way of doing things. 
If we introduce support for OR, it should also cover all the other types of 
queries and be fully tested.
I also believe that it is a feature that due to its complexity fully deserves 
its own CEP.


The current code that would be submitted for review after the CEP is adopted, 
contains OR support beyond just SAI indexes. An initial implementation first 
targeted only such queries where all columns in a WHERE clause using OR needed 
to be backed by an SAI index. This was since extended to also support ALLOW 
FILTERING mode as well as OR with clustering key columns. The current 
implementation is by no means perfect as a general purpose OR support, the 
focus all the time was on implementing OR support in SAI. I'll leave it to 
others to enumerate exactly the limitations of the current implementation.

Seeing that also Benedict supports your point of view, I would steer the 
conversation more into a project management perspective:
* How can we advance CEP-7 so that the bulk of the SAI code can still be added 
to Cassandra, so that  users can benefit from this new index type, albeit 
without OR?
* This is also an important question from the point of view that this is a 
large block of code that will inevitably diverge if it's not in trunk. Also, 
merging it to trunk will allow future enhancements, including the OR syntax 
btw, to happen against trunk (aka upstream first).
* Since OR support nevertheless is a feature of SAI, it needs to be at least 
unit tested, but ideally even would be exposed so that it is possible to test 
on the CQL level. Is there some mechanism such as experimental flags, which 
would allow the SAI-only OR support to be merged into trunk, while a separate 
CEP is focused on implementing "proper" general purpose OR support? I should 
note that there is no guarantee that the OR CEP would be implemented in time 
for the next release. So the answer to this point needs to be something that 
doesn't violate the desire for good user experience.

henrik




Re: [DISCUSS] CEP-7 Storage Attached Index

2022-02-08 Thread Caleb Rackliffe
Regarding SASI deprecation and removal, I think I'm on the same page as
Henrik. The grand glorious future involves getting to feature parity with
and then completely replacing legacy 2i and SASI, but the CEP need not
specify a hard timeline for this.

With respect to OR support, I'm actually completely on-board with shipping
a first version of SAI without it. It's possible significant improvements
to the query engine have been made since the last time I worked w/ the SAI
prototype, but disjunctions expose us to some additional risk. They would
absolutely add to our surface area for testing (correctness, performance,
etc.) Doing this in the context of holistic support for OR once SAI is off
the ground and has some traction isn't a bad plan.

I'd be interested to hear from Mike/Jason on the OR support topic, of
course.

On Mon, Feb 7, 2022 at 6:59 AM bened...@apache.org 
wrote:

> I don’t have a strong opinion about CEP-7 taking a hard dependency on any
> new CQL CEP, particularly from a point of view of first landing in the
> codebase.
>
>
>
>
>
> *From: *Henrik Ingo 
> *Date: *Monday, 7 February 2022 at 12:03
> *To: *dev@cassandra.apache.org 
> *Subject: *Re: [DISCUSS] CEP-7 Storage Attached Index
>
> Thanks Benjamin for reviewing and raising this.
>
>
>
> While I don't speak for the CEP authors, just some thoughts from me:
>
>
>
> On Mon, Feb 7, 2022 at 11:18 AM Benjamin Lerer  wrote:
>
> I would like to raise 2 points regarding the current CEP proposal:
>
>
>
> 1. There is mention of some target versions and of the removal of SASI
>
>
>
> At this point, we have not agreed on any version numbers and I do not feel
> that removing SASI should be part of the proposal for now.
>
> It seems to me that we should see first the adoption surrounding SAI
> before talking about deprecating other solutions.
>
>
>
>
>
> This seems rather uncontroversial. I think the CEP template and previous
> CEPs invite  the discussion on whether the new feature will or may replace
> an existing feature. But at the same time that's of course out of scope for
> the work at hand. I have no opinion one way or the other myself.
>
>
>
>
>
> 2. OR queries
>
>
>
> It is unclear to me if the proposal is about adding OR support only for
> SAI index or for other types of queries too.
>
> In the past, we had the nasty habit for CQL to provide only partially
> implemented features which resulted in a bad user experience.
>
> Some examples are:
>
> * LIKE restrictions which were introduced for the need of SASI and were
> never supported for other types of queries
>
> * IS NOT NULL restrictions for MATERIALIZED VIEWS that are not supported
> elsewhere
>
> * != operator only supported for conditional inserts or updates
>
> And there are unfortunately many more.
>
>
>
> We are currently slowly trying to fix those issues and make CQL a more
> mature language. As a consequence, I would like us to change our way of
> doing things. If we introduce support for OR, it should also cover all the
> other types of queries and be fully tested.
>
> I also believe that it is a feature that due to its complexity fully
> deserves its own CEP.
>
>
>
>
>
> The current code that would be submitted for review after the CEP is
> adopted, contains OR support beyond just SAI indexes. An initial
> implementation first targeted only such queries where all columns in a
> WHERE clause using OR needed to be backed by an SAI index. This was since
> extended to also support ALLOW FILTERING mode as well as OR with clustering
> key columns. The current implementation is by no means perfect as a general
> purpose OR support, the focus all the time was on implementing OR support
> in SAI. I'll leave it to others to enumerate exactly the limitations of the
> current implementation.
>
>
>
> Seeing that also Benedict supports your point of view, I would steer the
> conversation more into a project management perspective:
>
> * How can we advance CEP-7 so that the bulk of the SAI code can still be
> added to Cassandra, so that  users can benefit from this new index type,
> albeit without OR?
>
> * This is also an important question from the point of view that this is a
> large block of code that will inevitably diverge if it's not in trunk.
> Also, merging it to trunk will allow future enhancements, including the OR
> syntax btw, to happen against trunk (aka upstream first).
>
> * Since OR support nevertheless is a feature of SAI, it needs to be at
> least unit tested, but ideally even would be exposed so that it is possible
> to test on the CQL level. Is there some mechanism such as experimental
> flags, which would allow the SAI-only O

Re: [DISCUSS] CEP-7 Storage Attached Index

2022-02-07 Thread bened...@apache.org
I don’t have a strong opinion about CEP-7 taking a hard dependency on any new 
CQL CEP, particularly from a point of view of first landing in the codebase.


From: Henrik Ingo 
Date: Monday, 7 February 2022 at 12:03
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] CEP-7 Storage Attached Index
Thanks Benjamin for reviewing and raising this.

While I don't speak for the CEP authors, just some thoughts from me:

On Mon, Feb 7, 2022 at 11:18 AM Benjamin Lerer wrote:
I would like to raise 2 points regarding the current CEP proposal:

1. There is mention of some target versions and of the removal of SASI

At this point, we have not agreed on any version numbers and I do not feel that 
removing SASI should be part of the proposal for now.
It seems to me that we should see first the adoption surrounding SAI before 
talking about deprecating other solutions.


This seems rather uncontroversial. I think the CEP template and previous CEPs 
invite  the discussion on whether the new feature will or may replace an 
existing feature. But at the same time that's of course out of scope for the 
work at hand. I have no opinion one way or the other myself.


2. OR queries

It is unclear to me if the proposal is about adding OR support only for SAI 
index or for other types of queries too.
In the past, we had the nasty habit for CQL to provide only partially 
implemented features which resulted in a bad user experience.
Some examples are:
* LIKE restrictions which were introduced for the need of SASI and were 
never supported for other types of queries
* IS NOT NULL restrictions for MATERIALIZED VIEWS that are not supported 
elsewhere
* != operator only supported for conditional inserts or updates
And there are unfortunately many more.

We are currently slowly trying to fix those issues and make CQL a more mature 
language. As a consequence, I would like us to change our way of doing things. 
If we introduce support for OR, it should also cover all the other types of 
queries and be fully tested.
I also believe that it is a feature that due to its complexity fully deserves 
its own CEP.


The current code that would be submitted for review after the CEP is adopted, 
contains OR support beyond just SAI indexes. An initial implementation first 
targeted only such queries where all columns in a WHERE clause using OR needed 
to be backed by an SAI index. This was since extended to also support ALLOW 
FILTERING mode as well as OR with clustering key columns. The current 
implementation is by no means perfect as a general purpose OR support, the 
focus all the time was on implementing OR support in SAI. I'll leave it to 
others to enumerate exactly the limitations of the current implementation.

Seeing that also Benedict supports your point of view, I would steer the 
conversation more into a project management perspective:
* How can we advance CEP-7 so that the bulk of the SAI code can still be added 
to Cassandra, so that  users can benefit from this new index type, albeit 
without OR?
* This is also an important question from the point of view that this is a 
large block of code that will inevitably diverge if it's not in trunk. Also, 
merging it to trunk will allow future enhancements, including the OR syntax 
btw, to happen against trunk (aka upstream first).
* Since OR support nevertheless is a feature of SAI, it needs to be at least 
unit tested, but ideally even would be exposed so that it is possible to test 
on the CQL level. Is there some mechanism such as experimental flags, which 
would allow the SAI-only OR support to be merged into trunk, while a separate 
CEP is focused on implementing "proper" general purpose OR support? I should 
note that there is no guarantee that the OR CEP would be implemented in time 
for the next release. So the answer to this point needs to be something that 
doesn't violate the desire for good user experience.

henrik




Re: [DISCUSS] CEP-7 Storage Attached Index

2022-02-07 Thread J. D. Jordan
Given this discussion +1 from me to move OR to its own CEP separate from the 
new index implementation.

> On Feb 7, 2022, at 6:51 AM, Benjamin Lerer  wrote:
> 
> 
>> This was since extended to also support ALLOW FILTERING mode as well as OR 
>> with clustering key columns.
> 
> If the code is able to support query using clustering columns  without the 
> need for filtering + filtering queries then it should be relatively easy to 
> have full support for CQL.
> We also need some proper test coverage and ideally some validation with Harry.
> 
>>   * Since OR support nevertheless is a feature of SAI, it needs to be at 
>> least unit tested, but ideally even would be exposed so that it is possible 
>> to test on the CQL level. Is there some mechanism such as experimental 
>> flags, which would allow the SAI-only OR support to be merged into trunk, 
>> while a separate CEP is focused on implementing "proper" general purpose OR 
>> support? I should note that there is no guarantee that the OR CEP would be 
>> implemented in time for the next release. So the answer to this point needs 
>> to be something that doesn't violate the desire for good user experience.
> 
> This is currently what we have with SASI. Currently SASI is behind an 
> experimental flag but nevertheless the LIKE restriction code has been 
> introduced as part of the code base and its use will result in an error 
> without a SASI index.
> SASI has been there for multiple years and we still do not support LIKE 
> restrictions for other use cases.
> I am against that approach because I do believe that it is what has led us 
> where we are today. We need to stop adding bits of CQL grammar to fulfill the 
> need of a given feature and start considering CQL as a whole.
> 
> I am in favor of moving forward with SAI without OR support until OR can be 
> properly added to CQL. 
> 
>  
>  
> 
>> On Mon, 7 Feb 2022 at 13:11, Henrik Ingo wrote:
>> Thanks Benjamin for reviewing and raising this.
>> 
>> While I don't speak for the CEP authors, just some thoughts from me:
>> 
>>> On Mon, Feb 7, 2022 at 11:18 AM Benjamin Lerer  wrote:
>> 
>>> I would like to raise 2 points regarding the current CEP proposal:
>>> 
>>> 1. There is mention of some target versions and of the removal of SASI 
>>> 
>>> At this point, we have not agreed on any version numbers and I do not feel 
>>> that removing SASI should be part of the proposal for now.
>>> It seems to me that we should see first the adoption surrounding SAI before 
>>> talking about deprecating other solutions.
>>> 
>> 
>> This seems rather uncontroversial. I think the CEP template and previous 
>> CEPs invite  the discussion on whether the new feature will or may replace 
>> an existing feature. But at the same time that's of course out of scope for 
>> the work at hand. I have no opinion one way or the other myself.
>> 
>>  
>>> 2. OR queries
>>> 
>>> It is unclear to me if the proposal is about adding OR support only for SAI 
>>> index or for other types of queries too.
>>> In the past, we had the nasty habit for CQL to provide only partially 
>>> implemented features which resulted in a bad user experience.
>>> Some examples are:
>>> * LIKE restrictions which were introduced for the need of SASI and were 
>>> never supported for other types of queries
>>> * IS NOT NULL restrictions for MATERIALIZED VIEWS that are not supported 
>>> elsewhere
>>> * != operator only supported for conditional inserts or updates
>>> And there are unfortunately many more.
>>> 
>>> We are currently slowly trying to fix those issues and make CQL a more 
>>> mature language. As a consequence, I would like us to change our way of 
>>> doing things. If we introduce support for OR, it should also cover all the 
>>> other types of queries and be fully tested.
>>> I also believe that it is a feature that due to its complexity fully 
>>> deserves its own CEP.
>>> 
>> 
>> The current code that would be submitted for review after the CEP is 
>> adopted, contains OR support beyond just SAI indexes. An initial 
>> implementation first targeted only such queries where all columns in a WHERE 
>> clause using OR needed to be backed by an SAI index. This was since extended 
>> to also support ALLOW FILTERING mode as well as OR with clustering key 
>> columns. The current implementation is by no means perfect as a general 
>> purpose OR support, the focus all the time was on implementing OR support in 
>> SAI. I'll leave it to others to enumerate exactly the limitations of the 
>> current implementation.
>> 
>> Seeing that also Benedict supports your point of view, I would steer the 
>> conversation more into a project management perspective:
>> * How can we advance CEP-7 so that the bulk of the SAI code can still be 
>> added to Cassandra, so that  users can benefit from this new index type, 
>> albeit without OR?
>> * This is also an important question from the point of view that this is a 
>> large block of code that will 

Re: [DISCUSS] CEP-7 Storage Attached Index

2022-02-07 Thread Benjamin Lerer
>
> This was since extended to also support ALLOW FILTERING mode as well as OR
> with clustering key columns.


If the code is able to support query using clustering columns  without the
need for filtering + filtering queries then it should be relatively easy to
have full support for CQL.
We also need some proper test coverage and ideally some validation with
Harry.

> * Since OR support nevertheless is a feature of SAI, it needs to be at
> least unit tested, but ideally even would be exposed so that it is possible
> to test on the CQL level. Is there some mechanism such as experimental
> flags, which would allow the SAI-only OR support to be merged into trunk,
> while a separate CEP is focused on implementing "proper" general purpose OR
> support? I should note that there is no guarantee that the OR CEP would be
> implemented in time for the next release. So the answer to this point needs
> to be something that doesn't violate the desire for good user experience.
>

This is currently what we have with SASI. Currently SASI is behind an
experimental flag but nevertheless the LIKE restriction code has been
introduced as part of the code base and its use will result in an error
without a SASI index.
SASI has been there for multiple years and we still do not support LIKE
restrictions for other use cases.
I am against that approach because I do believe that it is what has led us
where we are today. We need to stop adding bits of CQL grammar to fulfill
the need of a given feature and start considering CQL as a whole.

I am in favor of moving forward with SAI without OR support until OR can be
properly added to CQL.




On Mon, 7 Feb 2022 at 13:11, Henrik Ingo wrote:

> Thanks Benjamin for reviewing and raising this.
>
> While I don't speak for the CEP authors, just some thoughts from me:
>
> On Mon, Feb 7, 2022 at 11:18 AM Benjamin Lerer  wrote:
>
>> I would like to raise 2 points regarding the current CEP proposal:
>>
>> 1. There is mention of some target versions and of the removal of SASI
>>
>> At this point, we have not agreed on any version numbers and I do not
>> feel that removing SASI should be part of the proposal for now.
>> It seems to me that we should see first the adoption surrounding SAI
>> before talking about deprecating other solutions.
>>
>>
> This seems rather uncontroversial. I think the CEP template and previous
> CEPs invite  the discussion on whether the new feature will or may replace
> an existing feature. But at the same time that's of course out of scope for
> the work at hand. I have no opinion one way or the other myself.
>
>
>
>> 2. OR queries
>>
>> It is unclear to me if the proposal is about adding OR support only for
>> SAI index or for other types of queries too.
>> In the past, we had the nasty habit for CQL to provide only partially
>> implemented features which resulted in a bad user experience.
>> Some examples are:
>> * LIKE restrictions which were introduced for the need of SASI and were
>> never supported for other types of queries
>> * IS NOT NULL restrictions for MATERIALIZED VIEWS that are not supported
>> elsewhere
>> * != operator only supported for conditional inserts or updates
>> And there are unfortunately many more.
>>
>> We are currently slowly trying to fix those issues and make CQL a more
>> mature language. As a consequence, I would like us to change our way of
>> doing things. If we introduce support for OR, it should also cover all the
>> other types of queries and be fully tested.
>> I also believe that it is a feature that due to its complexity fully
>> deserves its own CEP.
>>
>>
> The current code that would be submitted for review after the CEP is
> adopted, contains OR support beyond just SAI indexes. An initial
> implementation first targeted only such queries where all columns in a
> WHERE clause using OR needed to be backed by an SAI index. This was since
> extended to also support ALLOW FILTERING mode as well as OR with clustering
> key columns. The current implementation is by no means perfect as a general
> purpose OR support, the focus all the time was on implementing OR support
> in SAI. I'll leave it to others to enumerate exactly the limitations of the
> current implementation.
>
> Seeing that also Benedict supports your point of view, I would steer the
> conversation more into a project management perspective:
> * How can we advance CEP-7 so that the bulk of the SAI code can still be
> added to Cassandra, so that  users can benefit from this new index type,
> albeit without OR?
> * This is also an important question from the point of view that this is a
> large block of code that will inevitably diverge if it's not in trunk.
> Also, merging it to trunk will allow future enhancements, including the OR
> syntax btw, to happen against trunk (aka upstream first).
> * Since OR support nevertheless is a feature of SAI, it needs to be at
> least unit tested, but ideally even would be exposed so that it is possible
> to test on the CQL 

Re: [DISCUSS] CEP-7 Storage Attached Index

2022-02-07 Thread Henrik Ingo
Thanks Benjamin for reviewing and raising this.

While I don't speak for the CEP authors, just some thoughts from me:

On Mon, Feb 7, 2022 at 11:18 AM Benjamin Lerer  wrote:

> I would like to raise 2 points regarding the current CEP proposal:
>
> 1. There is mention of some target versions and of the removal of SASI
>
> At this point, we have not agreed on any version numbers and I do not feel
> that removing SASI should be part of the proposal for now.
> It seems to me that we should see first the adoption surrounding SAI
> before talking about deprecating other solutions.
>
>
This seems rather uncontroversial. I think the CEP template and previous
CEPs invite  the discussion on whether the new feature will or may replace
an existing feature. But at the same time that's of course out of scope for
the work at hand. I have no opinion one way or the other myself.



> 2. OR queries
>
> It is unclear to me if the proposal is about adding OR support only for
> SAI index or for other types of queries too.
> In the past, we had the nasty habit for CQL to provide only partially
> implemented features which resulted in a bad user experience.
> Some examples are:
> * LIKE restrictions which were introduced for the need of SASI and were
> never supported for other types of queries
> * IS NOT NULL restrictions for MATERIALIZED VIEWS that are not supported
> elsewhere
> * != operator only supported for conditional inserts or updates
> And there are unfortunately many more.
>
> We are currently slowly trying to fix those issues and make CQL a more
> mature language. As a consequence, I would like us to change our way of
> doing things. If we introduce support for OR, it should also cover all the
> other types of queries and be fully tested.
> I also believe that it is a feature that due to its complexity fully
> deserves its own CEP.
>
>
The current code that would be submitted for review after the CEP is
adopted, contains OR support beyond just SAI indexes. An initial
implementation first targeted only such queries where all columns in a
WHERE clause using OR needed to be backed by an SAI index. This was since
extended to also support ALLOW FILTERING mode as well as OR with clustering
key columns. The current implementation is by no means perfect as a general
purpose OR support, the focus all the time was on implementing OR support
in SAI. I'll leave it to others to enumerate exactly the limitations of the
current implementation.

Seeing that also Benedict supports your point of view, I would steer the
conversation more into a project management perspective:
* How can we advance CEP-7 so that the bulk of the SAI code can still be
added to Cassandra, so that  users can benefit from this new index type,
albeit without OR?
* This is also an important question from the point of view that this is a
large block of code that will inevitably diverge if it's not in trunk.
Also, merging it to trunk will allow future enhancements, including the OR
syntax btw, to happen against trunk (aka upstream first).
* Since OR support nevertheless is a feature of SAI, it needs to be at
least unit tested, but ideally even would be exposed so that it is possible
to test on the CQL level. Is there some mechanism such as experimental
flags, which would allow the SAI-only OR support to be merged into trunk,
while a separate CEP is focused on implementing "proper" general purpose OR
support? I should note that there is no guarantee that the OR CEP would be
implemented in time for the next release. So the answer to this point needs
to be something that doesn't violate the desire for good user experience.

henrik


Re: [DISCUSS] CEP-7 Storage Attached Index

2022-02-03 Thread Mike Adamson
I can’t see why there would be any objection to adding a guardrail. I think this is 
a good idea.

MikeA

"I see this as a task for a follow-up ticket so long as the CEP’s contributors 
would not oppose the addition of such a guardrail."

> On 3 Feb 2022, at 16:06, C. Scott Andreas  wrote:
> 
> I see this as a task for a follow-up ticket so long as the CEP’s contributors 
> would not oppose the addition of such a guardrail.



Re: [DISCUSS] CEP-7 Storage Attached Index

2022-02-02 Thread Jeremiah D Jordan
Given the distributed search part is an issue with our secondary indexes in 
general, and not with any implementation, I don’t see a reason to hold up a 
vote on CEP-7 for it?

-Jeremiah

> On Feb 2, 2022, at 10:01 AM, Henrik Ingo  wrote:
> 
> So this is an area I've thought about and in fact the overall dynamics are 
> the same as for MongoDB secondary indexes in a sharded cluster. The TL;DR is 
> that the benefits far outweigh the limitations:
> 
> * There's a large area of queries where you have the partition key but not 
> the full Primary Key. SAI (now with row awareness) is an efficient solution 
> for such queries.
> * A special case of the above is when you have a partition key (or keys) but 
> want to sort by something other than the clustering key. However, note that 
> the current version of SAI doesn't actually support sorting.
> * Your cluster has at most 10-20 nodes and the share of queries that lack a 
> partition key is at most 5% - 10%.
> * Even for very large clusters, a low frequency of queries without partition 
> key is fine.
> 
> If all of the above was obvious and the discussion was only about what 
> Guardrails we may want to set to warn or stop the use, then apologies... I 
> would suggest the guardrail could be that if share of non-pk queries *on each 
> node* is above 33% guardrails should warn and if it's above 66% it should 
> fail the non-pk queries.
> 
> I blogged about the math behind scalability of secondary indexes a year ago: 
> https://web.archive.org/web/20210814021809/https://www.openlife.cc/blogs/2020/november/scalability-model-cassandra
>  
> 
> 
> henrik
> 
> On Wed, Feb 2, 2022 at 3:59 PM Joshua McKenzie wrote:
> To me the outstanding thing worth tackling is the Challenges section Caleb 
> added in the CEP. Specifically:
> "The only "easy" way around these two challenges is to focus our efforts on 
> queries that are restricted to either partitions or small token ranges. These 
> queries behave well locally even on LCS (given levels contain token-disjoint 
> SSTables, and assuming a low number of unleveled SSTables), avoid fan-out and 
> all of its secondary pitfalls, and allow us to make queries at varying CLs 
> with reasonable performance. Attempting to fix the local problems around 
> compaction strategy could mean either restricted strategy usage or partially 
> abandoning SSTable-attachment. Attempting to fix distributed read path 
> problems by pushing the design towards IR systems like ES could compromise 
> our ability to use higher read CLs."
> 
> This is probably something we could integrate with Guardrails out of the gate 
> to discourage suboptimal use, right? Or at least allude to in the CEP so it's 
> something on our radar.
> 
> One of the big downfalls of Materialized Views (aside from the orphaned data 
> and inconsistency pains) was the lack of limits on creation of them (either 
> number or structure / data amount) with serious un-inspectable implications 
> on disk usage and performance. The more we can learn from those missteps the 
> better.
> 
> On Wed, Feb 2, 2022 at 8:24 AM Mike Adamson wrote:
> Hi,
> 
> I’d like to restart this thread.
> 
> We merged the row-aware branch to the SAI codebase just before Christmas and 
> have subsequently updated the CEP to reflect these changes.
> 
> I would like to move the discussion forward as to how we move this CEP 
> towards a vote.
> 
> MikeA
> 
>> On 16 Sep 2021, at 19:49, DuyHai Doan wrote:
>> 
>> Good news, Mike, that row-based indexing will be available; this was a major
>> gap in SASI at that time!
>> 
>> On Thu, 16 Sep 2021 at 15:38, Mike Adamson wrote:
>> 
>>> Hi,
>>> 
>>> Just to keep this thread up to date with development progress, we will be
>>> adding row-aware support to SAI in the next few weeks. This is currently
>>> going through the final stages of review and testing.
>>> 
>>> This feature also adds on-disk versioning to SAI. This allows SAI to
>>> support multiple on-disk formats during upgrades.
>>> 
>>> I am mentioning this now because the CEP mentions “Partition Based
>>> Iteration” as an initial feature. We will change that to “Row Based
>>> Iteration” when the feature is merged.
>>> 
>>> MikeA
>>> 
>>>> On 15 Sep 2021, at 19:42, Caleb Rackliffe wrote:
>>>> 
>>>> Hey there,
>>>> 
>>>> In the spirit of trying to get as many possible objections to a successful
>>>> vote out of the way, I've added a "Challenges" section to the CEP:
>>>> 
>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-7%3A+Storage+Attached+Index#CEP7:StorageAttachedIndex-Challenges

Re: [DISCUSS] CEP-7 Storage Attached Index

2022-02-02 Thread Henrik Ingo
So this is an area I've thought about and in fact the overall dynamics are
the same as for MongoDB secondary indexes in a sharded cluster. The TL;DR
is that the benefits far outweigh the limitations:

* There's a large area of queries where you have the partition key but not
the full Primary Key. SAI (now with row awareness) is an efficient solution
for such queries.
* A special case of the above is when you have a partition key (or keys)
but want to sort by something other than the clustering key. However,
note that the current version of SAI doesn't actually support sorting.
* Your cluster has at most 10-20 nodes and the share of queries that lack a
partition key is at most 5% - 10%.
* Even for very large clusters, a low frequency of queries without
partition key is fine.
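
To put rough numbers on the last two points, here is a back-of-the-envelope sketch in Python. It rests on a deliberately simple assumption that is not taken from the thread: a query with a partition key is served by one node, a query without one is served by every node, and load is spread evenly. The function names are hypothetical.

```python
def per_node_load(total_qps: float, nodes: int, non_pk_share: float) -> float:
    """Queries per second each node serves under a simplified fan-out model.

    Assumption: partition-restricted queries hit one node, queries without a
    partition key hit every node, and load is evenly distributed.
    """
    if nodes < 1:
        raise ValueError("nodes must be >= 1")
    if not 0.0 <= non_pk_share <= 1.0:
        raise ValueError("non_pk_share must be between 0 and 1")
    # Partition-restricted queries are divided among the nodes.
    pk_load = total_qps * (1.0 - non_pk_share) / nodes
    # Fan-out queries land on every node, so they are not divided.
    fan_out_load = total_qps * non_pk_share
    return pk_load + fan_out_load


def fan_out_fraction(nodes: int, non_pk_share: float) -> float:
    """Fraction of one node's work that comes from fan-out queries."""
    load = per_node_load(1.0, nodes, non_pk_share)
    return non_pk_share / load if load else 0.0


if __name__ == "__main__":
    for nodes, share in [(10, 0.05), (20, 0.10), (100, 0.01)]:
        print(nodes, share, round(fan_out_fraction(nodes, share), 3))
```

Under these assumptions, a 10-node cluster where 5% of client queries lack a partition key spends about a third of each node's work on fan-out queries, which is why a small cluster-wide share can still matter per node.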

If all of the above was obvious and the discussion was only about what
Guardrails we may want to set to warn or stop the use, then apologies... I
would suggest the guardrail could be that if share of non-pk queries *on
each node* is above 33% guardrails should warn and if it's above 66% it
should fail the non-pk queries.
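
The warn and fail thresholds suggested above could be expressed as a small check. This is only a sketch in Python: it is not the Cassandra Guardrails API, and `GuardrailAction`, `check_non_pk_share` and the per-node counters it takes are hypothetical names chosen for illustration.

```python
from enum import Enum


class GuardrailAction(Enum):
    ALLOW = "allow"
    WARN = "warn"
    FAIL = "fail"


# Thresholds as suggested in the message; the constant names are hypothetical.
WARN_THRESHOLD = 0.33
FAIL_THRESHOLD = 0.66


def check_non_pk_share(non_pk_queries: int, total_queries: int) -> GuardrailAction:
    """Classify one node's share of queries that lack a partition key."""
    if total_queries <= 0:
        # Nothing observed yet, so there is nothing to warn about.
        return GuardrailAction.ALLOW
    share = non_pk_queries / total_queries
    if share > FAIL_THRESHOLD:
        return GuardrailAction.FAIL
    if share > WARN_THRESHOLD:
        return GuardrailAction.WARN
    return GuardrailAction.ALLOW


if __name__ == "__main__":
    print(check_non_pk_share(50, 100).value)
```

The counts are meant to be measured on each node rather than cluster-wide, matching the "on each node" wording of the suggestion.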

I blogged about the math behind scalability of secondary indexes a year
ago:
https://web.archive.org/web/20210814021809/https://www.openlife.cc/blogs/2020/november/scalability-model-cassandra

henrik

On Wed, Feb 2, 2022 at 3:59 PM Joshua McKenzie  wrote:

> To me the outstanding thing worth tackling is the Challenges section Caleb
> added in the CEP. Specifically:
> "The only "easy" way around these two challenges is to focus our efforts
> on queries that are restricted to either partitions or small token ranges.
> These queries behave well locally even on LCS (given levels contain
> token-disjoint SSTables, and assuming a low number of unleveled SSTables),
> avoid fan-out and all of its secondary pitfalls, and allow us to make
> queries at varying CLs with reasonable performance. Attempting to fix the
> local problems around compaction strategy could mean either restricted
> strategy usage or partially abandoning SSTable-attachment. Attempting to
> fix distributed read path problems by pushing the design towards IR systems
> like ES could compromise our ability to use higher read CLs."
>
> This is probably something we could integrate with Guardrails out of the
> gate to discourage suboptimal use, right? Or at least allude to in the CEP
> so it's something on our radar.
>
> One of the big downfalls of Materialized Views (aside from the orphaned
> data and inconsistency pains) was the lack of limits on creation of them
> (either number or structure / data amount) with serious un-inspectable
> implications on disk usage and performance. The more we can learn from
> those missteps the better.
>
> On Wed, Feb 2, 2022 at 8:24 AM Mike Adamson  wrote:
>
>> Hi,
>>
>> I’d like to restart this thread.
>>
>> We merged the row-aware branch to the SAI codebase just before Christmas
>> and have subsequently updated the CEP to reflect these changes.
>>
>> I would like to move the discussion forward as to how we move this CEP
>> towards a vote.
>>
>> MikeA
>>
>> On 16 Sep 2021, at 19:49, DuyHai Doan  wrote:
>>
>> Good news, Mike, that row-based indexing will be available; this was a major
>> gap in SASI at the time!
>>
>> On Thu, 16 Sep 2021 at 15:38, Mike Adamson wrote:
>>
>> Hi,
>>
>> Just to keep this thread up to date with development progress, we will be
>> adding row-aware support to SAI in the next few weeks. This is currently
>> going through the final stages of review and testing.
>>
>> This feature also adds on-disk versioning to SAI. This allows SAI to
>> support multiple on-disk formats during upgrades.
>>
>> I am mentioning this now because the CEP mentions “Partition Based
>> Iteration” as an initial feature. We will change that to “Row Based
>> Iteration” when the feature is merged.
>>
>> MikeA
>>
>> On 15 Sep 2021, at 19:42, Caleb Rackliffe wrote:
>>
>> Hey there,
>>
>> In the spirit of trying to get as many possible objections to a successful
>> vote out of the way, I've added a "Challenges" section to the CEP:
>>
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-7%3A+Storage+Attached+Index#CEP7:StorageAttachedIndex-Challenges
>>
>> Most of you will be familiar with these, but I think we need to be as
>> open/candid as possible about the potential risk they pose to SAI's broader
>> usability. I've described them from the point of view that they are not
>> intractable, but if anyone thinks they are, let's hash that disagreement
>> out.
>>
>> Thanks!
>>
>> On Thu, Sep 9, 2021 at 11:13 AM Patrick McFadin wrote:
>>
>> +1 on introducing this in an incremental manner and after reading through
>> CASSANDRA-16092 that seems like a perfect place to start. I see that
>>
>> 

Re: [DISCUSS] CEP-7 Storage Attached Index

2022-02-02 Thread Joshua McKenzie
To me the outstanding thing worth tackling is the Challenges section Caleb
added in the CEP. Specifically:
"The only "easy" way around these two challenges is to focus our efforts on
queries that are restricted to either partitions or small token ranges.
These queries behave well locally even on LCS (given levels contain
token-disjoint SSTables, and assuming a low number of unleveled SSTables),
avoid fan-out and all of its secondary pitfalls, and allow us to make
queries at varying CLs with reasonable performance. Attempting to fix the
local problems around compaction strategy could mean either restricted
strategy usage or partially abandoning SSTable-attachment. Attempting to
fix distributed read path problems by pushing the design towards IR systems
like ES could compromise our ability to use higher read CLs."

This is probably something we could integrate with Guardrails out of the
gate to discourage suboptimal use, right? Or at least allude to in the CEP
so it's something on our radar.

One of the big downfalls of Materialized Views (aside from the orphaned
data and inconsistency pains) was the lack of limits on creation of them
(either number or structure / data amount) with serious un-inspectable
implications on disk usage and performance. The more we can learn from
those missteps the better.

On Wed, Feb 2, 2022 at 8:24 AM Mike Adamson  wrote:

> Hi,
>
> I’d like to restart this thread.
>
> We merged the row-aware branch to the SAI codebase just before Christmas
> and have subsequently updated the CEP to reflect these changes.
>
> I would like to move the discussion forward as to how we move this CEP
> towards a vote.
>
> MikeA
>
> On 16 Sep 2021, at 19:49, DuyHai Doan  wrote:
>
> Good news, Mike, that row-based indexing will be available; this was a major
> gap in SASI at the time!
>
> On Thu, 16 Sep 2021 at 15:38, Mike Adamson wrote:
>
> Hi,
>
> Just to keep this thread up to date with development progress, we will be
> adding row-aware support to SAI in the next few weeks. This is currently
> going through the final stages of review and testing.
>
> This feature also adds on-disk versioning to SAI. This allows SAI to
> support multiple on-disk formats during upgrades.
>
> I am mentioning this now because the CEP mentions “Partition Based
> Iteration” as an initial feature. We will change that to “Row Based
> Iteration” when the feature is merged.
>
> MikeA
>
> On 15 Sep 2021, at 19:42, Caleb Rackliffe wrote:
>
> Hey there,
>
> In the spirit of trying to get as many possible objections to a successful
> vote out of the way, I've added a "Challenges" section to the CEP:
>
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-7%3A+Storage+Attached+Index#CEP7:StorageAttachedIndex-Challenges
>
> Most of you will be familiar with these, but I think we need to be as
> open/candid as possible about the potential risk they pose to SAI's broader
> usability. I've described them from the point of view that they are not
> intractable, but if anyone thinks they are, let's hash that disagreement
> out.
>
> Thanks!
>
> On Thu, Sep 9, 2021 at 11:13 AM Patrick McFadin wrote:
>
> +1 on introducing this in an incremental manner and after reading through
> CASSANDRA-16092 that seems like a perfect place to start. I see that work
> on that Jira has stopped until direction for CEP-7 has been voted in.
>
> I say start the vote and let's get this really valuable developer feature
> underway.
>
> Patrick
>
> On Tue, Sep 7, 2021 at 10:40 AM Caleb Rackliffe <calebrackli...@gmail.com>
> wrote:
>
> So this thread stalled almost a year ago. (Wow, time flies when you're
> trying to release 4.0.) My synthesis of the conversation to this point is
> that while there are some open questions about testing
> methodology/"definition of done" and our choice of particular on-disk data
> structures, neither of these should be a serious obstacle to moving forward
> w/ a vote. Having said that, is there anything left around the CEP that we
> feel should prevent it from moving to a vote?
>
> In terms of how we would proceed from the point a vote passes, it seems
> like there have been enough concerns around the proposed/necessary breaking
> changes to the 2i API, that we will start development by introducing
> components as incrementally as possible into a long-running feature branch
> off trunk. (This work would likely start w/ *CASSANDRA-16092*, which we could
> resolve as a sub-task of the SAI epic without interfering with other trunk
> development likely destined for a 4.x minor, etc.)
>
> On Thu, Sep 24, 2020 at 2:47 AM Jasonstack Zhao Yang <
> jasonstack.z...@gmail.com> wrote:
>
> Question is: is 

Re: [DISCUSS] CEP-7 Storage Attached Index

2022-02-02 Thread Mike Adamson
Hi,

I’d like to restart this thread.

We merged the row-aware branch to the SAI codebase just before Christmas and 
have subsequently updated the CEP to reflect these changes.

I would like to move the discussion forward as to how we move this CEP towards 
a vote.

MikeA

> On 16 Sep 2021, at 19:49, DuyHai Doan  wrote:
> 
> Good news, Mike, that row-based indexing will be available; this was a major
> gap in SASI at the time!
>
> On Thu, 16 Sep 2021 at 15:38, Mike Adamson wrote:
> 
>> Hi,
>> 
>> Just to keep this thread up to date with development progress, we will be
>> adding row-aware support to SAI in the next few weeks. This is currently
>> going through the final stages of review and testing.
>> 
>> This feature also adds on-disk versioning to SAI. This allows SAI to
>> support multiple on-disk formats during upgrades.
>> 
>> I am mentioning this now because the CEP mentions “Partition Based
>> Iteration” as an initial feature. We will change that to “Row Based
>> Iteration” when the feature is merged.
>> 
>> MikeA
>> 
>>> On 15 Sep 2021, at 19:42, Caleb Rackliffe wrote:
>>>
>>> Hey there,
>>>
>>> In the spirit of trying to get as many possible objections to a successful
>>> vote out of the way, I've added a "Challenges" section to the CEP:
>>>
>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-7%3A+Storage+Attached+Index#CEP7:StorageAttachedIndex-Challenges
>>>
>>> Most of you will be familiar with these, but I think we need to be as
>>> open/candid as possible about the potential risk they pose to SAI's broader
>>> usability. I've described them from the point of view that they are not
>>> intractable, but if anyone thinks they are, let's hash that disagreement
>>> out.
>>> 
>>> Thanks!
>>> 
>>> On Thu, Sep 9, 2021 at 11:13 AM Patrick McFadin wrote:
>>>
>>>> +1 on introducing this in an incremental manner and after reading through
>>>> CASSANDRA-16092 that seems like a perfect place to start. I see that work
>>>> on that Jira has stopped until direction for CEP-7 has been voted in.
>>>>
>>>> I say start the vote and let's get this really valuable developer feature
>>>> underway.
>>>>
>>>> Patrick
>>>>
>>>> On Tue, Sep 7, 2021 at 10:40 AM Caleb Rackliffe <calebrackli...@gmail.com>
>>>> wrote:
>>>>
>>>>> So this thread stalled almost a year ago. (Wow, time flies when you're
>>>>> trying to release 4.0.) My synthesis of the conversation to this point is
>>>>> that while there are some open questions about testing
>>>>> methodology/"definition of done" and our choice of particular on-disk data
>>>>> structures, neither of these should be a serious obstacle to moving forward
>>>>> w/ a vote. Having said that, is there anything left around the CEP that we
>>>>> feel should prevent it from moving to a vote?
>>>>>
>>>>> In terms of how we would proceed from the point a vote passes, it seems
>>>>> like there have been enough concerns around the proposed/necessary breaking
>>>>> changes to the 2i API, that we will start development by introducing
>>>>> components as incrementally as possible into a long-running feature branch
>>>>> off trunk. (This work would likely start w/ *CASSANDRA-16092*, which we could
>>>>> resolve as a sub-task of the SAI epic without interfering with other trunk
>>>>> development likely destined for a 4.x minor, etc.)
>>>>>
>>>>> On Thu, Sep 24, 2020 at 2:47 AM Jasonstack Zhao Yang <
>>>>> jasonstack.z...@gmail.com> wrote:
> 
 Question is: is this planned as a next step?
 If yes, how are we going to mark SAI as experimental until it gets
 row offsets? Also, it is likely that index format is going to change
>> when
 row offsets are added, so my concern is that we may have to support
> two
 versions of a format for a smooth migration.
>> 
>> The goal is to support row-level index when merging SAI, I will update
> the
>> CEP about it.
>> 
 I think switching to row
 offsets also has a huge impact on interaction with SPRC and has some
 potential for optimisations.
>> 
>> Can you share more details on the optimizations?
>> 
>> 
>> 
>> On Thu, 24 Sep 2020 at 15:20, Oleksandr Petrov <
> oleksandr.pet...@gmail.com 
>>> 
>> wrote:
>> 
 But for improving overall index read 

Re: [DISCUSS] CEP-7 Storage Attached Index

2021-09-16 Thread DuyHai Doan
Good news, Mike, that row-based indexing will be available; this was a major
gap in SASI at the time!

On Thu, 16 Sep 2021 at 15:38, Mike Adamson wrote:

> Hi,
>
> Just to keep this thread up to date with development progress, we will be
> adding row-aware support to SAI in the next few weeks. This is currently
> going through the final stages of review and testing.
>
> This feature also adds on-disk versioning to SAI. This allows SAI to
> support multiple on-disk formats during upgrades.
>
> I am mentioning this now because the CEP mentions “Partition Based
> Iteration” as an initial feature. We will change that to “Row Based
> Iteration” when the feature is merged.
>
> MikeA
>
> > On 15 Sep 2021, at 19:42, Caleb Rackliffe 
> wrote:
> >
> > Hey there,
> >
> > In the spirit of trying to get as many possible objections to a
> successful
> > vote out of the way, I've added a "Challenges" section to the CEP:
> >
> >
> > https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-7%3A+Storage+Attached+Index#CEP7:StorageAttachedIndex-Challenges
> >
> >
> > Most of you will be familiar with these, but I think we need to be as
> > open/candid as possible about the potential risk they pose to SAI's
> broader
> > usability. I've described them from the point of view that they are not
> > intractable, but if anyone thinks they are, let's hash that disagreement
> > out.
> >
> > Thanks!
> >
> > On Thu, Sep 9, 2021 at 11:13 AM Patrick McFadin wrote:
> >
> >> +1 on introducing this in an incremental manner and after reading
> through
> >> CASSANDRA-16092 that seems like a perfect place to start. I see that
> work
> >> on that Jira has stopped until direction for CEP-7 has been voted in.
> >>
> >> I say start the vote and let's get this really valuable developer
> feature
> >> underway.
> >>
> >> Patrick
> >>
> >> On Tue, Sep 7, 2021 at 10:40 AM Caleb Rackliffe <
> calebrackli...@gmail.com>
> >> wrote:
> >>
> >>> So this thread stalled almost a year ago. (Wow, time flies when you're
> >>> trying to release 4.0.) My synthesis of the conversation to this point
> is
> >>> that while there are some open questions about testing
> >>> methodology/"definition of done" and our choice of particular on-disk
> >> data
> >>> structures, neither of these should be a serious obstacle to moving
> >> forward
> >>> w/ a vote. Having said that, is there anything left around the CEP that
> >> we
> >>> feel should prevent it from moving to a vote?
> >>>
> >>> In terms of how we would proceed from the point a vote passes, it seems
> >>> like there have been enough concerns around the proposed/necessary
> >> breaking
> >>> changes to the 2i API, that we will start development by introducing
> >>> components as incrementally as possible into a long-running feature
> >> branch
> >>> off trunk. (This work would likely start w/ *CASSANDRA-16092*, which we could
> >>> resolve as a sub-task of the SAI epic without interfering with other
> >> trunk
> >>> development likely destined for a 4.x minor, etc.)
> >>>
> >>> On Thu, Sep 24, 2020 at 2:47 AM Jasonstack Zhao Yang <
> >>> jasonstack.z...@gmail.com> wrote:
> >>>
> >> Question is: is this planned as a next step?
> >> If yes, how are we going to mark SAI as experimental until it gets
> >> row offsets? Also, it is likely that index format is going to change
>  when
> >> row offsets are added, so my concern is that we may have to support
> >>> two
> >> versions of a format for a smooth migration.
> 
>  The goal is to support row-level index when merging SAI, I will update
> >>> the
>  CEP about it.
> 
> >> I think switching to row
> >> offsets also has a huge impact on interaction with SPRC and has some
> >> potential for optimisations.
> 
>  Can you share more details on the optimizations?
> 
> 
> 
>  On Thu, 24 Sep 2020 at 15:20, Oleksandr Petrov <
> >>> oleksandr.pet...@gmail.com
> >
>  wrote:
> 
> >> But for improving overall index read performance, I think improving
>  base
> > table read perf  (because SAI/SASI executes LOTS of
> > SinglePartitionReadCommand after searching on-disk index) is more
>  effective
> > than switching from Trie to Prefix BTree.
> >
> > I haven't suggested switching to Prefix B-Tree or any other
> >> structure,
>  the
> > question was about rationale and motivation of picking one over the
>  other,
> > which I am curious about for personal reasons/interests that lie
> >>> outside
>  of
> > Cassandra. Having this listed in CEP could have been helpful for
> >> future
> > guidance. It's ok if this question is outside of the CEP scope.
> >
> > I also agree that there are many areas that require 

Re: [DISCUSS] CEP-7 Storage Attached Index

2021-09-16 Thread Mike Adamson
Hi,

Just to keep this thread up to date with development progress, we will be 
adding row-aware support to SAI in the next few weeks. This is currently going 
through the final stages of review and testing. 

This feature also adds on-disk versioning to SAI. This allows SAI to support 
multiple on-disk formats during upgrades. 
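
As a rough illustration of what reading more than one on-disk format can look like, here is a minimal Python sketch of dispatching on a version tag. Everything in it is hypothetical: the two-byte tags `v1` and `v2`, the reader functions and the layout are made up for the example and do not describe SAI's actual file format.

```python
from typing import Callable, Dict


def read_v1(payload: bytes) -> dict:
    # Hypothetical older format: entries point at partitions.
    return {"format": "v1", "granularity": "partition", "payload": payload}


def read_v2(payload: bytes) -> dict:
    # Hypothetical newer format: entries point at rows.
    return {"format": "v2", "granularity": "row", "payload": payload}


# Registry of known formats; supporting a new format means adding an entry.
READERS: Dict[bytes, Callable[[bytes], dict]] = {
    b"v1": read_v1,
    b"v2": read_v2,
}


def open_index_component(data: bytes) -> dict:
    """Pick a reader based on a 2-byte version tag at the start of the data."""
    tag, payload = data[:2], data[2:]
    try:
        reader = READERS[tag]
    except KeyError:
        raise ValueError(f"unsupported on-disk format version: {tag!r}")
    return reader(payload)
```

During an upgrade, files written by the old and the new version can then sit side by side, since each one carries the tag needed to choose its reader.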

I am mentioning this now because the CEP mentions “Partition Based Iteration” 
as an initial feature. We will change that to “Row Based Iteration” when the 
feature is merged.

MikeA

> On 15 Sep 2021, at 19:42, Caleb Rackliffe  wrote:
> 
> Hey there,
> 
> In the spirit of trying to get as many possible objections to a successful
> vote out of the way, I've added a "Challenges" section to the CEP:
> 
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-7%3A+Storage+Attached+Index#CEP7:StorageAttachedIndex-Challenges
>  
> 
> 
> Most of you will be familiar with these, but I think we need to be as
> open/candid as possible about the potential risk they pose to SAI's broader
> usability. I've described them from the point of view that they are not
> intractable, but if anyone thinks they are, let's hash that disagreement
> out.
> 
> Thanks!
> 
> On Thu, Sep 9, 2021 at 11:13 AM Patrick McFadin wrote:
> 
>> +1 on introducing this in an incremental manner and after reading through
>> CASSANDRA-16092 that seems like a perfect place to start. I see that work
>> on that Jira has stopped until direction for CEP-7 has been voted in.
>> 
>> I say start the vote and let's get this really valuable developer feature
>> underway.
>> 
>> Patrick
>> 
>> On Tue, Sep 7, 2021 at 10:40 AM Caleb Rackliffe 
>> wrote:
>> 
>>> So this thread stalled almost a year ago. (Wow, time flies when you're
>>> trying to release 4.0.) My synthesis of the conversation to this point is
>>> that while there are some open questions about testing
>>> methodology/"definition of done" and our choice of particular on-disk
>> data
>>> structures, neither of these should be a serious obstacle to moving
>> forward
>>> w/ a vote. Having said that, is there anything left around the CEP that
>> we
>>> feel should prevent it from moving to a vote?
>>> 
>>> In terms of how we would proceed from the point a vote passes, it seems
>>> like there have been enough concerns around the proposed/necessary
>> breaking
>>> changes to the 2i API, that we will start development by introducing
>>> components as incrementally as possible into a long-running feature
>> branch
>>> off trunk. (This work would likely start w/ *CASSANDRA-16092*, which we could
>>> resolve as a sub-task of the SAI epic without interfering with other
>> trunk
>>> development likely destined for a 4.x minor, etc.)
>>> 
>>> On Thu, Sep 24, 2020 at 2:47 AM Jasonstack Zhao Yang <
>>> jasonstack.z...@gmail.com> wrote:
>>> 
>> Question is: is this planned as a next step?
>> If yes, how are we going to mark SAI as experimental until it gets
>> row offsets? Also, it is likely that index format is going to change
 when
>> row offsets are added, so my concern is that we may have to support
>>> two
>> versions of a format for a smooth migration.
 
 The goal is to support row-level index when merging SAI, I will update
>>> the
 CEP about it.
 
>> I think switching to row
>> offsets also has a huge impact on interaction with SPRC and has some
>> potential for optimisations.
 
 Can you share more details on the optimizations?
 
 
 
 On Thu, 24 Sep 2020 at 15:20, Oleksandr Petrov <
>>> oleksandr.pet...@gmail.com
> 
 wrote:
 
>> But for improving overall index read performance, I think improving
 base
> table read perf  (because SAI/SASI executes LOTS of
> SinglePartitionReadCommand after searching on-disk index) is more
 effective
> than switching from Trie to Prefix BTree.
> 
> I haven't suggested switching to Prefix B-Tree or any other
>> structure,
 the
> question was about rationale and motivation of picking one over the
 other,
> which I am curious about for personal reasons/interests that lie
>>> outside
 of
> Cassandra. Having this listed in CEP could have been helpful for
>> future
> guidance. It's ok if this question is outside of the CEP scope.
> 
> I also agree that there are many areas that require improvement
>> around
 the
> read/write path and 2i, many of which (even outside of base table
>>> format
 or
> read perf) can yield positive performance results.
> 
>> FWIW, I personally look forward to receiving that contribution when
>>> the
> time is right.
> 
> I am very excited for this contribution, too, and it looks like very
 solid
> work.
> 
> I 

Re: [DISCUSS] CEP-7 Storage Attached Index

2021-09-16 Thread Henrik Ingo
Thanks Caleb.

Those observations are valid factual statements, and it's good to be clear
where limitations are. I'd like to add that the usefulness of
fan-out/broadcast secondary index queries depends on cluster size. I have
noticed that everything in Cassandra tends to be designed for extremely
large scale, with a hundred or more nodes in mind. SAI and other indexes
can, however, be more useful in smaller clusters, where the read
amplification from fan-out is moderate. The Cassandra userbase is probably
different, but my experience from the MongoDB world was that less than 10%
of users even need to use sharding.
Hence 90% of apps can get the full benefit of a good secondary index
implementation, including using queries without partition key. Similarly,
at the other end someone with a large cluster could benefit from the
ability to execute some infrequent query that needs to be broadcast across
the cluster, as long as this query is insignificant in the total workload
of the cluster. (Like once per day or once per hour.)
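
A back-of-the-envelope sketch of that dependence on cluster size, in Python. The model is an assumption made for illustration and is not taken from the thread: consistency level ONE, a partition-restricted query contacts one replica, and a query without a partition key contacts roughly `nodes / replication_factor` replicas to cover every token range once.

```python
def fanout_amplification(nodes: int, replication_factor: int,
                         non_pk_share: float) -> float:
    """Average number of replicas contacted per query (simplified model)."""
    if replication_factor < 1 or nodes < replication_factor:
        raise ValueError("need 1 <= replication_factor <= nodes")
    if not 0.0 <= non_pk_share <= 1.0:
        raise ValueError("non_pk_share must be between 0 and 1")
    # Replicas needed to cover the whole ring once for a fan-out query.
    fanout = nodes / replication_factor
    return (1.0 - non_pk_share) * 1.0 + non_pk_share * fanout


# Same workload mix (5% of queries lack a partition key), growing cluster.
for n in (6, 20, 100):
    print(n, round(fanout_amplification(n, 3, 0.05), 2))
```

Under these assumptions, the same 5% share of non-pk queries costs about 1.05 replica reads per query on 6 nodes and about 2.62 on 100 nodes, which is why the acceptable share shrinks as the cluster grows.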

henrik

On Wed, Sep 15, 2021 at 9:42 PM Caleb Rackliffe 
wrote:

> Hey there,
>
> In the spirit of trying to get as many possible objections to a successful
> vote out of the way, I've added a "Challenges" section to the CEP:
>
>
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-7%3A+Storage+Attached+Index#CEP7:StorageAttachedIndex-Challenges
>
> Most of you will be familiar with these, but I think we need to be as
> open/candid as possible about the potential risk they pose to SAI's broader
> usability. I've described them from the point of view that they are not
> intractable, but if anyone thinks they are, let's hash that disagreement
> out.
>
> Thanks!
>
> On Thu, Sep 9, 2021 at 11:13 AM Patrick McFadin 
> wrote:
>
> > +1 on introducing this in an incremental manner and after reading through
> > CASSANDRA-16092 that seems like a perfect place to start. I see that work
> > on that Jira has stopped until direction for CEP-7 has been voted in.
> >
> > I say start the vote and let's get this really valuable developer feature
> > underway.
> >
> > Patrick
> >
> > On Tue, Sep 7, 2021 at 10:40 AM Caleb Rackliffe <
> calebrackli...@gmail.com>
> > wrote:
> >
> > > So this thread stalled almost a year ago. (Wow, time flies when you're
> > > trying to release 4.0.) My synthesis of the conversation to this point
> is
> > > that while there are some open questions about testing
> > > methodology/"definition of done" and our choice of particular on-disk
> > data
> > > structures, neither of these should be a serious obstacle to moving
> > forward
> > > w/ a vote. Having said that, is there anything left around the CEP that
> > we
> > > feel should prevent it from moving to a vote?
> > >
> > > In terms of how we would proceed from the point a vote passes, it seems
> > > like there have been enough concerns around the proposed/necessary
> > breaking
> > > changes to the 2i API, that we will start development by introducing
> > > components as incrementally as possible into a long-running feature
> > branch
> > > off trunk. (This work would likely start w/ *CASSANDRA-16092*, which we could
> > > resolve as a sub-task of the SAI epic without interfering with other
> > trunk
> > > development likely destined for a 4.x minor, etc.)
> > >
> > > On Thu, Sep 24, 2020 at 2:47 AM Jasonstack Zhao Yang <
> > > jasonstack.z...@gmail.com> wrote:
> > >
> > > > >> Question is: is this planned as a next step?
> > > > >> If yes, how are we going to mark SAI as experimental until it gets
> > > > >> row offsets? Also, it is likely that index format is going to
> change
> > > > when
> > > > >> row offsets are added, so my concern is that we may have to
> support
> > > two
> > > > >> versions of a format for a smooth migration.
> > > >
> > > > The goal is to support row-level index when merging SAI, I will
> update
> > > the
> > > > CEP about it.
> > > >
> > > > >> I think switching to row
> > > > >> offsets also has a huge impact on interaction with SPRC and has
> some
> > > > >> potential for optimisations.
> > > >
> > > > Can you share more details on the optimizations?
> > > >
> > > >
> > > >
> > > > On Thu, 24 Sep 2020 at 15:20, Oleksandr Petrov <
> > > oleksandr.pet...@gmail.com
> > > > >
> > > > wrote:
> > > >
> > > > > > But for improving overall index read performance, I think
> improving
> > > > base
> > > > > table read perf  (because SAI/SASI executes LOTS of
> > > > > SinglePartitionReadCommand after searching on-disk index) is more
> > > > effective
> > > > > than switching from Trie to Prefix BTree.
> > > > >
> > > > > I haven't suggested switching to Prefix B-Tree or any other
> > structure,
> > > > the
> > > > > question was about rationale and motivation of picking one over the
> > > > other,
> > > > > which I am curious about for personal reasons/interests that lie
> > > outside
> > > > of
> > > > 

Re: [DISCUSS] CEP-7 Storage Attached Index

2021-09-15 Thread Caleb Rackliffe
Hey there,

In the spirit of trying to get as many possible objections to a successful
vote out of the way, I've added a "Challenges" section to the CEP:

https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-7%3A+Storage+Attached+Index#CEP7:StorageAttachedIndex-Challenges

Most of you will be familiar with these, but I think we need to be as
open/candid as possible about the potential risk they pose to SAI's broader
usability. I've described them from the point of view that they are not
intractable, but if anyone thinks they are, let's hash that disagreement
out.

Thanks!

On Thu, Sep 9, 2021 at 11:13 AM Patrick McFadin  wrote:

> +1 on introducing this in an incremental manner and after reading through
> CASSANDRA-16092 that seems like a perfect place to start. I see that work
> on that Jira has stopped until direction for CEP-7 has been voted in.
>
> I say start the vote and let's get this really valuable developer feature
> underway.
>
> Patrick
>
> On Tue, Sep 7, 2021 at 10:40 AM Caleb Rackliffe 
> wrote:
>
> > So this thread stalled almost a year ago. (Wow, time flies when you're
> > trying to release 4.0.) My synthesis of the conversation to this point is
> > that while there are some open questions about testing
> > methodology/"definition of done" and our choice of particular on-disk
> data
> > structures, neither of these should be a serious obstacle to moving
> forward
> > w/ a vote. Having said that, is there anything left around the CEP that
> we
> > feel should prevent it from moving to a vote?
> >
> > In terms of how we would proceed from the point a vote passes, it seems
> > like there have been enough concerns around the proposed/necessary
> breaking
> > changes to the 2i API, that we will start development by introducing
> > components as incrementally as possible into a long-running feature
> branch
> > off trunk. (This work would likely start w/ *CASSANDRA-16092*, which we could
> > resolve as a sub-task of the SAI epic without interfering with other
> trunk
> > development likely destined for a 4.x minor, etc.)
> >
> > On Thu, Sep 24, 2020 at 2:47 AM Jasonstack Zhao Yang <
> > jasonstack.z...@gmail.com> wrote:
> >
> > > >> Question is: is this planned as a next step?
> > > >> If yes, how are we going to mark SAI as experimental until it gets
> > > >> row offsets? Also, it is likely that index format is going to change
> > > when
> > > >> row offsets are added, so my concern is that we may have to support
> > two
> > > >> versions of a format for a smooth migration.
> > >
> > > The goal is to support row-level index when merging SAI, I will update
> > the
> > > CEP about it.
> > >
> > > >> I think switching to row
> > > >> offsets also has a huge impact on interaction with SPRC and has some
> > > >> potential for optimisations.
> > >
> > > Can you share more details on the optimizations?
> > >
> > >
> > >
> > > On Thu, 24 Sep 2020 at 15:20, Oleksandr Petrov <
> > oleksandr.pet...@gmail.com
> > > >
> > > wrote:
> > >
> > > > > But for improving overall index read performance, I think improving
> > > base
> > > > table read perf  (because SAI/SASI executes LOTS of
> > > > SinglePartitionReadCommand after searching on-disk index) is more
> > > effective
> > > > than switching from Trie to Prefix BTree.
> > > >
> > > > I haven't suggested switching to Prefix B-Tree or any other
> structure,
> > > the
> > > > question was about rationale and motivation of picking one over the
> > > other,
> > > > which I am curious about for personal reasons/interests that lie
> > outside
> > > of
> > > > Cassandra. Having this listed in CEP could have been helpful for
> future
> > > > guidance. It's ok if this question is outside of the CEP scope.
> > > >
> > > > I also agree that there are many areas that require improvement
> around
> > > the
> > > > read/write path and 2i, many of which (even outside of base table
> > format
> > > or
> > > > read perf) can yield positive performance results.
> > > >
> > > > > FWIW, I personally look forward to receiving that contribution when
> > the
> > > > time is right.
> > > >
> > > > I am very excited for this contribution, too, and it looks like very
> > > solid
> > > > work.
> > > >
> > > > I have one more question, about "Upon resolving partition keys, rows
> > are
> > > > loaded using Cassandra’s internal partition read command across
> > SSTables
> > > > and are post filtered". One of the criticisms of SASI and reasons for
> > > > marking it as experimental was CASSANDRA-11990. I think switching to
> > row
> > > > offsets also has a huge impact on interaction with SPRC and has some
> > > > potential for optimisations. Question is: is this planned as a next
> > step?
> > > > If yes, how are we going to mark SAI as experimental until it gets
> > > > row offsets? Also, it is likely that index format is going to change
> > when
> > > > row offsets are added, so my concern is that we may have to 

Re: [DISCUSS] CEP-7 Storage Attached Index

2021-09-09 Thread Patrick McFadin
+1 on introducing this in an incremental manner and after reading through
CASSANDRA-16092 that seems like a perfect place to start. I see that work
on that Jira has stopped until direction for CEP-7 has been voted in.

I say start the vote and let's get this really valuable developer feature
underway.

Patrick

On Tue, Sep 7, 2021 at 10:40 AM Caleb Rackliffe 
wrote:

> So this thread stalled almost a year ago. (Wow, time flies when you're
> trying to release 4.0.) My synthesis of the conversation to this point is
> that while there are some open questions about testing
> methodology/"definition of done" and our choice of particular on-disk data
> structures, neither of these should be a serious obstacle to moving forward
> w/ a vote. Having said that, is there anything left around the CEP that we
> feel should prevent it from moving to a vote?
>
> In terms of how we would proceed from the point a vote passes, it seems
> like there have been enough concerns around the proposed/necessary breaking
> changes to the 2i API, that we will start development by introducing
> components as incrementally as possible into a long-running feature branch
> off trunk. (This work would likely start w/ *CASSANDRA-16092*, which we could
> resolve as a sub-task of the SAI epic without interfering with other trunk
> development likely destined for a 4.x minor, etc.)
>
> On Thu, Sep 24, 2020 at 2:47 AM Jasonstack Zhao Yang <
> jasonstack.z...@gmail.com> wrote:
>
> > >> Question is: is this planned as a next step?
> > >> If yes, how are we going to mark SAI as experimental until it gets
> > >> row offsets? Also, it is likely that index format is going to change
> > when
> > >> row offsets are added, so my concern is that we may have to support
> two
> > >> versions of a format for a smooth migration.
> >
> > The goal is to support row-level index when merging SAI, I will update
> the
> > CEP about it.
> >
> > >> I think switching to row
> > >> offsets also has a huge impact on interaction with SPRC and has some
> > >> potential for optimisations.
> >
> > Can you share more details on the optimizations?
> >
> >
> >
> > On Thu, 24 Sep 2020 at 15:20, Oleksandr Petrov <
> oleksandr.pet...@gmail.com
> > >
> > wrote:
> >
> > > > But for improving overall index read performance, I think improving
> > base
> > > table read perf  (because SAI/SASI executes LOTS of
> > > SinglePartitionReadCommand after searching on-disk index) is more
> > effective
> > > than switching from Trie to Prefix BTree.
> > >
> > > I haven't suggested switching to Prefix B-Tree or any other structure,
> > the
> > > question was about rationale and motivation of picking one over the
> > other,
> > > which I am curious about for personal reasons/interests that lie
> outside
> > of
> > > Cassandra. Having this listed in CEP could have been helpful for future
> > > guidance. It's ok if this question is outside of the CEP scope.
> > >
> > > I also agree that there are many areas that require improvement around
> > the
> > > read/write path and 2i, many of which (even outside of base table
> format
> > or
> > > read perf) can yield positive performance results.
> > >
> > > > FWIW, I personally look forward to receiving that contribution when
> the
> > > time is right.
> > >
> > > I am very excited for this contribution, too, and it looks like very
> > solid
> > > work.
> > >
> > > I have one more question, about "Upon resolving partition keys, rows
> are
> > > loaded using Cassandra’s internal partition read command across
> SSTables
> > > and are post filtered". One of the criticisms of SASI and reasons for
> > > marking it as experimental was CASSANDRA-11990. I think switching to
> row
> > > offsets also has a huge impact on interaction with SPRC and has some
> > > potential for optimisations. Question is: is this planned as a next
> step?
> > > If yes, how are we going to mark SAI as experimental until it gets
> > > row offsets? Also, it is likely that index format is going to change
> when
> > > row offsets are added, so my concern is that we may have to support two
> > > versions of a format for a smooth migration.
> > >
> > >
> > >
> > > On Thu, Sep 24, 2020 at 6:53 AM Jasonstack Zhao Yang <
> > > jasonstack.z...@gmail.com> wrote:
> > >
> > > > >> I think CEP should be more upfront with "eventually replace
> > > > >>  it" bit, since it raises the question about what the people who
> are
> > > > using
> > > > >> other index implementations can expect.
> > > >
> > > > Will update the CEP to emphasize: SAI will replace other indexes.
> > > >
> > > > >> Unfortunately, I do not have an
> > > > >> implementation sitting around for a direct comparison, but I can
> > > imagine
> > > > >> situations when B-Trees may perform better because of simpler
> > > > construction.
> > > > >> Maybe we should even consider prototyping a prefix B-Tree to have
> a
> > > more
> > > > >> fair comparison.
> > > >
> > > > As long 

Re: [DISCUSS] CEP-7 Storage Attached Index

2021-09-07 Thread Caleb Rackliffe
So this thread stalled almost a year ago. (Wow, time flies when you're
trying to release 4.0.) My synthesis of the conversation to this point is
that while there are some open questions about testing
methodology/"definition of done" and our choice of particular on-disk data
structures, neither of these should be a serious obstacle to moving forward
w/ a vote. Having said that, is there anything left around the CEP that we
feel should prevent it from moving to a vote?

In terms of how we would proceed from the point a vote passes, it seems
like there have been enough concerns around the proposed/necessary breaking
changes to the 2i API, that we will start development by introducing
components as incrementally as possible into a long-running feature branch
off trunk. (This work would likely start w/ *CASSANDRA-16092*, which we could
resolve as a sub-task of the SAI epic without interfering with other trunk
development likely destined for a 4.x minor, etc.)

On Thu, Sep 24, 2020 at 2:47 AM Jasonstack Zhao Yang <
jasonstack.z...@gmail.com> wrote:

> >> Question is: is this planned as a next step?
> >> If yes, how are we going to mark SAI as experimental until it gets
> >> row offsets? Also, it is likely that index format is going to change
> when
> >> row offsets are added, so my concern is that we may have to support two
> >> versions of a format for a smooth migration.
>
> The goal is to support row-level index when merging SAI, I will update the
> CEP about it.
>
> >> I think switching to row
> >> offsets also has a huge impact on interaction with SPRC and has some
> >> potential for optimisations.
>
> Can you share more details on the optimizations?
>
>
>
> On Thu, 24 Sep 2020 at 15:20, Oleksandr Petrov
> wrote:
>
> > > But for improving overall index read performance, I think improving
> base
> > table read perf  (because SAI/SASI executes LOTS of
> > SinglePartitionReadCommand after searching on-disk index) is more
> effective
> > than switching from Trie to Prefix BTree.
> >
> > I haven't suggested switching to Prefix B-Tree or any other structure,
> the
> > question was about rationale and motivation of picking one over the
> other,
> > which I am curious about for personal reasons/interests that lie outside
> of
> > Cassandra. Having this listed in CEP could have been helpful for future
> > guidance. It's ok if this question is outside of the CEP scope.
> >
> > I also agree that there are many areas that require improvement around
> the
> > read/write path and 2i, many of which (even outside of base table format
> or
> > read perf) can yield positive performance results.
> >
> > > FWIW, I personally look forward to receiving that contribution when the
> > time is right.
> >
> > I am very excited for this contribution, too, and it looks like very
> solid
> > work.
> >
> > I have one more question, about "Upon resolving partition keys, rows are
> > loaded using Cassandra’s internal partition read command across SSTables
> > and are post filtered". One of the criticisms of SASI and reasons for
> > marking it as experimental was CASSANDRA-11990. I think switching to row
> > offsets also has a huge impact on interaction with SPRC and has some
> > potential for optimisations. Question is: is this planned as a next step?
> > If yes, how are we going to mark SAI as experimental until it gets
> > row offsets? Also, it is likely that index format is going to change when
> > row offsets are added, so my concern is that we may have to support two
> > versions of a format for a smooth migration.
> >
> >
> >
> > On Thu, Sep 24, 2020 at 6:53 AM Jasonstack Zhao Yang <
> > jasonstack.z...@gmail.com> wrote:
> >
> > > >> I think CEP should be more upfront with "eventually replace
> > > >>  it" bit, since it raises the question about what the people who are
> > > using
> > > >> other index implementations can expect.
> > >
> > > Will update the CEP to emphasize: SAI will replace other indexes.
> > >
> > > >> Unfortunately, I do not have an
> > > >> implementation sitting around for a direct comparison, but I can
> > imagine
> > > >> situations when B-Trees may perform better because of simpler
> > > construction.
> > > >> Maybe we should even consider prototyping a prefix B-Tree to have a
> > more
> > > >> fair comparison.
> > >
> > > As long as prefix BTree supports range/prefix aggregation (which is
> used
> > to
> > > speed up
> > > range/prefix query when matching entire subtree), we can plug it in and
> > > compare. It won't
> > > affect the CEP design which focuses on sharing data across indexes and
> > > posting aggregation.
> > >
> > > But for improving overall index read performance, I think improving
> base
> > > table read perf
> > >  (because SAI/SASI executes LOTS of SinglePartitionReadCommand after
> > > searching on-disk index)
> > > is more effective than switching from Trie to Prefix BTree.
> > >
> > >
> > >
> > > On Thu, 24 Sep 2020 at 05:33, 

Re: [DISCUSS] CEP-7 Storage Attached Index

2020-09-24 Thread Jasonstack Zhao Yang
>> Question is: is this planned as a next step?
>> If yes, how are we going to mark SAI as experimental until it gets
>> row offsets? Also, it is likely that index format is going to change when
>> row offsets are added, so my concern is that we may have to support two
>> versions of a format for a smooth migration.

The goal is to support a row-level index by the time SAI is merged. I will
update the CEP to reflect that.

>> I think switching to row
>> offsets also has a huge impact on interaction with SPRC and has some
>> potential for optimisations.

Can you share more details on the optimizations?



On Thu, 24 Sep 2020 at 15:20, Oleksandr Petrov 
wrote:

> > But for improving overall index read performance, I think improving base
> table read perf  (because SAI/SASI executes LOTS of
> SinglePartitionReadCommand after searching on-disk index) is more effective
> than switching from Trie to Prefix BTree.
>
> I haven't suggested switching to Prefix B-Tree or any other structure, the
> question was about rationale and motivation of picking one over the other,
> which I am curious about for personal reasons/interests that lie outside of
> Cassandra. Having this listed in CEP could have been helpful for future
> guidance. It's ok if this question is outside of the CEP scope.
>
> I also agree that there are many areas that require improvement around the
> read/write path and 2i, many of which (even outside of base table format or
> read perf) can yield positive performance results.
>
> > FWIW, I personally look forward to receiving that contribution when the
> time is right.
>
> I am very excited for this contribution, too, and it looks like very solid
> work.
>
> I have one more question, about "Upon resolving partition keys, rows are
> loaded using Cassandra’s internal partition read command across SSTables
> and are post filtered". One of the criticisms of SASI and reasons for
> marking it as experimental was CASSANDRA-11990. I think switching to row
> offsets also has a huge impact on interaction with SPRC and has some
> potential for optimisations. Question is: is this planned as a next step?
> If yes, how are we going to mark SAI as experimental until it gets
> row offsets? Also, it is likely that index format is going to change when
> row offsets are added, so my concern is that we may have to support two
> versions of a format for a smooth migration.
>
>
>
> On Thu, Sep 24, 2020 at 6:53 AM Jasonstack Zhao Yang <
> jasonstack.z...@gmail.com> wrote:
>
> > >> I think CEP should be more upfront with "eventually replace
> > >>  it" bit, since it raises the question about what the people who are
> > using
> > >> other index implementations can expect.
> >
> > Will update the CEP to emphasize: SAI will replace other indexes.
> >
> > >> Unfortunately, I do not have an
> > >> implementation sitting around for a direct comparison, but I can
> imagine
> > >> situations when B-Trees may perform better because of simpler
> > construction.
> > >> Maybe we should even consider prototyping a prefix B-Tree to have a
> more
> > >> fair comparison.
> >
> > As long as prefix BTree supports range/prefix aggregation (which is used
> to
> > speed up
> > range/prefix query when matching entire subtree), we can plug it in and
> > compare. It won't
> > affect the CEP design which focuses on sharing data across indexes and
> > posting aggregation.
> >
> > But for improving overall index read performance, I think improving base
> > table read perf
> >  (because SAI/SASI executes LOTS of SinglePartitionReadCommand after
> > searching on-disk index)
> > is more effective than switching from Trie to Prefix BTree.
> >
> >
> >
> > On Thu, 24 Sep 2020 at 05:33, Benedict Elliott Smith <
> bened...@apache.org>
> > wrote:
> >
> > > FWIW, I personally look forward to receiving that contribution when the
> > > time is right.
> > >
> > > On 23/09/2020, 18:45, "Josh McKenzie"  wrote:
> > >
> > > talking about that would involve some bits of information DataStax
> > > might
> > > not be ready to share?
> > >
> > > At the risk of derailing, I've been poking and prodding this week
> at
> > we
> > > contributors at DS getting our act together w/a draft CEP for
> > donating
> > > the
> > > trie-based indices to the ASF project.
> > >
> > > More to come; the intention is certainly to contribute that code.
> The
> > > lack
> > > of a destination to merge it into (i.e. no 5.0-dev branch) is
> > removing
> > > significant urgency from the process as well (not to open a 3rd
> > > Pandora's
> > > box), but there's certainly an interrelatedness to the
> conversations
> > > going
> > > on.
> > >
> > > ---
> > > Josh McKenzie
> > >
> > >
> > >
> > > On Wed, Sep 23, 2020 at 12:48 PM, Caleb Rackliffe <
> > > calebrackli...@gmail.com>
> > > wrote:
> > >
> > > > As long as we can construct the on-disk indexes
> > efficiently/directly
> 

Re: [DISCUSS] CEP-7 Storage Attached Index

2020-09-24 Thread Oleksandr Petrov
> But for improving overall index read performance, I think improving base
table read perf  (because SAI/SASI executes LOTS of
SinglePartitionReadCommand after searching on-disk index) is more effective
than switching from Trie to Prefix BTree.

I haven't suggested switching to Prefix B-Tree or any other structure. The
question was about the rationale and motivation for picking one over the
other, which I am curious about for personal reasons/interests that lie
outside of Cassandra. Having this listed in the CEP could have been helpful
for future guidance. It's ok if this question is outside of the CEP scope.

I also agree that there are many areas that require improvement around the
read/write path and 2i, many of which (even outside of base table format or
read perf) can yield positive performance results.

> FWIW, I personally look forward to receiving that contribution when the
time is right.

I am very excited for this contribution, too, and it looks like very solid
work.

I have one more question, about "Upon resolving partition keys, rows are
loaded using Cassandra’s internal partition read command across SSTables
and are post filtered". One of the criticisms of SASI and reasons for
marking it as experimental was CASSANDRA-11990. I think switching to row
offsets also has a huge impact on interaction with SPRC and has some
potential for optimisations. Question is: is this planned as a next step?
If yes, how are we going to mark SAI as experimental until it gets
row offsets? Also, it is likely that index format is going to change when
row offsets are added, so my concern is that we may have to support two
versions of a format for a smooth migration.
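
To make the row-offset question concrete, here is a minimal sketch of the
difference between an index that resolves to partition keys (and then
post-filters) and one that resolves directly to rows. Everything in it is a
toy assumption for illustration: the table layout, the function names, and
the "rows read" counter are hypothetical and are not Cassandra or SAI APIs.

```python
# Toy base table: partition key -> {clustering key -> indexed column value}.
# Illustrative data only; not a Cassandra structure.
TABLE = {
    "p1": {1: "red", 2: "blue", 3: "red"},
    "p2": {1: "green", 2: "red"},
    "p3": {1: "blue"},
}


def build_partition_index(table):
    """Map each value to the set of partition keys containing it."""
    index = {}
    for pk, rows in table.items():
        for value in rows.values():
            index.setdefault(value, set()).add(pk)
    return index


def build_row_index(table):
    """Map each value to the set of (partition key, clustering key) pairs."""
    index = {}
    for pk, rows in table.items():
        for ck, value in rows.items():
            index.setdefault(value, set()).add((pk, ck))
    return index


def query_partition_level(table, index, value):
    """Resolve partition keys, load each whole partition, then post-filter."""
    rows_read, matches = 0, []
    for pk in sorted(index.get(value, ())):
        for ck, v in table[pk].items():  # every row in the partition is read
            rows_read += 1
            if v == value:               # post-filter step
                matches.append((pk, ck))
    return matches, rows_read


def query_row_level(table, index, value):
    """Resolve matching rows directly; only those rows need to be read."""
    matches = sorted(index.get(value, ()))
    return matches, len(matches)


part_matches, part_reads = query_partition_level(
    TABLE, build_partition_index(TABLE), "red")
row_matches, row_reads = query_row_level(TABLE, build_row_index(TABLE), "red")
print(part_matches, part_reads)
print(row_matches, row_reads)
```

Both queries return the same rows, but in this toy data the partition-level
path touches every row of each matching partition while the row-level path
touches only the matches. That gap grows with partition width.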



On Thu, Sep 24, 2020 at 6:53 AM Jasonstack Zhao Yang <
jasonstack.z...@gmail.com> wrote:

> >> I think CEP should be more upfront with "eventually replace
> >>  it" bit, since it raises the question about what the people who are
> using
> >> other index implementations can expect.
>
> Will update the CEP to emphasize: SAI will replace other indexes.
>
> >> Unfortunately, I do not have an
> >> implementation sitting around for a direct comparison, but I can imagine
> >> situations when B-Trees may perform better because of simpler
> construction.
> >> Maybe we should even consider prototyping a prefix B-Tree to have a more
> >> fair comparison.
>
> As long as prefix BTree supports range/prefix aggregation (which is used to
> speed up
> range/prefix query when matching entire subtree), we can plug it in and
> compare. It won't
> affect the CEP design which focuses on sharing data across indexes and
> posting aggregation.
>
> But for improving overall index read performance, I think improving base
> table read perf
>  (because SAI/SASI executes LOTS of SinglePartitionReadCommand after
> searching on-disk index)
> is more effective than switching from Trie to Prefix BTree.
>
>
>
> On Thu, 24 Sep 2020 at 05:33, Benedict Elliott Smith 
> wrote:
>
> > FWIW, I personally look forward to receiving that contribution when the
> > time is right.
> >
> > On 23/09/2020, 18:45, "Josh McKenzie"  wrote:
> >
> > talking about that would involve some bits of information DataStax
> > might
> > not be ready to share?
> >
> > At the risk of derailing, I've been poking and prodding this week at
> we
> > contributors at DS getting our act together w/a draft CEP for
> donating
> > the
> > trie-based indices to the ASF project.
> >
> > More to come; the intention is certainly to contribute that code. The
> > lack
> > of a destination to merge it into (i.e. no 5.0-dev branch) is
> removing
> > significant urgency from the process as well (not to open a 3rd
> > Pandora's
> > box), but there's certainly an interrelatedness to the conversations
> > going
> > on.
> >
> > ---
> > Josh McKenzie
> >
> >
> >
> > On Wed, Sep 23, 2020 at 12:48 PM, Caleb Rackliffe <
> > calebrackli...@gmail.com>
> > wrote:
> >
> > > As long as we can construct the on-disk indexes
> efficiently/directly
> > from
> > > a Memtable-attached index on flush, there's room to try other data
> > > structures. Most of the innovation in SAI is around the layout of
> > postings
> > > (something we can expand on if people are interested) and having a
> > > natively row-oriented design that scales w/ multiple indexed
> columns
> > on
> > > single SSTables. There are some broader implications of using the
> > trie that
> > > reach outside SAI itself, but talking about that would involve some
> > bits of
> > > information DataStax might not be ready to share?
> > >
> > > On Wed, Sep 23, 2020 at 11:00 AM Jeremiah D Jordan <
> jeremiah.jordan@
> > > gmail.com> wrote:
> > >
> > > Short question: looking forward, how are we going to maintain three
> > 2i
> > > implementations: SASI, SAI, and 2i?
> > >
> > > I think one of the goals stated in 

Re: [DISCUSS] CEP-7 Storage Attached Index

2020-09-23 Thread Jasonstack Zhao Yang
>> I think CEP should be more upfront with "eventually replace
>>  it" bit, since it raises the question about what the people who are
using
>> other index implementations can expect.

Will update the CEP to emphasize: SAI will replace other indexes.

>> Unfortunately, I do not have an
>> implementation sitting around for a direct comparison, but I can imagine
>> situations when B-Trees may perform better because of simpler
construction.
>> Maybe we should even consider prototyping a prefix B-Tree to have a more
>> fair comparison.

As long as a prefix B-Tree supports range/prefix aggregation (which is used to
speed up range/prefix queries when matching an entire subtree), we can plug it
in and compare. It won't affect the CEP design, which focuses on sharing data
across indexes and posting aggregation.

But for improving overall index read performance, I think improving base table
read perf (because SAI/SASI executes LOTS of SinglePartitionReadCommand after
searching the on-disk index) is more effective than switching from Trie to
Prefix BTree.
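
For readers unfamiliar with the term, the idea of prefix aggregation can be
sketched roughly as follows. This is a simplified in-memory illustration and
not the SAI on-disk format: the class names are hypothetical, and storing
aggregated postings at every node is an assumption made to keep the example
short.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)
    # Row ids for every term stored at or below this node ("aggregated" postings).
    subtree_postings: set = field(default_factory=set)


class PostingTrie:
    """Toy trie whose nodes carry the union of postings in their subtree."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, term, row_id):
        node = self.root
        node.subtree_postings.add(row_id)
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
            node.subtree_postings.add(row_id)

    def prefix_query(self, prefix):
        """Walk to the prefix node and return its aggregated postings.

        Because the whole subtree matches the prefix, no further
        traversal or merging of per-term postings is needed.
        """
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return set()
        return set(node.subtree_postings)


trie = PostingTrie()
for row_id, term in enumerate(["apple", "apply", "apron", "banana"]):
    trie.insert(term, row_id)

print(sorted(trie.prefix_query("app")))
```

Any structure that can answer "give me the postings for this whole subtree"
in one step, a prefix B-Tree included, could stand in for the trie here,
which is the sense in which it can be plugged in and compared.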



On Thu, 24 Sep 2020 at 05:33, Benedict Elliott Smith 
wrote:

> FWIW, I personally look forward to receiving that contribution when the
> time is right.
>
> On 23/09/2020, 18:45, "Josh McKenzie"  wrote:
>
> talking about that would involve some bits of information DataStax
> might
> not be ready to share?
>
> At the risk of derailing, I've been poking and prodding this week at we
> contributors at DS getting our act together w/a draft CEP for donating
> the
> trie-based indices to the ASF project.
>
> More to come; the intention is certainly to contribute that code. The
> lack
> of a destination to merge it into (i.e. no 5.0-dev branch) is removing
> significant urgency from the process as well (not to open a 3rd
> Pandora's
> box), but there's certainly an interrelatedness to the conversations
> going
> on.
>
> ---
> Josh McKenzie
>
>
>
> On Wed, Sep 23, 2020 at 12:48 PM, Caleb Rackliffe <
> calebrackli...@gmail.com>
> wrote:
>
> > As long as we can construct the on-disk indexes efficiently/directly
> from
> > a Memtable-attached index on flush, there's room to try other data
> > structures. Most of the innovation in SAI is around the layout of
> postings
> > (something we can expand on if people are interested) and having a
> > natively row-oriented design that scales w/ multiple indexed columns
> on
> > single SSTables. There are some broader implications of using the
> trie that
> > reach outside SAI itself, but talking about that would involve some
> bits of
> > information DataStax might not be ready to share?
> >
> > On Wed, Sep 23, 2020 at 11:00 AM Jeremiah D Jordan < jeremiah.jordan@
> > gmail.com> wrote:
> >
> > Short question: looking forward, how are we going to maintain three
> 2i
> > implementations: SASI, SAI, and 2i?
> >
> > I think one of the goals stated in the CEP is for SAI to have parity
> with
> > 2i such that it could eventually replace it.
> >
> > On Sep 23, 2020, at 10:34 AM, Oleksandr Petrov <
> >
> > oleksandr.pet...@gmail.com> wrote:
> >
> > Short question: looking forward, how are we going to maintain three
> 2i
> > implementations: SASI, SAI, and 2i?
> >
> > Another thing I think this CEP is missing is rationale and motivation
> > about why trie-based indexes were chosen over, say, B-Tree. We did have a
> > short discussion about this on Slack, but both arguments that I've heard
> > (space-saving and keeping a small subset of nodes in memory) work only for
> > the most primitive implementation of a B-Tree. Fully-occupied prefix B-Tree
> > can have similar properties. There's been a lot of research on B-Trees and
> > optimisations in those. Unfortunately, I do not have an implementation
> > sitting around for a direct comparison, but I can imagine situations when
> > B-Trees may perform better because of simpler construction.
> >
> > Maybe we should even consider prototyping a prefix B-Tree to have a more
> > fair comparison.
> >
> > Thank you,
> > -- Alex
> >
> > On Thu, Sep 10, 2020 at 9:12 AM Jasonstack Zhao Yang <
> jasonstack.zhao@
> > gmail.com> wrote:
> >
> > Thank you Patrick for hosting Cassandra Contributor Meeting for CEP-7
> >
> > SAI.
> >
> > The recorded video is available here:
> >
> > https://cwiki.apache.org/confluence/display/CASSANDRA/
> > 2020-09-01+Apache+Cassandra+Contributor+Meeting
> >
> > On Tue, 1 Sep 2020 at 14:34, Jasonstack Zhao Yang <
> jasonstack.zhao@gmail.
> > com>
> > wrote:
> >
> > Thank you, Charles and Patrick
> >
> > On Tue, 1 Sep 2020 at 04:56, 

Re: [DISCUSS] CEP-7 Storage Attached Index

2020-09-23 Thread Benedict Elliott Smith
FWIW, I personally look forward to receiving that contribution when the time is 
right.

On 23/09/2020, 18:45, "Josh McKenzie"  wrote:

talking about that would involve some bits of information DataStax might
not be ready to share?

At the risk of derailing, I've been poking and prodding this week at we
contributors at DS getting our act together w/a draft CEP for donating the
trie-based indices to the ASF project.

More to come; the intention is certainly to contribute that code. The lack
of a destination to merge it into (i.e. no 5.0-dev branch) is removing
significant urgency from the process as well (not to open a 3rd Pandora's
box), but there's certainly an interrelatedness to the conversations going
on.

---
Josh McKenzie



On Wed, Sep 23, 2020 at 12:48 PM, Caleb Rackliffe 
wrote:

> As long as we can construct the on-disk indexes efficiently/directly from
> a Memtable-attached index on flush, there's room to try other data
> structures. Most of the innovation in SAI is around the layout of postings
> (something we can expand on if people are interested) and having a
> natively row-oriented design that scales w/ multiple indexed columns on
> single SSTables. There are some broader implications of using the trie that
> reach outside SAI itself, but talking about that would involve some bits of
> information DataStax might not be ready to share?
>
> On Wed, Sep 23, 2020 at 11:00 AM Jeremiah D Jordan < jeremiah.jordan@
> gmail.com> wrote:
>
> Short question: looking forward, how are we going to maintain three 2i
> implementations: SASI, SAI, and 2i?
>
> I think one of the goals stated in the CEP is for SAI to have parity with
> 2i such that it could eventually replace it.
>
> On Sep 23, 2020, at 10:34 AM, Oleksandr Petrov <
>
> oleksandr.pet...@gmail.com> wrote:
>
> Short question: looking forward, how are we going to maintain three 2i
> implementations: SASI, SAI, and 2i?
>
> Another thing I think this CEP is missing is rationale and motivation
> about why trie-based indexes were chosen over, say, B-Tree. We did have a
> short discussion about this on Slack, but both arguments that I've heard
> (space-saving and keeping a small subset of nodes in memory) work only for
> the most primitive implementation of a B-Tree. Fully-occupied prefix B-Tree
> can have similar properties. There's been a lot of research on B-Trees and
> optimisations in those. Unfortunately, I do not have an implementation
> sitting around for a direct comparison, but I can imagine situations when
> B-Trees may perform better because of simpler construction.
>
> Maybe we should even consider prototyping a prefix B-Tree to have a more
> fair comparison.
>
> Thank you,
> -- Alex
>
> On Thu, Sep 10, 2020 at 9:12 AM Jasonstack Zhao Yang < jasonstack.zhao@
> gmail.com> wrote:
>
> Thank you Patrick for hosting Cassandra Contributor Meeting for CEP-7
>
> SAI.
>
> The recorded video is available here:
>
> https://cwiki.apache.org/confluence/display/CASSANDRA/
> 2020-09-01+Apache+Cassandra+Contributor+Meeting
>
> On Tue, 1 Sep 2020 at 14:34, Jasonstack Zhao Yang < jasonstack.zhao@gmail.
> com>
> wrote:
>
> Thank you, Charles and Patrick
>
> On Tue, 1 Sep 2020 at 04:56, Charles Cao  wrote:
>
> Thank you, Patrick!
>
> On Mon, Aug 31, 2020 at 12:59 PM Patrick McFadin 
> wrote:
>
> I just moved it to 8AM for this meeting to better accommodate APAC. Please
> see the update here:
>
> https://cwiki.apache.org/confluence/display/CASSANDRA/
> 2020-08-01+Apache+Cassandra+Contributor+Meeting
>
> Patrick
>
> On Mon, Aug 31, 2020 at 10:04 AM Charles Cao 
>
> wrote:
>
> Patrick,
>
> 11AM PST is a bad time for the people in the APAC timezone. Can we move it
> to 7 or 8AM PST to accommodate their needs?
>
> ~Charles
>
> On Fri, Aug 28, 2020 at 4:37 PM Patrick McFadin 
> wrote:
>
> Meeting scheduled.
>
> https://cwiki.apache.org/confluence/display/CASSANDRA/
> 2020-08-01+Apache+Cassandra+Contributor+Meeting
>
> Tuesday September 1st, 11AM PST. I added a basic bullet for the agenda but
> if there is more, edit away.
>
> Patrick
>
> On Thu, Aug 27, 2020 at 11:31 AM Jasonstack Zhao Yang < jasonstack.zhao@
> gmail.com> wrote:
>
> +1
>
> On Thu, 27 Aug 2020 at 04:52, Ekaterina Dimitrova <
>
> e.dimitr...@gmail.com>
>
> wrote:
>
> +1
>
> On Wed, 26 Aug 2020 at 16:48, 

Re: [DISCUSS] CEP-7 Storage Attached Index

2020-09-23 Thread Josh McKenzie
> talking about that would involve some bits of information DataStax might
> not be ready to share?

At the risk of derailing, I've been poking and prodding this week at us
contributors at DS to get our act together w/ a draft CEP for donating the
trie-based indices to the ASF project.

More to come; the intention is certainly to contribute that code. The lack
of a destination to merge it into (i.e. no 5.0-dev branch) is removing
significant urgency from the process as well (not to open a 3rd Pandora's
box), but there's certainly an interrelatedness to the conversations going
on.

---
Josh McKenzie



On Wed, Sep 23, 2020 at 12:48 PM, Caleb Rackliffe 
wrote:

> As long as we can construct the on-disk indexes efficiently/directly from
> a Memtable-attached index on flush, there's room to try other data
> structures. Most of the innovation in SAI is around the layout of postings
> (something we can expand on if people are interested) and having a
> natively row-oriented design that scales w/ multiple indexed columns on
> single SSTables. There are some broader implications of using the trie that
> reach outside SAI itself, but talking about that would involve some bits of
> information DataStax might not be ready to share?
>
> On Wed, Sep 23, 2020 at 11:00 AM Jeremiah D Jordan < jeremiah.jordan@
> gmail.com> wrote:
>
> Short question: looking forward, how are we going to maintain three 2i
> implementations: SASI, SAI, and 2i?
>
> I think one of the goals stated in the CEP is for SAI to have parity with
> 2i such that it could eventually replace it.
>
> On Sep 23, 2020, at 10:34 AM, Oleksandr Petrov <
>
> oleksandr.pet...@gmail.com> wrote:
>
> Short question: looking forward, how are we going to maintain three 2i
> implementations: SASI, SAI, and 2i?
>
> Another thing I think this CEP is missing is rationale and motivation
> about why trie-based indexes were chosen over, say, B-Tree. We did have a
> short discussion about this on Slack, but both arguments that I've heard
> (space-saving and keeping a small subset of nodes in memory) work only for
> the most primitive implementation of a B-Tree. Fully-occupied prefix B-Tree
> can have similar properties. There's been a lot of research on B-Trees and
> optimisations in those. Unfortunately, I do not have an implementation
> sitting around for a direct comparison, but I can imagine situations when
> B-Trees may perform better because of simpler construction.
>
> Maybe we should even consider prototyping a prefix B-Tree to have a more
> fair comparison.
>
> Thank you,
> -- Alex
>
> On Thu, Sep 10, 2020 at 9:12 AM Jasonstack Zhao Yang <
> jasonstack.zhao@gmail.com> wrote:
>
> Thank you Patrick for hosting Cassandra Contributor Meeting for CEP-7 SAI.
>
> The recorded video is available here:
>
> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-09-01+Apache+Cassandra+Contributor+Meeting
>
> On Tue, 1 Sep 2020 at 14:34, Jasonstack Zhao Yang <
> jasonstack.zhao@gmail.com> wrote:
>
> Thank you, Charles and Patrick
>
> On Tue, 1 Sep 2020 at 04:56, Charles Cao  wrote:
>
> Thank you, Patrick!
>
> On Mon, Aug 31, 2020 at 12:59 PM Patrick McFadin wrote:
>
> I just moved it to 8AM for this meeting to better accommodate APAC.
> Please see the update here:
>
> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
>
> Patrick
>
> On Mon, Aug 31, 2020 at 10:04 AM Charles Cao wrote:
>
> Patrick,
>
> 11AM PST is a bad time for the people in the APAC timezone. Can we move it
> to 7 or 8AM PST in the morning to accommodate their needs ?
>
> ~Charles
>
> On Fri, Aug 28, 2020 at 4:37 PM Patrick McFadin wrote:
>
> Meeting scheduled.
>
> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
>
> Tuesday September 1st, 11AM PST. I added a basic bullet for the agenda but
> if there is more, edit away.
>
> Patrick
>
> On Thu, Aug 27, 2020 at 11:31 AM Jasonstack Zhao Yang <
> jasonstack.zhao@gmail.com> wrote:
>
> +1
>
> On Thu, 27 Aug 2020 at 04:52, Ekaterina Dimitrova <
> e.dimitr...@gmail.com> wrote:
>
> +1
>
> On Wed, 26 Aug 2020 at 16:48, Caleb Rackliffe <
> calebrackli...@gmail.com> wrote:
>
> +1
>
> On Wed, Aug 26, 2020, 3:45 PM Patrick McFadin <
> pmcfa...@gmail.com> wrote:
>
> This is related to the discussion Jordan and I had about the contributor
> Zoom call. Instead of open mic for any issue, call it based on a
> discussion thread or threads for higher bandwidth discussion.
>
> I would be happy to schedule on for next week to specifically discuss
> CEP-7. I can attach the recorded call to the CEP after.
>
> +1 or -1?
>
> Patrick
>
> On Tue, Aug 25, 2020 at 7:03 AM Joshua McKenzie <
> jmcken...@apache.org> wrote:
>
> Does community plan to open another discussion or CEP on

Re: [DISCUSS] CEP-7 Storage Attached Index

2020-09-23 Thread Caleb Rackliffe
As long as we can construct the on-disk indexes efficiently/directly from a
Memtable-attached index on flush, there's room to try other data
structures. Most of the innovation in SAI is around the layout of postings
(something we can expand on if people are interested) and having a natively
row-oriented design that scales w/ multiple indexed columns on single
SSTables. There are some broader implications of using the trie that reach
outside SAI itself, but talking about that would involve some bits of
information DataStax might not be ready to share?

On Wed, Sep 23, 2020 at 11:00 AM Jeremiah D Jordan <
jeremiah.jor...@gmail.com> wrote:

> > Short question: looking forward, how are we going to maintain three 2i
> > implementations: SASI, SAI, and 2i?
>
> I think one of the goals stated in the CEP is for SAI to have parity with
> 2i such that it could eventually replace it.
>
>
> > On Sep 23, 2020, at 10:34 AM, Oleksandr Petrov <
> oleksandr.pet...@gmail.com> wrote:
> >
> > Short question: looking forward, how are we going to maintain three 2i
> > implementations: SASI, SAI, and 2i?
> >
> > Another thing I think this CEP is missing is rationale and motivation
> > about why trie-based indexes were chosen over, say, B-Tree. We did have a
> > short discussion about this on Slack, but both arguments that I've heard
> > (space-saving and keeping a small subset of nodes in memory) work only
> for
> > the most primitive implementation of a B-Tree. Fully-occupied prefix
> B-Tree
> > can have similar properties. There's been a lot of research on B-Trees
> and
> > optimisations in those. Unfortunately, I do not have an
> > implementation sitting around for a direct comparison, but I can imagine
> > situations when B-Trees may perform better because of simpler
> construction.
> > Maybe we should even consider prototyping a prefix B-Tree to have a more
> > fair comparison.
> >
> > Thank you,
> > -- Alex
> >
> >
> >
> > On Thu, Sep 10, 2020 at 9:12 AM Jasonstack Zhao Yang <
> > jasonstack.z...@gmail.com> wrote:
> >
> >> Thank you Patrick for hosting Cassandra Contributor Meeting for CEP-7
> SAI.
> >>
> >> The recorded video is available here:
> >>
> >>
> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-09-01+Apache+Cassandra+Contributor+Meeting
> >>
> >> On Tue, 1 Sep 2020 at 14:34, Jasonstack Zhao Yang <
> >> jasonstack.z...@gmail.com>
> >> wrote:
> >>
> >>> Thank you, Charles and Patrick
> >>>
> >>> On Tue, 1 Sep 2020 at 04:56, Charles Cao wrote:
> >>>
> >>>> Thank you, Patrick!
> >>>>
> >>>> On Mon, Aug 31, 2020 at 12:59 PM Patrick McFadin
> >>>> wrote:
> >>>> >
> >>>> > I just moved it to 8AM for this meeting to better accommodate APAC.
> >>>> > Please see the update here:
> >>>> >
> >>>> > https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
> >>>> >
> >>>> > Patrick
> >>>> >
> >>>> > On Mon, Aug 31, 2020 at 10:04 AM Charles Cao
> >>>> > wrote:
> >>>> >
> >>>> >> Patrick,
> >>>> >>
> >>>> >> 11AM PST is a bad time for the people in the APAC timezone. Can we
> >>>> >> move it to 7 or 8AM PST in the morning to accommodate their needs ?
> >>>> >>
> >>>> >> ~Charles
> >>>> >>
> >>>> >> On Fri, Aug 28, 2020 at 4:37 PM Patrick McFadin
> >>>> >> wrote:
> >>>> >>>
> >>>> >>> Meeting scheduled.
> >>>> >>>
> >>>> >>> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
> >>>> >>>
> >>>> >>> Tuesday September 1st, 11AM PST. I added a basic bullet for the
> >>>> >>> agenda but if there is more, edit away.
> >>>> >>>
> >>>> >>> Patrick
> >>>> >>>
> >>>> >>> On Thu, Aug 27, 2020 at 11:31 AM Jasonstack Zhao Yang <
> >>>> >>> jasonstack.z...@gmail.com> wrote:
> >>>> >>>
> >>>> >>>> +1
> >>>> >>>>
> >>>> >>>> On Thu, 27 Aug 2020 at 04:52, Ekaterina Dimitrova <
> >>>> >>>> e.dimitr...@gmail.com> wrote:
> >>>> >>>>
> >>>> >>>> > +1
> >>>> >>>> >
> >>>> >>>> > On Wed, 26 Aug 2020 at 16:48, Caleb Rackliffe <
> >>>> >>>> > calebrackli...@gmail.com> wrote:
> >>>> >>>> >
> >>>> >>>> >> +1
> >>>> >>>> >>
> >>>> >>>> >> On Wed, Aug 26, 2020, 3:45 PM Patrick McFadin <
> >>>> >>>> >> pmcfa...@gmail.com> wrote:
> >>>> >>>> >>
> >>>> >>>> >>> This is related to the discussion Jordan and I had about the
> >>>> >>>> >>> contributor Zoom call. Instead of open mic for any issue, call
> >>>> >>>> >>> it based on a discussion thread or threads for higher bandwidth
> >>>> >>>> >>> discussion.
> >>>> >>>> >>>
> >>>> >>>> >>> I would be happy to schedule on for next week to specifically
> >>>> >>>> >>> discuss CEP-7. I can attach the recorded call to the CEP after.
> >>>> >>>> >>>
> >>>> >>>> >>> +1 or -1?
> >>>> >>>> >>>
> >>>> >>>> >>> Patrick
> >>>> >>>> >>>
> >>>> >>>> >>> On Tue, Aug 25, 2020 at 7:03 AM Joshua 

Re: [DISCUSS] CEP-7 Storage Attached Index

2020-09-23 Thread Oleksandr Petrov
I did see a bit about "future parity and beyond" which is more or less an
obvious goal. I think CEP should be more upfront with "eventually replace
it" bit, since it raises the question about what the people who are using
other index implementations can expect.

On Wed, Sep 23, 2020 at 6:00 PM Jeremiah D Jordan 
wrote:

> > Short question: looking forward, how are we going to maintain three 2i
> > implementations: SASI, SAI, and 2i?
>
> I think one of the goals stated in the CEP is for SAI to have parity with
> 2i such that it could eventually replace it.
>
>
> > On Sep 23, 2020, at 10:34 AM, Oleksandr Petrov <
> oleksandr.pet...@gmail.com> wrote:
> >
> > Short question: looking forward, how are we going to maintain three 2i
> > implementations: SASI, SAI, and 2i?
> >
> > Another thing I think this CEP is missing is rationale and motivation
> > about why trie-based indexes were chosen over, say, B-Tree. We did have a
> > short discussion about this on Slack, but both arguments that I've heard
> > (space-saving and keeping a small subset of nodes in memory) work only
> for
> > the most primitive implementation of a B-Tree. Fully-occupied prefix
> B-Tree
> > can have similar properties. There's been a lot of research on B-Trees
> and
> > optimisations in those. Unfortunately, I do not have an
> > implementation sitting around for a direct comparison, but I can imagine
> > situations when B-Trees may perform better because of simpler
> construction.
> > Maybe we should even consider prototyping a prefix B-Tree to have a more
> > fair comparison.
> >
> > Thank you,
> > -- Alex
> >
> >
> >
> > On Thu, Sep 10, 2020 at 9:12 AM Jasonstack Zhao Yang <
> > jasonstack.z...@gmail.com> wrote:
> >
> >> Thank you Patrick for hosting Cassandra Contributor Meeting for CEP-7
> SAI.
> >>
> >> The recorded video is available here:
> >>
> >>
> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-09-01+Apache+Cassandra+Contributor+Meeting
> >>
> >> On Tue, 1 Sep 2020 at 14:34, Jasonstack Zhao Yang <
> >> jasonstack.z...@gmail.com>
> >> wrote:
> >>
> >>> Thank you, Charles and Patrick
> >>>
> >>> On Tue, 1 Sep 2020 at 04:56, Charles Cao wrote:
> >>>
> >>>> Thank you, Patrick!
> >>>>
> >>>> On Mon, Aug 31, 2020 at 12:59 PM Patrick McFadin
> >>>> wrote:
> >>>> >
> >>>> > I just moved it to 8AM for this meeting to better accommodate APAC.
> >>>> > Please see the update here:
> >>>> >
> >>>> > https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
> >>>> >
> >>>> > Patrick
> >>>> >
> >>>> > On Mon, Aug 31, 2020 at 10:04 AM Charles Cao
> >>>> > wrote:
> >>>> >
> >>>> >> Patrick,
> >>>> >>
> >>>> >> 11AM PST is a bad time for the people in the APAC timezone. Can we
> >>>> >> move it to 7 or 8AM PST in the morning to accommodate their needs ?
> >>>> >>
> >>>> >> ~Charles
> >>>> >>
> >>>> >> On Fri, Aug 28, 2020 at 4:37 PM Patrick McFadin
> >>>> >> wrote:
> >>>> >>>
> >>>> >>> Meeting scheduled.
> >>>> >>>
> >>>> >>> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
> >>>> >>>
> >>>> >>> Tuesday September 1st, 11AM PST. I added a basic bullet for the
> >>>> >>> agenda but if there is more, edit away.
> >>>> >>>
> >>>> >>> Patrick
> >>>> >>>
> >>>> >>> On Thu, Aug 27, 2020 at 11:31 AM Jasonstack Zhao Yang <
> >>>> >>> jasonstack.z...@gmail.com> wrote:
> >>>> >>>
> >>>> >>>> +1
> >>>> >>>>
> >>>> >>>> On Thu, 27 Aug 2020 at 04:52, Ekaterina Dimitrova <
> >>>> >>>> e.dimitr...@gmail.com> wrote:
> >>>> >>>>
> >>>> >>>> > +1
> >>>> >>>> >
> >>>> >>>> > On Wed, 26 Aug 2020 at 16:48, Caleb Rackliffe <
> >>>> >>>> > calebrackli...@gmail.com> wrote:
> >>>> >>>> >
> >>>> >>>> >> +1
> >>>> >>>> >>
> >>>> >>>> >> On Wed, Aug 26, 2020, 3:45 PM Patrick McFadin <
> >>>> >>>> >> pmcfa...@gmail.com> wrote:
> >>>> >>>> >>
> >>>> >>>> >>> This is related to the discussion Jordan and I had about the
> >>>> >>>> >>> contributor Zoom call. Instead of open mic for any issue, call
> >>>> >>>> >>> it based on a discussion thread or threads for higher bandwidth
> >>>> >>>> >>> discussion.
> >>>> >>>> >>>
> >>>> >>>> >>> I would be happy to schedule on for next week to specifically
> >>>> >>>> >>> discuss CEP-7. I can attach the recorded call to the CEP after.
> >>>> >>>> >>>
> >>>> >>>> >>> +1 or -1?
> >>>> >>>> >>>
> >>>> >>>> >>> Patrick
> >>>> >>>> >>>
> >>>> >>>> >>> On Tue, Aug 25, 2020 at 7:03 AM Joshua McKenzie <
> >>>> >>>> >>> jmcken...@apache.org> wrote:
> >>>> >>>> >>>
> >>>> >>>> >>>> > Does community plan to open another discussion or CEP on
> >>>> >>>> >>>> > modularization?
> >>>> >>>> >>>>
> >>>> >>>> >>>> We 

Re: [DISCUSS] CEP-7 Storage Attached Index

2020-09-23 Thread Oleksandr Petrov
Short question: looking forward, how are we going to maintain three 2i
implementations: SASI, SAI, and 2i?

Another thing I think this CEP is missing is rationale and motivation
about why trie-based indexes were chosen over, say, B-Tree. We did have a
short discussion about this on Slack, but both arguments that I've heard
(space-saving and keeping a small subset of nodes in memory) work only for
the most primitive implementation of a B-Tree. Fully-occupied prefix B-Tree
can have similar properties. There's been a lot of research on B-Trees and
optimisations in those. Unfortunately, I do not have an
implementation sitting around for a direct comparison, but I can imagine
situations when B-Trees may perform better because of simpler construction.
Maybe we should even consider prototyping a prefix B-Tree to have a more
fair comparison.
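
To make the space-saving argument a bit more concrete, here is a minimal
sketch in Python. It is only an illustration: the helper names
front_coded_size and trie_edge_count and the sample terms are made up for
this example and are not taken from SAI or any Cassandra code. It counts
the key characters a plain trie stores and the key characters a
front-coded (prefix-compressed) sorted run of terms stores, as a prefix
B-Tree leaf might. Pointers, per-node overhead and page fill factor are
deliberately ignored, so it says nothing about real on-disk sizes.

```python
def front_coded_size(sorted_terms):
    """Characters stored when each term keeps only the suffix that differs
    from its predecessor in sorted order (front coding)."""
    total = 0
    prev = ""
    for term in sorted_terms:
        shared = 0
        limit = min(len(prev), len(term))
        # Length of the common prefix with the previous term.
        while shared < limit and prev[shared] == term[shared]:
            shared += 1
        total += len(term) - shared
        prev = term
    return total


def trie_edge_count(terms):
    """Number of single-character edges in an uncompressed trie."""
    root = {}
    edges = 0
    for term in terms:
        node = root
        for ch in term:
            if ch not in node:
                node[ch] = {}
                edges += 1
            node = node[ch]
    return edges


# Hypothetical sample terms, chosen only to share some prefixes.
terms = sorted(["cassandra", "cassette", "cast", "index", "indexed", "indices"])
raw_size = sum(len(t) for t in terms)

print("raw characters:         ", raw_size)
print("front-coded characters: ", front_coded_size(terms))
print("trie edges:             ", trie_edge_count(terms))
```

For sorted, distinct terms the two counts coincide, because the longest
prefix a term shares with any earlier term is the one it shares with its
immediate predecessor. Under these simplifying assumptions the key
material alone does not separate the two structures, so any real
difference would have to come from node layout and pointer overhead.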

Thank you,
-- Alex



On Thu, Sep 10, 2020 at 9:12 AM Jasonstack Zhao Yang <
jasonstack.z...@gmail.com> wrote:

> Thank you Patrick for hosting Cassandra Contributor Meeting for CEP-7 SAI.
>
> The recorded video is available here:
>
> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-09-01+Apache+Cassandra+Contributor+Meeting
>
> On Tue, 1 Sep 2020 at 14:34, Jasonstack Zhao Yang <
> jasonstack.z...@gmail.com>
> wrote:
>
> > Thank you, Charles and Patrick
> >
> > On Tue, 1 Sep 2020 at 04:56, Charles Cao  wrote:
> >
> >> Thank you, Patrick!
> >>
> >> On Mon, Aug 31, 2020 at 12:59 PM Patrick McFadin 
> >> wrote:
> >> >
> >> > I just moved it to 8AM for this meeting to better accommodate APAC.
> >> Please
> >> > see the update here:
> >> >
> >>
> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
> >> >
> >> > Patrick
> >> >
> >> > On Mon, Aug 31, 2020 at 10:04 AM Charles Cao 
> >> wrote:
> >> >
> >> > > Patrick,
> >> > >
> >> > > 11AM PST is a bad time for the people in the APAC timezone. Can we
> >> > > move it to 7 or 8AM PST in the morning to accommodate their needs ?
> >> > >
> >> > > ~Charles
> >> > >
> >> > > On Fri, Aug 28, 2020 at 4:37 PM Patrick McFadin
> >> > > wrote:
> >> > > >
> >> > > > Meeting scheduled.
> >> > > >
> >> > >
> >>
> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
> >> > > >
> >> > > > Tuesday September 1st, 11AM PST. I added a basic bullet for the
> >> agenda
> >> > > but
> >> > > > if there is more, edit away.
> >> > > >
> >> > > > Patrick
> >> > > >
> >> > > > On Thu, Aug 27, 2020 at 11:31 AM Jasonstack Zhao Yang <
> >> > > > jasonstack.z...@gmail.com> wrote:
> >> > > >
> >> > > > > +1
> >> > > > >
> >> > > > > On Thu, 27 Aug 2020 at 04:52, Ekaterina Dimitrova <
> >> > > e.dimitr...@gmail.com>
> >> > > > > wrote:
> >> > > > >
> >> > > > > > +1
> >> > > > > >
> >> > > > > > On Wed, 26 Aug 2020 at 16:48, Caleb Rackliffe <
> >> > > calebrackli...@gmail.com>
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > +1
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On Wed, Aug 26, 2020, 3:45 PM Patrick McFadin <
> >> pmcfa...@gmail.com>
> >> > > > > > wrote:
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > > This is related to the discussion Jordan and I had about
> the
> >> > > > > > contributor
> >> > > > > > >
> >> > > > > > > > Zoom call. Instead of open mic for any issue, call it
> based
> >> on a
> >> > > > > > > discussion
> >> > > > > > >
> >> > > > > > > > thread or threads for higher bandwidth discussion.
> >> > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > > > > I would be happy to schedule on for next week to
> >> specifically
> >> > > discuss
> >> > > > > > >
> >> > > > > > > > CEP-7. I can attach the recorded call to the CEP after.
> >> > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > > > > +1 or -1?
> >> > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > > > > Patrick
> >> > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > > > > On Tue, Aug 25, 2020 at 7:03 AM Joshua McKenzie <
> >> > > > > jmcken...@apache.org>
> >> > > > > > >
> >> > > > > > > > wrote:
> >> > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > > > > > >
> >> > > > > > >
> >> > > > > > > > > > Does community plan to open another discussion or CEP
> on
> >> > > > > > >
> >> > > > > > > > modularization?
> >> > > > > > >
> >> > > > > > > > >
> >> > > > > > >
> >> > > > > > > > > We probably should have a discussion on the ML or
> monthly
> >> > > contrib
> >> > > > > > call
> >> > > > > > >
> >> > > > > > > > > about it first to see how aligned the interested
> >> contributors
> >> > > are.
> >> > > > > > > Could
> >> > > > > > >
> >> > > > > > > > do
> >> > > > > > >
> >> > > > > > > > > that through CEP as well but CEP's (at least thus far
> >> sans k8s
> >> > > > > > > operator)
> >> > > > > > >
> >> > > > > > > > > tend to start with a strong, deeply thought out point of
> >> view
> >> > > being
> >> > > > > > >
> >> > > > > > > > > expressed.
> >> > > > > > >
> >> > 

Re: [DISCUSS] CEP-7 Storage Attached Index

2020-09-10 Thread Jasonstack Zhao Yang
Thank you Patrick for hosting Cassandra Contributor Meeting for CEP-7 SAI.

The recorded video is available here:
https://cwiki.apache.org/confluence/display/CASSANDRA/2020-09-01+Apache+Cassandra+Contributor+Meeting

On Tue, 1 Sep 2020 at 14:34, Jasonstack Zhao Yang 
wrote:

> Thank you, Charles and Patrick
>
> On Tue, 1 Sep 2020 at 04:56, Charles Cao  wrote:
>
>> Thank you, Patrick!
>>
>> On Mon, Aug 31, 2020 at 12:59 PM Patrick McFadin 
>> wrote:
>> >
>> > I just moved it to 8AM for this meeting to better accommodate APAC.
>> Please
>> > see the update here:
>> >
>> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
>> >
>> > Patrick
>> >
>> > On Mon, Aug 31, 2020 at 10:04 AM Charles Cao 
>> wrote:
>> >
>> > > Patrick,
>> > >
>> > > 11AM PST is a bad time for the people in the APAC timezone. Can we
>> > > move it to 7 or 8AM PST in the morning to accommodate their needs ?
>> > >
>> > > ~Charles
>> > >
>> > > On Fri, Aug 28, 2020 at 4:37 PM Patrick McFadin 
>> > > wrote:
>> > > >
>> > > > Meeting scheduled.
>> > > >
>> > >
>> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
>> > > >
>> > > > Tuesday September 1st, 11AM PST. I added a basic bullet for the
>> agenda
>> > > but
>> > > > if there is more, edit away.
>> > > >
>> > > > Patrick
>> > > >
>> > > > On Thu, Aug 27, 2020 at 11:31 AM Jasonstack Zhao Yang <
>> > > > jasonstack.z...@gmail.com> wrote:
>> > > >
>> > > > > +1
>> > > > >
>> > > > > On Thu, 27 Aug 2020 at 04:52, Ekaterina Dimitrova <
>> > > e.dimitr...@gmail.com>
>> > > > > wrote:
>> > > > >
>> > > > > > +1
>> > > > > >
>> > > > > > On Wed, 26 Aug 2020 at 16:48, Caleb Rackliffe <
>> > > calebrackli...@gmail.com>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > +1
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > On Wed, Aug 26, 2020, 3:45 PM Patrick McFadin <
>> pmcfa...@gmail.com>
>> > > > > > wrote:
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > > This is related to the discussion Jordan and I had about the
>> > > > > > contributor
>> > > > > > >
>> > > > > > > > Zoom call. Instead of open mic for any issue, call it based
>> on a
>> > > > > > > discussion
>> > > > > > >
>> > > > > > > > thread or threads for higher bandwidth discussion.
>> > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > > > > I would be happy to schedule on for next week to
>> specifically
>> > > discuss
>> > > > > > >
>> > > > > > > > CEP-7. I can attach the recorded call to the CEP after.
>> > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > > > > +1 or -1?
>> > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > > > > Patrick
>> > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > > > > On Tue, Aug 25, 2020 at 7:03 AM Joshua McKenzie <
>> > > > > jmcken...@apache.org>
>> > > > > > >
>> > > > > > > > wrote:
>> > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > > > > > >
>> > > > > > >
>> > > > > > > > > > Does community plan to open another discussion or CEP on
>> > > > > > >
>> > > > > > > > modularization?
>> > > > > > >
>> > > > > > > > >
>> > > > > > >
>> > > > > > > > > We probably should have a discussion on the ML or monthly
>> > > contrib
>> > > > > > call
>> > > > > > >
>> > > > > > > > > about it first to see how aligned the interested
>> contributors
>> > > are.
>> > > > > > > Could
>> > > > > > >
>> > > > > > > > do
>> > > > > > >
>> > > > > > > > > that through CEP as well but CEP's (at least thus far
>> sans k8s
>> > > > > > > operator)
>> > > > > > >
>> > > > > > > > > tend to start with a strong, deeply thought out point of
>> view
>> > > being
>> > > > > > >
>> > > > > > > > > expressed.
>> > > > > > >
>> > > > > > > > >
>> > > > > > >
>> > > > > > > > > On Tue, Aug 25, 2020 at 3:26 AM Jasonstack Zhao Yang <
>> > > > > > >
>> > > > > > > > > jasonstack.z...@gmail.com> wrote:
>> > > > > > >
>> > > > > > > > >
>> > > > > > >
>> > > > > > > > > > >>> SASI's performance, specifically the search in the
>> B+
>> > > tree
>> > > > > > >
>> > > > > > > > component,
>> > > > > > >
>> > > > > > > > > > >>> depends a lot on the component file's header being
>> > > available
>> > > > > in
>> > > > > > > the
>> > > > > > >
>> > > > > > > > > > >>> pagecache. SASI benefits from (needs) nodes with
>> lots of
>> > > RAM.
>> > > > > > Is
>> > > > > > >
>> > > > > > > > SAI
>> > > > > > >
>> > > > > > > > > > bound
>> > > > > > >
>> > > > > > > > > > >>> to this same or similar limitation?
>> > > > > > >
>> > > > > > > > > >
>> > > > > > >
>> > > > > > > > > > SAI also benefits from larger memory because SAI puts
>> block
>> > > info
>> > > > > on
>> > > > > > >
>> > > > > > > > heap
>> > > > > > >
>> > > > > > > > > > for searching on-disk components and having cross-index
>> > > files on
>> > > > > > page
>> > > > > > >
>> > > > > > > > > cache
>> > > > > > >
>> > > > > > > > > > improves read performance of different indexes on the
>> same
>> > > table.
>> > > > > > >
>> > > > 

Re: [DISCUSS] CEP-7 Storage Attached Index

2020-09-01 Thread Jasonstack Zhao Yang
Thank you, Charles and Patrick

On Tue, 1 Sep 2020 at 04:56, Charles Cao  wrote:

> Thank you, Patrick!
>
> On Mon, Aug 31, 2020 at 12:59 PM Patrick McFadin 
> wrote:
> >
> > I just moved it to 8AM for this meeting to better accommodate APAC.
> Please
> > see the update here:
> >
> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
> >
> > Patrick
> >
> > On Mon, Aug 31, 2020 at 10:04 AM Charles Cao 
> wrote:
> >
> > > Patrick,
> > >
> > > 11AM PST is a bad time for the people in the APAC timezone. Can we
> > > move it to 7 or 8AM PST in the morning to accommodate their needs ?
> > >
> > > ~Charles
> > >
> > > On Fri, Aug 28, 2020 at 4:37 PM Patrick McFadin 
> > > wrote:
> > > >
> > > > Meeting scheduled.
> > > >
> > >
> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
> > > >
> > > > Tuesday September 1st, 11AM PST. I added a basic bullet for the
> agenda
> > > but
> > > > if there is more, edit away.
> > > >
> > > > Patrick
> > > >
> > > > On Thu, Aug 27, 2020 at 11:31 AM Jasonstack Zhao Yang <
> > > > jasonstack.z...@gmail.com> wrote:
> > > >
> > > > > +1
> > > > >
> > > > > On Thu, 27 Aug 2020 at 04:52, Ekaterina Dimitrova <
> > > e.dimitr...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > +1
> > > > > >
> > > > > > On Wed, 26 Aug 2020 at 16:48, Caleb Rackliffe <
> > > calebrackli...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > +1
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Aug 26, 2020, 3:45 PM Patrick McFadin <
> pmcfa...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > This is related to the discussion Jordan and I had about the
> > > > > > contributor
> > > > > > >
> > > > > > > > Zoom call. Instead of open mic for any issue, call it based
> on a
> > > > > > > discussion
> > > > > > >
> > > > > > > > thread or threads for higher bandwidth discussion.
> > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > > > I would be happy to schedule on for next week to specifically
> > > discuss
> > > > > > >
> > > > > > > > CEP-7. I can attach the recorded call to the CEP after.
> > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > > > +1 or -1?
> > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > > > Patrick
> > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > > > On Tue, Aug 25, 2020 at 7:03 AM Joshua McKenzie <
> > > > > jmcken...@apache.org>
> > > > > > >
> > > > > > > > wrote:
> > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > > > > >
> > > > > > >
> > > > > > > > > > Does community plan to open another discussion or CEP on
> > > > > > >
> > > > > > > > modularization?
> > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > > > > > We probably should have a discussion on the ML or monthly
> > > contrib
> > > > > > call
> > > > > > >
> > > > > > > > > about it first to see how aligned the interested
> contributors
> > > are.
> > > > > > > Could
> > > > > > >
> > > > > > > > do
> > > > > > >
> > > > > > > > > that through CEP as well but CEP's (at least thus far sans
> k8s
> > > > > > > operator)
> > > > > > >
> > > > > > > > > tend to start with a strong, deeply thought out point of
> view
> > > being
> > > > > > >
> > > > > > > > > expressed.
> > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > > > > > On Tue, Aug 25, 2020 at 3:26 AM Jasonstack Zhao Yang <
> > > > > > >
> > > > > > > > > jasonstack.z...@gmail.com> wrote:
> > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > > > > > > >>> SASI's performance, specifically the search in the B+
> > > tree
> > > > > > >
> > > > > > > > component,
> > > > > > >
> > > > > > > > > > >>> depends a lot on the component file's header being
> > > available
> > > > > in
> > > > > > > the
> > > > > > >
> > > > > > > > > > >>> pagecache. SASI benefits from (needs) nodes with
> lots of
> > > RAM.
> > > > > > Is
> > > > > > >
> > > > > > > > SAI
> > > > > > >
> > > > > > > > > > bound
> > > > > > >
> > > > > > > > > > >>> to this same or similar limitation?
> > > > > > >
> > > > > > > > > >
> > > > > > >
> > > > > > > > > > SAI also benefits from larger memory because SAI puts
> block
> > > info
> > > > > on
> > > > > > >
> > > > > > > > heap
> > > > > > >
> > > > > > > > > > for searching on-disk components and having cross-index
> > > files on
> > > > > > page
> > > > > > >
> > > > > > > > > cache
> > > > > > >
> > > > > > > > > > improves read performance of different indexes on the
> same
> > > table.
> > > > > > >
> > > > > > > > > >
> > > > > > >
> > > > > > > > > >
> > > > > > >
> > > > > > > > > > >>> Flushing of SASI can be CPU+IO intensive, to the
> point of
> > > > > > >
> > > > > > > > saturation,
> > > > > > >
> > > > > > > > > > >>> pauses, and crashes on the node. SSDs are a must,
> along
> > > with
> > > > > a
> > > > > > > bit
> > > > > > >
> > > > > > > > of
> > > > > > >
> > > > > > > > > > >>> tuning, just to avoid bringing down your cluster.
> Beyond
> > > > 

Re: [DISCUSS] CEP-7 Storage Attached Index

2020-08-31 Thread Charles Cao
Thank you, Patrick!

On Mon, Aug 31, 2020 at 12:59 PM Patrick McFadin  wrote:
>
> I just moved it to 8AM for this meeting to better accommodate APAC. Please
> see the update here:
> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
>
> Patrick
>
> On Mon, Aug 31, 2020 at 10:04 AM Charles Cao  wrote:
>
> > Patrick,
> >
> > 11AM PST is a bad time for the people in the APAC timezone. Can we
> > move it to 7 or 8AM PST in the morning to accommodate their needs ?
> >
> > ~Charles
> >
> > On Fri, Aug 28, 2020 at 4:37 PM Patrick McFadin 
> > wrote:
> > >
> > > Meeting scheduled.
> > >
> > https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
> > >
> > > Tuesday September 1st, 11AM PST. I added a basic bullet for the agenda
> > but
> > > if there is more, edit away.
> > >
> > > Patrick
> > >
> > > On Thu, Aug 27, 2020 at 11:31 AM Jasonstack Zhao Yang <
> > > jasonstack.z...@gmail.com> wrote:
> > >
> > > > +1
> > > >
> > > > On Thu, 27 Aug 2020 at 04:52, Ekaterina Dimitrova <
> > e.dimitr...@gmail.com>
> > > > wrote:
> > > >
> > > > > +1
> > > > >
> > > > > On Wed, 26 Aug 2020 at 16:48, Caleb Rackliffe <
> > calebrackli...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > +1
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Aug 26, 2020, 3:45 PM Patrick McFadin 
> > > > > wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > > This is related to the discussion Jordan and I had about the
> > > > > contributor
> > > > > >
> > > > > > > Zoom call. Instead of open mic for any issue, call it based on a
> > > > > > discussion
> > > > > >
> > > > > > > thread or threads for higher bandwidth discussion.
> > > > > >
> > > > > > >
> > > > > >
> > > > > > > I would be happy to schedule on for next week to specifically
> > discuss
> > > > > >
> > > > > > > CEP-7. I can attach the recorded call to the CEP after.
> > > > > >
> > > > > > >
> > > > > >
> > > > > > > +1 or -1?
> > > > > >
> > > > > > >
> > > > > >
> > > > > > > Patrick
> > > > > >
> > > > > > >
> > > > > >
> > > > > > > On Tue, Aug 25, 2020 at 7:03 AM Joshua McKenzie <
> > > > jmcken...@apache.org>
> > > > > >
> > > > > > > wrote:
> > > > > >
> > > > > > >
> > > > > >
> > > > > > > > >
> > > > > >
> > > > > > > > > Does community plan to open another discussion or CEP on
> > > > > >
> > > > > > > modularization?
> > > > > >
> > > > > > > >
> > > > > >
> > > > > > > > We probably should have a discussion on the ML or monthly
> > contrib
> > > > > call
> > > > > >
> > > > > > > > about it first to see how aligned the interested contributors
> > are.
> > > > > > Could
> > > > > >
> > > > > > > do
> > > > > >
> > > > > > > > that through CEP as well but CEP's (at least thus far sans k8s
> > > > > > operator)
> > > > > >
> > > > > > > > tend to start with a strong, deeply thought out point of view
> > being
> > > > > >
> > > > > > > > expressed.
> > > > > >
> > > > > > > >
> > > > > >
> > > > > > > > On Tue, Aug 25, 2020 at 3:26 AM Jasonstack Zhao Yang <
> > > > > >
> > > > > > > > jasonstack.z...@gmail.com> wrote:
> > > > > >
> > > > > > > >
> > > > > >
> > > > > > > > > >>> SASI's performance, specifically the search in the B+
> > tree
> > > > > >
> > > > > > > component,
> > > > > >
> > > > > > > > > >>> depends a lot on the component file's header being
> > available
> > > > in
> > > > > > the
> > > > > >
> > > > > > > > > >>> pagecache. SASI benefits from (needs) nodes with lots of
> > RAM.
> > > > > Is
> > > > > >
> > > > > > > SAI
> > > > > >
> > > > > > > > > bound
> > > > > >
> > > > > > > > > >>> to this same or similar limitation?
> > > > > >
> > > > > > > > >
> > > > > >
> > > > > > > > > SAI also benefits from larger memory because SAI puts block
> > info
> > > > on
> > > > > >
> > > > > > > heap
> > > > > >
> > > > > > > > > for searching on-disk components and having cross-index
> > files on
> > > > > page
> > > > > >
> > > > > > > > cache
> > > > > >
> > > > > > > > > improves read performance of different indexes on the same
> > table.
> > > > > >
> > > > > > > > >
> > > > > >
> > > > > > > > >
> > > > > >
> > > > > > > > > >>> Flushing of SASI can be CPU+IO intensive, to the point of
> > > > > >
> > > > > > > saturation,
> > > > > >
> > > > > > > > > >>> pauses, and crashes on the node. SSDs are a must, along
> > with
> > > > a
> > > > > > bit
> > > > > >
> > > > > > > of
> > > > > >
> > > > > > > > > >>> tuning, just to avoid bringing down your cluster. Beyond
> > > > > reducing
> > > > > >
> > > > > > > > space
> > > > > >
> > > > > > > > > >>> requirements, does SAI improve on these things? Like
> > SASI how
> > > > > > does
> > > > > >
> > > > > > > > SAI,
> > > > > >
> > > > > > > > > in
> > > > > >
> > > > > > > > > >>> its own way, change/narrow the recommendations on node
> > > > hardware
> > > > > >
> > > > > > > > specs?
> > > > > >
> > > > > > > > >
> > > > > >
> > > > > > > > > SAI won't crash the node during compaction and requires 

Re: [DISCUSS] CEP-7 Storage Attached Index

2020-08-31 Thread Patrick McFadin
I just moved it to 8AM for this meeting to better accommodate APAC. Please
see the update here:
https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting

Patrick

On Mon, Aug 31, 2020 at 10:04 AM Charles Cao  wrote:

> Patrick,
>
> 11AM PST is a bad time for the people in the APAC timezone. Can we
> move it to 7 or 8AM PST in the morning to accommodate their needs?
>
> ~Charles
>
> On Fri, Aug 28, 2020 at 4:37 PM Patrick McFadin wrote:
> >
> > Meeting scheduled.
> > https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
> >
> > Tuesday September 1st, 11AM PST. I added a basic bullet for the agenda
> > but if there is more, edit away.
> >
> > Patrick
> >
> > On Thu, Aug 27, 2020 at 11:31 AM Jasonstack Zhao Yang <
> > jasonstack.z...@gmail.com> wrote:
> >
> > > +1
> > >
> > > On Thu, 27 Aug 2020 at 04:52, Ekaterina Dimitrova
> > > <e.dimitr...@gmail.com> wrote:
> > >
> > > > +1
> > > >
> > > > On Wed, 26 Aug 2020 at 16:48, Caleb Rackliffe
> > > > <calebrackli...@gmail.com> wrote:
> > > >
> > > > > +1
> > > > >
> > > > > On Wed, Aug 26, 2020, 3:45 PM Patrick McFadin wrote:
> > > > >
> > > > > > This is related to the discussion Jordan and I had about the
> > > > > > contributor Zoom call. Instead of open mic for any issue, call
> > > > > > it based on a discussion thread or threads for higher bandwidth
> > > > > > discussion.
> > > > > >
> > > > > > I would be happy to schedule one for next week to specifically
> > > > > > discuss CEP-7. I can attach the recorded call to the CEP after.
> > > > > >
> > > > > > +1 or -1?
> > > > > >
> > > > > > Patrick
> > > > > >
> > > > > > On Tue, Aug 25, 2020 at 7:03 AM Joshua McKenzie <
> > > > > > jmcken...@apache.org> wrote:
> > > > > >
> > > > > > > > Does community plan to open another discussion or CEP on
> > > > > > > > modularization?
> > > > > > >
> > > > > > > We probably should have a discussion on the ML or monthly
> > > > > > > contrib call about it first to see how aligned the interested
> > > > > > > contributors are. Could do that through CEP as well but CEP's
> > > > > > > (at least thus far sans k8s operator) tend to start with a
> > > > > > > strong, deeply thought out point of view being expressed.
> > > > > > >
> > > > > > > On Tue, Aug 25, 2020 at 3:26 AM Jasonstack Zhao Yang <
> > > > > > > jasonstack.z...@gmail.com> wrote:
> > > > > > >
> > > > > > > > >>> SASI's performance, specifically the search in the B+ tree
> > > > > > > > >>> component, depends a lot on the component file's header
> > > > > > > > >>> being available in the pagecache. SASI benefits from
> > > > > > > > >>> (needs) nodes with lots of RAM. Is SAI bound to this same
> > > > > > > > >>> or similar limitation?
> > > > > > > >
> > > > > > > > SAI also benefits from larger memory because SAI puts block
> > > > > > > > info on heap for searching on-disk components and having
> > > > > > > > cross-index files on page cache improves read performance of
> > > > > > > > different indexes on the same table.
> > > > > > > >
> > > > > > > > >>> Flushing of SASI can be CPU+IO intensive, to the point of
> > > > > > > > >>> saturation, pauses, and crashes on the node. SSDs are a
> > > > > > > > >>> must, along with a bit of tuning, just to avoid bringing
> > > > > > > > >>> down your cluster. Beyond reducing space requirements, does
> > > > > > > > >>> SAI improve on these things? Like SASI how does SAI, in its
> > > > > > > > >>> own way, change/narrow the recommendations on node hardware
> > > > > > > > >>> specs?
> > > > > > > >
> > > > > > > > SAI won't crash the node during compaction and requires less
> > > > > > > > CPU/IO.
> > > > > > > >
> > > > > > > > * SAI defines global memory limit for compaction instead of
> > > > > > > >   per-index memory limit used by SASI.
> > > > > > > >   For example, compactions are running on 10 tables and each
> > > > > > > >   has 10 indexes. SAI will cap the memory usage with global
> > > > > > > >   limit while SASI may use up to 100 *

Re: [DISCUSS] CEP-7 Storage Attached Index

2020-08-31 Thread Charles Cao
Patrick,

11AM PST is a bad time for the people in the APAC timezone. Can we
move it to 7 or 8AM PST in the morning to accommodate their needs?

~Charles

On Fri, Aug 28, 2020 at 4:37 PM Patrick McFadin  wrote:
>
> Meeting scheduled.
> https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting
>
> Tuesday September 1st, 11AM PST. I added a basic bullet for the agenda but
> if there is more, edit away.
>
> Patrick
>
> On Thu, Aug 27, 2020 at 11:31 AM Jasonstack Zhao Yang <
> jasonstack.z...@gmail.com> wrote:
>
> > +1
> >
> > On Thu, 27 Aug 2020 at 04:52, Ekaterina Dimitrova 
> > wrote:
> >
> > > +1
> > >
> > > On Wed, 26 Aug 2020 at 16:48, Caleb Rackliffe 
> > > wrote:
> > >
> > > > +1
> > > >
> > > > On Wed, Aug 26, 2020, 3:45 PM Patrick McFadin wrote:
> > > >
> > > > > This is related to the discussion Jordan and I had about the
> > > > > contributor Zoom call. Instead of open mic for any issue, call
> > > > > it based on a discussion thread or threads for higher bandwidth
> > > > > discussion.
> > > > >
> > > > > I would be happy to schedule one for next week to specifically
> > > > > discuss CEP-7. I can attach the recorded call to the CEP after.
> > > > >
> > > > > +1 or -1?
> > > > >
> > > > > Patrick
> > > > >
> > > > > On Tue, Aug 25, 2020 at 7:03 AM Joshua McKenzie <
> > > > > jmcken...@apache.org> wrote:
> > > > >
> > > > > > > Does community plan to open another discussion or CEP on
> > > > > > > modularization?
> > > > > >
> > > > > > We probably should have a discussion on the ML or monthly
> > > > > > contrib call about it first to see how aligned the interested
> > > > > > contributors are. Could do that through CEP as well but CEP's
> > > > > > (at least thus far sans k8s operator) tend to start with a
> > > > > > strong, deeply thought out point of view being expressed.
> > > > > >
> > > > > > On Tue, Aug 25, 2020 at 3:26 AM Jasonstack Zhao Yang <
> > > > > > jasonstack.z...@gmail.com> wrote:
> > > > > >
> > > > > > > >>> SASI's performance, specifically the search in the B+ tree
> > > > > > > >>> component, depends a lot on the component file's header
> > > > > > > >>> being available in the pagecache. SASI benefits from
> > > > > > > >>> (needs) nodes with lots of RAM. Is SAI bound to this same
> > > > > > > >>> or similar limitation?
> > > > > > >
> > > > > > > SAI also benefits from larger memory because SAI puts block
> > > > > > > info on heap for searching on-disk components and having
> > > > > > > cross-index files on page cache improves read performance of
> > > > > > > different indexes on the same table.
> > > > > > >
> > > > > > > >>> Flushing of SASI can be CPU+IO intensive, to the point of
> > > > > > > >>> saturation, pauses, and crashes on the node. SSDs are a
> > > > > > > >>> must, along with a bit of tuning, just to avoid bringing
> > > > > > > >>> down your cluster. Beyond reducing space requirements, does
> > > > > > > >>> SAI improve on these things? Like SASI how does SAI, in its
> > > > > > > >>> own way, change/narrow the recommendations on node hardware
> > > > > > > >>> specs?
> > > > > > >
> > > > > > > SAI won't crash the node during compaction and requires less
> > > > > > > CPU/IO.
> > > > > > >
> > > > > > > * SAI defines global memory limit for compaction instead of
> > > > > > >   per-index memory limit used by SASI.
> > > > > > >   For example, compactions are running on 10 tables and each
> > > > > > >   has 10 indexes. SAI will cap the memory usage with global
> > > > > > >   limit while SASI may use up to 100 * per-index limit.
> > > > > > >
> > > > > > > * After flushing in-memory segments to disk, SAI won't merge
> > > > > > >   on-disk segments while SASI attempts to merge them at the
> > > > > > >   end.
> > > > > > >
> > > > > > >   There are pros and cons of not merging segments:
> > > > > > > ** Pros: compaction runs faster and requires fewer resources.
> > > > > > > ** Cons: small segments reduce compression ratio.
> > > > > > >
> > > > > > > * SAI on-disk format with row ids compresses better.
> > > > > > >
> > > > > > > >>> I understand the

Re: [DISCUSS] CEP-7 Storage Attached Index

2020-08-28 Thread Patrick McFadin
Meeting scheduled.
https://cwiki.apache.org/confluence/display/CASSANDRA/2020-08-01+Apache+Cassandra+Contributor+Meeting

Tuesday September 1st, 11AM PST. I added a basic bullet for the agenda but
if there is more, edit away.

Patrick

On Thu, Aug 27, 2020 at 11:31 AM Jasonstack Zhao Yang <
jasonstack.z...@gmail.com> wrote:

> +1
>
> On Thu, 27 Aug 2020 at 04:52, Ekaterina Dimitrova 
> wrote:
>
> > +1
> >
> > On Wed, 26 Aug 2020 at 16:48, Caleb Rackliffe 
> > wrote:
> >
> > > +1
> > >
> > > On Wed, Aug 26, 2020, 3:45 PM Patrick McFadin wrote:
> > >
> > > > This is related to the discussion Jordan and I had about the
> > > > contributor Zoom call. Instead of open mic for any issue, call
> > > > it based on a discussion thread or threads for higher bandwidth
> > > > discussion.
> > > >
> > > > I would be happy to schedule one for next week to specifically
> > > > discuss CEP-7. I can attach the recorded call to the CEP after.
> > > >
> > > > +1 or -1?
> > > >
> > > > Patrick
> > > >
> > > > On Tue, Aug 25, 2020 at 7:03 AM Joshua McKenzie <
> > > > jmcken...@apache.org> wrote:
> > > >
> > > > > > Does community plan to open another discussion or CEP on
> > > > > > modularization?
> > > > >
> > > > > We probably should have a discussion on the ML or monthly
> > > > > contrib call about it first to see how aligned the interested
> > > > > contributors are. Could do that through CEP as well but CEP's
> > > > > (at least thus far sans k8s operator) tend to start with a
> > > > > strong, deeply thought out point of view being expressed.
> > > > >
> > > > > On Tue, Aug 25, 2020 at 3:26 AM Jasonstack Zhao Yang <
> > > > > jasonstack.z...@gmail.com> wrote:
> > > > >
> > > > > > >>> SASI's performance, specifically the search in the B+ tree
> > > > > > >>> component, depends a lot on the component file's header
> > > > > > >>> being available in the pagecache. SASI benefits from
> > > > > > >>> (needs) nodes with lots of RAM. Is SAI bound to this same
> > > > > > >>> or similar limitation?
> > > > > >
> > > > > > SAI also benefits from larger memory because SAI puts block
> > > > > > info on heap for searching on-disk components and having
> > > > > > cross-index files on page cache improves read performance of
> > > > > > different indexes on the same table.
> > > > > >
> > > > > > >>> Flushing of SASI can be CPU+IO intensive, to the point of
> > > > > > >>> saturation, pauses, and crashes on the node. SSDs are a
> > > > > > >>> must, along with a bit of tuning, just to avoid bringing
> > > > > > >>> down your cluster. Beyond reducing space requirements, does
> > > > > > >>> SAI improve on these things? Like SASI how does SAI, in its
> > > > > > >>> own way, change/narrow the recommendations on node hardware
> > > > > > >>> specs?
> > > > > >
> > > > > > SAI won't crash the node during compaction and requires less
> > > > > > CPU/IO.
> > > > > >
> > > > > > * SAI defines global memory limit for compaction instead of
> > > > > >   per-index memory limit used by SASI.
> > > > > >   For example, compactions are running on 10 tables and each
> > > > > >   has 10 indexes. SAI will cap the memory usage with global
> > > > > >   limit while SASI may use up to 100 * per-index limit.
> > > > > >
> > > > > > * After flushing in-memory segments to disk, SAI won't merge
> > > > > >   on-disk segments while SASI attempts to merge them at the
> > > > > >   end.
> > > > > >
> > > > > >   There are pros and cons of not merging segments:
> > > > > > ** Pros: compaction runs faster and requires fewer resources.
> > > > > > ** Cons: small segments reduce compression ratio.
> > > > > >
> > > > > > * SAI on-disk format with row ids compresses better.
> > > > > >
> > > > > > >>> I understand the desire in keeping out of scope the longer
> > > > > > >>> term deprecation and migration plan, but… if SASI provides
> > > > > > >>> functionality that SAI doesn't, like tokenisation and
> > > > > > >>> DelimiterAnalyzer, yet introduces a body of code ~somewhat
> > > > > > >>> similar, shouldn't we be roughly sketching out how to
> > > > > > >>> reduce the maintenance surface area?
> > > > > >
> > > > > > Agreed that we should reduce maintenance area if possible, but
> > > > > > only

Re: [DISCUSS] CEP-7 Storage Attached Index

2020-08-28 Thread Jason Rutherglen
+1

On Thu, Aug 27, 2020 at 1:31 PM Jasonstack Zhao Yang
 wrote:
>
> +1
>
> On Thu, 27 Aug 2020 at 04:52, Ekaterina Dimitrova 
> wrote:
>
> > +1
> >
> > On Wed, 26 Aug 2020 at 16:48, Caleb Rackliffe 
> > wrote:
> >
> > > +1
> > >
> > > On Wed, Aug 26, 2020, 3:45 PM Patrick McFadin wrote:
> > >
> > > > This is related to the discussion Jordan and I had about the
> > > > contributor Zoom call. Instead of open mic for any issue, call
> > > > it based on a discussion thread or threads for higher bandwidth
> > > > discussion.
> > > >
> > > > I would be happy to schedule one for next week to specifically
> > > > discuss CEP-7. I can attach the recorded call to the CEP after.
> > > >
> > > > +1 or -1?
> > > >
> > > > Patrick
> > > >
> > > > On Tue, Aug 25, 2020 at 7:03 AM Joshua McKenzie wrote:
> > > >
> > > > > > Does community plan to open another discussion or CEP on
> > > > > > modularization?
> > > > >
> > > > > We probably should have a discussion on the ML or monthly
> > > > > contrib call about it first to see how aligned the interested
> > > > > contributors are. Could do that through CEP as well but CEP's
> > > > > (at least thus far sans k8s operator) tend to start with a
> > > > > strong, deeply thought out point of view being expressed.
> > > > >
> > > > > On Tue, Aug 25, 2020 at 3:26 AM Jasonstack Zhao Yang <
> > > > > jasonstack.z...@gmail.com> wrote:
> > > > >
> > > > > > >>> SASI's performance, specifically the search in the B+ tree
> > > > > > >>> component, depends a lot on the component file's header
> > > > > > >>> being available in the pagecache. SASI benefits from
> > > > > > >>> (needs) nodes with lots of RAM. Is SAI bound to this same
> > > > > > >>> or similar limitation?
> > > > > >
> > > > > > SAI also benefits from larger memory because SAI puts block
> > > > > > info on heap for searching on-disk components and having
> > > > > > cross-index files on page cache improves read performance of
> > > > > > different indexes on the same table.
> > > > > >
> > > > > > >>> Flushing of SASI can be CPU+IO intensive, to the point of
> > > > > > >>> saturation, pauses, and crashes on the node. SSDs are a
> > > > > > >>> must, along with a bit of tuning, just to avoid bringing
> > > > > > >>> down your cluster. Beyond reducing space requirements, does
> > > > > > >>> SAI improve on these things? Like SASI how does SAI, in its
> > > > > > >>> own way, change/narrow the recommendations on node hardware
> > > > > > >>> specs?
> > > > > >
> > > > > > SAI won't crash the node during compaction and requires less
> > > > > > CPU/IO.
> > > > > >
> > > > > > * SAI defines global memory limit for compaction instead of
> > > > > >   per-index memory limit used by SASI.
> > > > > >   For example, compactions are running on 10 tables and each
> > > > > >   has 10 indexes. SAI will cap the memory usage with global
> > > > > >   limit while SASI may use up to 100 * per-index limit.
> > > > > >
> > > > > > * After flushing in-memory segments to disk, SAI won't merge
> > > > > >   on-disk segments while SASI attempts to merge them at the
> > > > > >   end.
> > > > > >
> > > > > >   There are pros and cons of not merging segments:
> > > > > > ** Pros: compaction runs faster and requires fewer resources.
> > > > > > ** Cons: small segments reduce compression ratio.
> > > > > >
> > > > > > * SAI on-disk format with row ids compresses better.
> > > > > >
> > > > > > >>> I understand the desire in keeping out of scope the longer
> > > > > > >>> term deprecation and migration plan, but… if SASI provides
> > > > > > >>> functionality that SAI doesn't, like tokenisation and
> > > > > > >>> DelimiterAnalyzer, yet introduces a body of code ~somewhat
> > > > > > >>> similar, shouldn't we be roughly sketching out how to
> > > > > > >>> reduce the maintenance surface area?
> > > > > >
> > > > > > Agreed that we should reduce maintenance area if possible, but
> > > > > > only very limited code base (eg. RangeIterator, QueryPlan) can
> > > > > > be shared. The rest of the code base is quite different
> > > > > > because of on-disk format and cross-index files.
> > > > > >
> > > > > > The goal of this CEP is to

Re: [DISCUSS] CEP-7 Storage Attached Index

2020-08-27 Thread Jasonstack Zhao Yang
+1

On Thu, 27 Aug 2020 at 04:52, Ekaterina Dimitrova 
wrote:

> +1
>
> On Wed, 26 Aug 2020 at 16:48, Caleb Rackliffe 
> wrote:
>
> > +1
> >
> > On Wed, Aug 26, 2020, 3:45 PM Patrick McFadin wrote:
> >
> > > This is related to the discussion Jordan and I had about the
> > > contributor Zoom call. Instead of open mic for any issue, call
> > > it based on a discussion thread or threads for higher bandwidth
> > > discussion.
> > >
> > > I would be happy to schedule one for next week to specifically
> > > discuss CEP-7. I can attach the recorded call to the CEP after.
> > >
> > > +1 or -1?
> > >
> > > Patrick
> > >
> > > On Tue, Aug 25, 2020 at 7:03 AM Joshua McKenzie wrote:
> > >
> > > > > Does community plan to open another discussion or CEP on
> > > > > modularization?
> > > >
> > > > We probably should have a discussion on the ML or monthly
> > > > contrib call about it first to see how aligned the interested
> > > > contributors are. Could do that through CEP as well but CEP's
> > > > (at least thus far sans k8s operator) tend to start with a
> > > > strong, deeply thought out point of view being expressed.
> > > >
> > > > On Tue, Aug 25, 2020 at 3:26 AM Jasonstack Zhao Yang <
> > > > jasonstack.z...@gmail.com> wrote:
> > > >
> > > > > >>> SASI's performance, specifically the search in the B+ tree
> > > > > >>> component, depends a lot on the component file's header
> > > > > >>> being available in the pagecache. SASI benefits from
> > > > > >>> (needs) nodes with lots of RAM. Is SAI bound to this same
> > > > > >>> or similar limitation?
> > > > >
> > > > > SAI also benefits from larger memory because SAI puts block
> > > > > info on heap for searching on-disk components and having
> > > > > cross-index files on page cache improves read performance of
> > > > > different indexes on the same table.
> > > > >
> > > > > >>> Flushing of SASI can be CPU+IO intensive, to the point of
> > > > > >>> saturation, pauses, and crashes on the node. SSDs are a
> > > > > >>> must, along with a bit of tuning, just to avoid bringing
> > > > > >>> down your cluster. Beyond reducing space requirements, does
> > > > > >>> SAI improve on these things? Like SASI how does SAI, in its
> > > > > >>> own way, change/narrow the recommendations on node hardware
> > > > > >>> specs?
> > > > >
> > > > > SAI won't crash the node during compaction and requires less
> > > > > CPU/IO.
> > > > >
> > > > > * SAI defines global memory limit for compaction instead of
> > > > >   per-index memory limit used by SASI.
> > > > >   For example, compactions are running on 10 tables and each
> > > > >   has 10 indexes. SAI will cap the memory usage with global
> > > > >   limit while SASI may use up to 100 * per-index limit.
> > > > >
> > > > > * After flushing in-memory segments to disk, SAI won't merge
> > > > >   on-disk segments while SASI attempts to merge them at the
> > > > >   end.
> > > > >
> > > > >   There are pros and cons of not merging segments:
> > > > > ** Pros: compaction runs faster and requires fewer resources.
> > > > > ** Cons: small segments reduce compression ratio.
> > > > >
> > > > > * SAI on-disk format with row ids compresses better.
> > > > >
> > > > > >>> I understand the desire in keeping out of scope the longer
> > > > > >>> term deprecation and migration plan, but… if SASI provides
> > > > > >>> functionality that SAI doesn't, like tokenisation and
> > > > > >>> DelimiterAnalyzer, yet introduces a body of code ~somewhat
> > > > > >>> similar, shouldn't we be roughly sketching out how to
> > > > > >>> reduce the maintenance surface area?
> > > > >
> > > > > Agreed that we should reduce maintenance area if possible, but
> > > > > only very limited code base (eg. RangeIterator, QueryPlan) can
> > > > > be shared. The rest of the code base is quite different
> > > > > because of on-disk format and cross-index files.
> > > > >
> > > > > The goal of this CEP is to get community buy-in on SAI's
> > > > > design. Tokenization, DelimiterAnalyzer should be
> > > > > straightforward to implement on top of SAI.
> > > > >
> > > > > >>> Can we list what configurations of SASI will become
> > > > > >>> deprecated once SAI becomes non-experimental?
> > > > >
> > > > > Except for "Like", "Tokenisation", "DelimiterAnalyzer", the
> > > > > rest of SASI can be replaced by SAI.
> > > > >
> > > > > >>>

Re: [DISCUSS] CEP-7 Storage Attached Index

2020-08-26 Thread Ekaterina Dimitrova
+1

On Wed, 26 Aug 2020 at 16:48, Caleb Rackliffe 
wrote:

> +1
>
> On Wed, Aug 26, 2020, 3:45 PM Patrick McFadin wrote:
>
> > This is related to the discussion Jordan and I had about the
> > contributor Zoom call. Instead of open mic for any issue, call
> > it based on a discussion thread or threads for higher bandwidth
> > discussion.
> >
> > I would be happy to schedule one for next week to specifically
> > discuss CEP-7. I can attach the recorded call to the CEP after.
> >
> > +1 or -1?
> >
> > Patrick
> >
> > On Tue, Aug 25, 2020 at 7:03 AM Joshua McKenzie wrote:
> >
> > > > Does community plan to open another discussion or CEP on
> > > > modularization?
> > >
> > > We probably should have a discussion on the ML or monthly
> > > contrib call about it first to see how aligned the interested
> > > contributors are. Could do that through CEP as well but CEP's
> > > (at least thus far sans k8s operator) tend to start with a
> > > strong, deeply thought out point of view being expressed.
> > >
> > > On Tue, Aug 25, 2020 at 3:26 AM Jasonstack Zhao Yang <
> > > jasonstack.z...@gmail.com> wrote:
> > >
> > > > >>> SASI's performance, specifically the search in the B+ tree
> > > > >>> component, depends a lot on the component file's header
> > > > >>> being available in the pagecache. SASI benefits from
> > > > >>> (needs) nodes with lots of RAM. Is SAI bound to this same
> > > > >>> or similar limitation?
> > > >
> > > > SAI also benefits from larger memory because SAI puts block
> > > > info on heap for searching on-disk components and having
> > > > cross-index files on page cache improves read performance of
> > > > different indexes on the same table.
> > > >
> > > > >>> Flushing of SASI can be CPU+IO intensive, to the point of
> > > > >>> saturation, pauses, and crashes on the node. SSDs are a
> > > > >>> must, along with a bit of tuning, just to avoid bringing
> > > > >>> down your cluster. Beyond reducing space requirements, does
> > > > >>> SAI improve on these things? Like SASI how does SAI, in its
> > > > >>> own way, change/narrow the recommendations on node hardware
> > > > >>> specs?
> > > >
> > > > SAI won't crash the node during compaction and requires less
> > > > CPU/IO.
> > > >
> > > > * SAI defines global memory limit for compaction instead of
> > > >   per-index memory limit used by SASI.
> > > >   For example, compactions are running on 10 tables and each
> > > >   has 10 indexes. SAI will cap the memory usage with global
> > > >   limit while SASI may use up to 100 * per-index limit.
> > > >
> > > > * After flushing in-memory segments to disk, SAI won't merge
> > > >   on-disk segments while SASI attempts to merge them at the
> > > >   end.
> > > >
> > > >   There are pros and cons of not merging segments:
> > > > ** Pros: compaction runs faster and requires fewer resources.
> > > > ** Cons: small segments reduce compression ratio.
> > > >
> > > > * SAI on-disk format with row ids compresses better.
> > > >
> > > > >>> I understand the desire in keeping out of scope the longer
> > > > >>> term deprecation and migration plan, but… if SASI provides
> > > > >>> functionality that SAI doesn't, like tokenisation and
> > > > >>> DelimiterAnalyzer, yet introduces a body of code ~somewhat
> > > > >>> similar, shouldn't we be roughly sketching out how to
> > > > >>> reduce the maintenance surface area?
> > > >
> > > > Agreed that we should reduce maintenance area if possible, but
> > > > only very limited code base (eg. RangeIterator, QueryPlan) can
> > > > be shared. The rest of the code base is quite different
> > > > because of on-disk format and cross-index files.
> > > >
> > > > The goal of this CEP is to get community buy-in on SAI's
> > > > design. Tokenization, DelimiterAnalyzer should be
> > > > straightforward to implement on top of SAI.
> > > >
> > > > >>> Can we list what configurations of SASI will become
> > > > >>> deprecated once SAI becomes non-experimental?
> > > >
> > > > Except for "Like", "Tokenisation", "DelimiterAnalyzer", the
> > > > rest of SASI can be replaced by SAI.
> > > >
> > > > >>> Given a few bugs are open against 2i and SASI, can we
> > > > >>> provide some overview, or rough indication, of how many of
> > > > >>> them we could "triage away"?
> > > >
> > > > I believe most of the known bugs in 2i/SASI either have been
> > > > addressed in SAI or don't apply to SAI.
> > > >
> > > > >>> And, is it time for the project to start introducing new
> > > > >>> SPI implementations as separate sub-modules and jar files
> > > > >>> that are only loaded at runtime based on configuration

Re: [DISCUSS] CEP-7 Storage Attached Index

2020-08-26 Thread Caleb Rackliffe
+1

On Wed, Aug 26, 2020, 3:45 PM Patrick McFadin  wrote:

> This is related to the discussion Jordan and I had about the contributor
> Zoom call. Instead of open mic for any issue, call it based on a discussion
> thread or threads for higher bandwidth discussion.
>
> I would be happy to schedule one for next week to specifically discuss
> CEP-7. I can attach the recorded call to the CEP after.
>
> +1 or -1?
>
> Patrick
>
> On Tue, Aug 25, 2020 at 7:03 AM Joshua McKenzie 
> wrote:
>
> > >
> > > Does community plan to open another discussion or CEP on
> > > modularization?
> >
> > We probably should have a discussion on the ML or monthly
> > contrib call about it first to see how aligned the interested
> > contributors are. Could do that through CEP as well but CEP's
> > (at least thus far sans k8s operator) tend to start with a
> > strong, deeply thought out point of view being expressed.
> >
> > On Tue, Aug 25, 2020 at 3:26 AM Jasonstack Zhao Yang <
> > jasonstack.z...@gmail.com> wrote:
> >
> > > >>> SASI's performance, specifically the search in the B+ tree
> > > >>> component, depends a lot on the component file's header
> > > >>> being available in the pagecache. SASI benefits from
> > > >>> (needs) nodes with lots of RAM. Is SAI bound to this same
> > > >>> or similar limitation?
> > >
> > > SAI also benefits from larger memory because SAI puts block
> > > info on heap for searching on-disk components and having
> > > cross-index files on page cache improves read performance of
> > > different indexes on the same table.
> > >
> > > >>> Flushing of SASI can be CPU+IO intensive, to the point of
> > > >>> saturation, pauses, and crashes on the node. SSDs are a
> > > >>> must, along with a bit of tuning, just to avoid bringing
> > > >>> down your cluster. Beyond reducing space requirements, does
> > > >>> SAI improve on these things? Like SASI how does SAI, in its
> > > >>> own way, change/narrow the recommendations on node hardware
> > > >>> specs?
> > >
> > > SAI won't crash the node during compaction and requires less
> > > CPU/IO.
> > >
> > > * SAI defines global memory limit for compaction instead of
> > >   per-index memory limit used by SASI.
> > >   For example, compactions are running on 10 tables and each
> > >   has 10 indexes. SAI will cap the memory usage with global
> > >   limit while SASI may use up to 100 * per-index limit.
> > >
> > > * After flushing in-memory segments to disk, SAI won't merge
> > >   on-disk segments while SASI attempts to merge them at the
> > >   end.
> > >
> > >   There are pros and cons of not merging segments:
> > > ** Pros: compaction runs faster and requires fewer resources.
> > > ** Cons: small segments reduce compression ratio.
> > >
> > > * SAI on-disk format with row ids compresses better.
> > >
> > > >>> I understand the desire in keeping out of scope the longer
> > > >>> term deprecation and migration plan, but… if SASI provides
> > > >>> functionality that SAI doesn't, like tokenisation and
> > > >>> DelimiterAnalyzer, yet introduces a body of code ~somewhat
> > > >>> similar, shouldn't we be roughly sketching out how to
> > > >>> reduce the maintenance surface area?
> > >
> > > Agreed that we should reduce maintenance area if possible, but
> > > only very limited code base (eg. RangeIterator, QueryPlan) can
> > > be shared. The rest of the code base is quite different
> > > because of on-disk format and cross-index files.
> > >
> > > The goal of this CEP is to get community buy-in on SAI's
> > > design. Tokenization, DelimiterAnalyzer should be
> > > straightforward to implement on top of SAI.
> > >
> > > >>> Can we list what configurations of SASI will become
> > > >>> deprecated once SAI becomes non-experimental?
> > >
> > > Except for "Like", "Tokenisation", "DelimiterAnalyzer", the
> > > rest of SASI can be replaced by SAI.
> > >
> > > >>> Given a few bugs are open against 2i and SASI, can we
> > > >>> provide some overview, or rough indication, of how many of
> > > >>> them we could "triage away"?
> > >
> > > I believe most of the known bugs in 2i/SASI either have been addressed
> in
> > > SAI or
> > > don't apply to SAI.
> > >
> > > >>> And, is it time for the project to start introducing new SPI
> > > >>> implementations as separate sub-modules and jar files that are only
> > > loaded
> > > >>> at runtime based on configuration settings? (sorry for the
> conflation
> > > on
> > > >>> this one, but maybe it's the right time to raise it :shrug:)
> > >
> > > Agreed that modularization is the way to go and will speed up module
> > > development speed.
> > >
> > > Does community plan to open another discussion or CEP on
> modularization?
> > >
> > >
> > > On Mon, 24 Aug 2020 at 16:43, Mick Semb Wever  wrote:
> > >
> > > > Adding to Duy's questions…
> > > >
> > > >
> > > > * Hardware specs
> > > >
> > > > SASI's performance, specifically the search in the B+ tree component,
> > > > depends a lot on the 

Re: [DISCUSS] CEP-7 Storage Attached Index

2020-08-26 Thread Patrick McFadin
This is related to the discussion Jordan and I had about the contributor
Zoom call. Instead of open mic for any issue, call it based on a discussion
thread or threads for higher bandwidth discussion.

I would be happy to schedule one for next week to specifically discuss
CEP-7. I can attach the recorded call to the CEP after.

+1 or -1?

Patrick

On Tue, Aug 25, 2020 at 7:03 AM Joshua McKenzie 
wrote:

> >
> > Does community plan to open another discussion or CEP on modularization?
>
> We probably should have a discussion on the ML or monthly contrib call
> about it first to see how aligned the interested contributors are. Could do
> that through CEP as well but CEP's (at least thus far sans k8s operator)
> tend to start with a strong, deeply thought out point of view being
> expressed.
>
> On Tue, Aug 25, 2020 at 3:26 AM Jasonstack Zhao Yang <
> jasonstack.z...@gmail.com> wrote:
>
> > >>> SASI's performance, specifically the search in the B+ tree component,
> > >>> depends a lot on the component file's header being available in the
> > >>> pagecache. SASI benefits from (needs) nodes with lots of RAM. Is SAI
> > bound
> > >>> to this same or similar limitation?
> >
> > SAI also benefits from larger memory because SAI puts block info on heap
> > for searching on-disk components and having cross-index files on page
> cache
> > improves read performance of different indexes on the same table.
> >
> >
> > >>> Flushing of SASI can be CPU+IO intensive, to the point of saturation,
> > >>> pauses, and crashes on the node. SSDs are a must, along with a bit of
> > >>> tuning, just to avoid bringing down your cluster. Beyond reducing
> space
> > >>> requirements, does SAI improve on these things? Like SASI how does
> SAI,
> > in
> > >>> its own way, change/narrow the recommendations on node hardware
> specs?
> >
> > SAI won't crash the node during compaction and requires less CPU/IO.
> >
> > * SAI defines global memory limit for compaction instead of per-index
> > memory limit used by SASI.
> >   For example, compactions are running on 10 tables and each has 10
> > indexes. SAI will cap the
> >   memory usage with global limit while SASI may use up to 100 * per-index
> > limit.
> >
> > * After flushing in-memory segments to disk, SAI won't merge on-disk
> > segments while SASI
> >   attempts to merge them at the end.
> >
> >   There are pros and cons of not merging segments:
> > ** Pros: compaction runs faster and requires fewer resources.
> > ** Cons: small segments reduce compression ratio.
> >
> > * SAI on-disk format with row ids compresses better.
> >
> >
> > >>> I understand the desire in keeping out of scope the longer term
> > deprecation
> > >>> and migration plan, but… if SASI provides functionality that SAI
> > doesn't,
> > >>> like tokenisation and DelimiterAnalyzer, yet introduces a body of
> code
> > >>> ~somewhat similar, shouldn't we be roughly sketching out how to
> reduce
> > the
> > >>> maintenance surface area?
> >
> > Agreed that we should reduce maintenance area if possible, but only very
> > limited
> > code base (eg. RangeIterator, QueryPlan) can be shared. The rest of the
> > code base
> > is quite different because of on-disk format and cross-index files.
> >
> > The goal of this CEP is to get community buy-in on SAI's design.
> > Tokenization,
> > DelimiterAnalyzer should be straightforward to implement on top of SAI.
> >
> > >>> Can we list what configurations of SASI will become deprecated once
> SAI
> > >>> becomes non-experimental?
> >
> > Except for "Like", "Tokenisation", "DelimiterAnalyzer", the rest of SASI
> > can
> > be replaced by SAI.
> >
> > >>> Given a few bugs are open against 2i and SASI, can we provide some
> > >>> overview, or rough indication, of how many of them we could "triage
> > away"?
> >
> > I believe most of the known bugs in 2i/SASI either have been addressed in
> > SAI or
> > don't apply to SAI.
> >
> > >>> And, is it time for the project to start introducing new SPI
> > >>> implementations as separate sub-modules and jar files that are only
> > loaded
> > >>> at runtime based on configuration settings? (sorry for the conflation
> > on
> > >>> this one, but maybe it's the right time to raise it :shrug:)
> >
> > Agreed that modularization is the way to go and will speed up module
> > development speed.
> >
> > Does community plan to open another discussion or CEP on modularization?
> >
> >
> > On Mon, 24 Aug 2020 at 16:43, Mick Semb Wever  wrote:
> >
> > > Adding to Duy's questions…
> > >
> > >
> > > * Hardware specs
> > >
> > > SASI's performance, specifically the search in the B+ tree component,
> > > depends a lot on the component file's header being available in the
> > > pagecache. SASI benefits from (needs) nodes with lots of RAM. Is SAI
> > bound
> > > to this same or similar limitation?
> > >
> > > Flushing of SASI can be CPU+IO intensive, to the point of saturation,
> > > pauses, and crashes on the node. SSDs are a must, along with a bit of
> > > 

Re: [DISCUSS] CEP-7 Storage Attached Index

2020-08-25 Thread Joshua McKenzie
>
> Does community plan to open another discussion or CEP on modularization?

We probably should have a discussion on the ML or monthly contrib call
about it first to see how aligned the interested contributors are. Could do
that through a CEP as well, but CEPs (at least thus far, sans k8s operator)
tend to start with a strong, deeply thought out point of view being
expressed.

On Tue, Aug 25, 2020 at 3:26 AM Jasonstack Zhao Yang <
jasonstack.z...@gmail.com> wrote:

> >>> SASI's performance, specifically the search in the B+ tree component,
> >>> depends a lot on the component file's header being available in the
> >>> pagecache. SASI benefits from (needs) nodes with lots of RAM. Is SAI
> bound
> >>> to this same or similar limitation?
>
> SAI also benefits from larger memory because SAI puts block info on heap
> for searching on-disk components and having cross-index files on page cache
> improves read performance of different indexes on the same table.
>
>
> >>> Flushing of SASI can be CPU+IO intensive, to the point of saturation,
> >>> pauses, and crashes on the node. SSDs are a must, along with a bit of
> >>> tuning, just to avoid bringing down your cluster. Beyond reducing space
> >>> requirements, does SAI improve on these things? Like SASI how does SAI,
> in
> >>> its own way, change/narrow the recommendations on node hardware specs?
>
> SAI won't crash the node during compaction and requires less CPU/IO.
>
> * SAI defines global memory limit for compaction instead of per-index
> memory limit used by SASI.
>   For example, compactions are running on 10 tables and each has 10
> indexes. SAI will cap the
>   memory usage with global limit while SASI may use up to 100 * per-index
> limit.
>
> * After flushing in-memory segments to disk, SAI won't merge on-disk
> segments while SASI
>   attempts to merge them at the end.
>
>   There are pros and cons of not merging segments:
> ** Pros: compaction runs faster and requires fewer resources.
> ** Cons: small segments reduce compression ratio.
>
> * SAI on-disk format with row ids compresses better.
>
>
> >>> I understand the desire in keeping out of scope the longer term
> deprecation
> >>> and migration plan, but… if SASI provides functionality that SAI
> doesn't,
> >>> like tokenisation and DelimiterAnalyzer, yet introduces a body of code
> >>> ~somewhat similar, shouldn't we be roughly sketching out how to reduce
> the
> >>> maintenance surface area?
>
> Agreed that we should reduce maintenance area if possible, but only very
> limited
> code base (eg. RangeIterator, QueryPlan) can be shared. The rest of the
> code base
> is quite different because of on-disk format and cross-index files.
>
> The goal of this CEP is to get community buy-in on SAI's design.
> Tokenization,
> DelimiterAnalyzer should be straightforward to implement on top of SAI.
>
> >>> Can we list what configurations of SASI will become deprecated once SAI
> >>> becomes non-experimental?
>
> Except for "Like", "Tokenisation", "DelimiterAnalyzer", the rest of SASI
> can
> be replaced by SAI.
>
> >>> Given a few bugs are open against 2i and SASI, can we provide some
> >>> overview, or rough indication, of how many of them we could "triage
> away"?
>
> I believe most of the known bugs in 2i/SASI either have been addressed in
> SAI or
> don't apply to SAI.
>
> >>> And, is it time for the project to start introducing new SPI
> >>> implementations as separate sub-modules and jar files that are only
> loaded
> >>> at runtime based on configuration settings? (sorry for the conflation
> on
> >>> this one, but maybe it's the right time to raise it :shrug:)
>
> Agreed that modularization is the way to go and will speed up module
> development speed.
>
> Does community plan to open another discussion or CEP on modularization?
>
>
> On Mon, 24 Aug 2020 at 16:43, Mick Semb Wever  wrote:
>
> > Adding to Duy's questions…
> >
> >
> > * Hardware specs
> >
> > SASI's performance, specifically the search in the B+ tree component,
> > depends a lot on the component file's header being available in the
> > pagecache. SASI benefits from (needs) nodes with lots of RAM. Is SAI
> bound
> > to this same or similar limitation?
> >
> > Flushing of SASI can be CPU+IO intensive, to the point of saturation,
> > pauses, and crashes on the node. SSDs are a must, along with a bit of
> > tuning, just to avoid bringing down your cluster. Beyond reducing space
> > requirements, does SAI improve on these things? Like SASI how does SAI,
> in
> > its own way, change/narrow the recommendations on node hardware specs?
> >
> >
> > * Code Maintenance
> >
> > I understand the desire in keeping out of scope the longer term
> deprecation
> > and migration plan, but… if SASI provides functionality that SAI doesn't,
> > like tokenisation and DelimiterAnalyzer, yet introduces a body of code
> > ~somewhat similar, shouldn't we be roughly sketching out how to reduce
> the
> > maintenance surface area?
> >
> > Can we list what 

Re: [DISCUSS] CEP-7 Storage Attached Index

2020-08-25 Thread Jasonstack Zhao Yang
>>> SASI's performance, specifically the search in the B+ tree component,
>>> depends a lot on the component file's header being available in the
>>> pagecache. SASI benefits from (needs) nodes with lots of RAM. Is SAI bound
>>> to this same or similar limitation?

SAI also benefits from more memory: it keeps block info on heap for
searching on-disk components, and having cross-index files in the page cache
improves read performance for different indexes on the same table.


>>> Flushing of SASI can be CPU+IO intensive, to the point of saturation,
>>> pauses, and crashes on the node. SSDs are a must, along with a bit of
>>> tuning, just to avoid bringing down your cluster. Beyond reducing space
>>> requirements, does SAI improve on these things? Like SASI how does SAI, in
>>> its own way, change/narrow the recommendations on node hardware specs?

SAI won't crash the node during compaction and requires less CPU/IO.

* SAI defines a global memory limit for compaction instead of the per-index
  memory limit used by SASI.
  For example, if compactions are running on 10 tables and each has 10
  indexes, SAI will cap memory usage at the global limit while SASI may use
  up to 100 * per-index limit.

* After flushing in-memory segments to disk, SAI won't merge on-disk
  segments, while SASI attempts to merge them at the end.

  There are pros and cons of not merging segments:
  ** Pros: compaction runs faster and requires fewer resources.
  ** Cons: small segments reduce compression ratio.

* SAI's on-disk format with row ids compresses better.
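
The 10 tables x 10 indexes example above can be worked through as a rough
sketch in Python. The two limit values are hypothetical placeholders chosen
for illustration, not SAI or SASI defaults, and the function names are made up
for this example.

```python
# Hypothetical numbers for illustration only; neither limit is a real default.
MIB = 1024 * 1024


def worst_case_per_index(tables: int, indexes_per_table: int,
                         per_index_limit: int) -> int:
    """Worst case when every index being compacted may use its own limit."""
    return tables * indexes_per_table * per_index_limit


def worst_case_global(global_limit: int) -> int:
    """Worst case when all index builds share one global budget."""
    return global_limit


# 10 tables, 10 indexes each, as in the example in the message.
per_index = worst_case_per_index(tables=10, indexes_per_table=10,
                                 per_index_limit=64 * MIB)
shared = worst_case_global(global_limit=1024 * MIB)
print(per_index // MIB, "MiB with per-index limits vs",
      shared // MIB, "MiB with a global limit")
```

The point of the comparison is that the per-index bound grows with the number
of indexes being compacted, while the global bound does not.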


>>> I understand the desire in keeping out of scope the longer term deprecation
>>> and migration plan, but… if SASI provides functionality that SAI doesn't,
>>> like tokenisation and DelimiterAnalyzer, yet introduces a body of code
>>> ~somewhat similar, shouldn't we be roughly sketching out how to reduce the
>>> maintenance surface area?

Agreed that we should reduce the maintenance area if possible, but only a
very limited part of the code base (e.g. RangeIterator, QueryPlan) can be
shared. The rest of the code base is quite different because of the on-disk
format and cross-index files.

The goal of this CEP is to get community buy-in on SAI's design.
Tokenization and DelimiterAnalyzer should be straightforward to implement on
top of SAI.

>>> Can we list what configurations of SASI will become deprecated once SAI
>>> becomes non-experimental?

Except for "Like", "Tokenisation", "DelimiterAnalyzer", the rest of SASI can
be replaced by SAI.

>>> Given a few bugs are open against 2i and SASI, can we provide some
>>> overview, or rough indication, of how many of them we could "triage away"?

I believe most of the known bugs in 2i/SASI either have been addressed in
SAI or don't apply to SAI.

>>> And, is it time for the project to start introducing new SPI
>>> implementations as separate sub-modules and jar files that are only loaded
>>> at runtime based on configuration settings? (sorry for the conflation on
>>> this one, but maybe it's the right time to raise it :shrug:)

Agreed that modularization is the way to go and will speed up module
development.

Does the community plan to open another discussion or CEP on modularization?


On Mon, 24 Aug 2020 at 16:43, Mick Semb Wever  wrote:

> Adding to Duy's questions…
>
>
> * Hardware specs
>
> SASI's performance, specifically the search in the B+ tree component,
> depends a lot on the component file's header being available in the
> pagecache. SASI benefits from (needs) nodes with lots of RAM. Is SAI bound
> to this same or similar limitation?
>
> Flushing of SASI can be CPU+IO intensive, to the point of saturation,
> pauses, and crashes on the node. SSDs are a must, along with a bit of
> tuning, just to avoid bringing down your cluster. Beyond reducing space
> requirements, does SAI improve on these things? Like SASI how does SAI, in
> its own way, change/narrow the recommendations on node hardware specs?
>
>
> * Code Maintenance
>
> I understand the desire in keeping out of scope the longer term deprecation
> and migration plan, but… if SASI provides functionality that SAI doesn't,
> like tokenisation and DelimiterAnalyzer, yet introduces a body of code
> ~somewhat similar, shouldn't we be roughly sketching out how to reduce the
> maintenance surface area?
>
> Can we list what configurations of SASI will become deprecated once SAI
> becomes non-experimental?
>
> Given a few bugs are open against 2i and SASI, can we provide some
> overview, or rough indication, of how many of them we could "triage away"?
>
> And, is it time for the project to start introducing new SPI
> implementations as separate sub-modules and jar files that are only loaded
> at runtime based on configuration settings? (sorry for the conflation on
> this one, but maybe it's the right time to raise it :shrug:)
>
> regards,
> Mick
>
>
> On Tue, 18 Aug 2020 at 13:05, DuyHai Doan  wrote:
>
> > Thank you Zhao Yang for starting this topic
> >
> > After reading the short design doc, I have 

Re: [DISCUSS] CEP-7 Storage Attached Index

2020-08-24 Thread Mick Semb Wever
Adding to Duy's questions…


* Hardware specs

SASI's performance, specifically the search in the B+ tree component,
depends a lot on the component file's header being available in the
pagecache. SASI benefits from (needs) nodes with lots of RAM. Is SAI bound
to this same or similar limitation?

Flushing of SASI can be CPU+IO intensive, to the point of saturation,
pauses, and crashes on the node. SSDs are a must, along with a bit of
tuning, just to avoid bringing down your cluster. Beyond reducing space
requirements, does SAI improve on these things? Like SASI how does SAI, in
its own way, change/narrow the recommendations on node hardware specs?


* Code Maintenance

I understand the desire in keeping out of scope the longer term deprecation
and migration plan, but… if SASI provides functionality that SAI doesn't,
like tokenisation and DelimiterAnalyzer, yet introduces a body of code
~somewhat similar, shouldn't we be roughly sketching out how to reduce the
maintenance surface area?

Can we list what configurations of SASI will become deprecated once SAI
becomes non-experimental?

Given a few bugs are open against 2i and SASI, can we provide some
overview, or rough indication, of how many of them we could "triage away"?

And, is it time for the project to start introducing new SPI
implementations as separate sub-modules and jar files that are only loaded
at runtime based on configuration settings? (sorry for the conflation on
this one, but maybe it's the right time to raise it :shrug:)

regards,
Mick


On Tue, 18 Aug 2020 at 13:05, DuyHai Doan  wrote:

> Thank you Zhao Yang for starting this topic
>
> After reading the short design doc, I have a few questions
>
> 1) SASI was pretty inefficient indexing wide partitions because the index
> structure only retains the partition token, not the clustering columns. As
> per design doc SAI has row id mapping to partition offset, can we hope that
> indexing wide partition will be more efficient with SAI? One detail that
> worries me is that in the beginning of the design doc, it is said that the
> matching rows are post filtered while scanning the partition. Can you
> confirm or refute that SAI is efficient with wide partitions and provides
> the partition offsets to the matching rows?
>
> 2) About space efficiency, one of the biggest drawback of SASI was the huge
> space required for index structure when using CONTAINS logic because of the
> decomposition of text columns into n-grams. Will SAI suffer from the same
> issue in future iterations ? I'm anticipating a bit
>
> 3) If I'm querying using SAI and providing complete partition key, will it
> be more efficient than querying without partition key. In other words, does
> SAI provide any optimisation when partition key is specified ?
>
> Regards
>
> Duy Hai DOAN
>
> On Tue, 18 Aug 2020 at 11:39, Mick Semb Wever  wrote:
>
> > >
> > > We are looking forward to the community's feedback and suggestions.
> > >
> >
> >
> > What comes immediately to mind is testing requirements. It has been
> > mentioned already that the project's testability and QA guidelines are
> > inadequate to successfully introduce new features and refactorings to the
> > codebase. During the 4.0 beta phase this was intended to be addressed,
> i.e.
> > defining more specific QA guidelines for 4.0-rc. This would be an
> important
> > step towards QA guidelines for all changes and CEPs post-4.0.
> >
> > Questions from me
> >  - How will this be tested, how will its QA status and lifecycle be
> > defined? (per above)
> >  - With existing C* code needing to be changed, what is the proposed plan
> > for making those changes ensuring maintained QA, e.g. is there separate
> QA
> > cycles planned for altering the SPI before adding a new SPI
> implementation?
> >  - Despite being out of scope, it would be nice to have some idea from
> the
> > CEP author of when users might still choose afresh 2i or SASI over SAI,
> >  - Who fills the roles involved? Who are the contributors in this
> DataStax
> > team? Who is the shepherd? Are there other stakeholders willing to be
> > involved?
> >  - Is there a preference to use gdoc instead of the project's wiki, and
> > why? (the CEP process suggest a wiki page, and feedback on why another
> > approach is considered better helps evolve the CEP process itself)
> >
> > cheers,
> > Mick
> >
>


Re: [DISCUSS] CEP-7 Storage Attached Index

2020-08-24 Thread Jasonstack Zhao Yang
> I think the project needs to conclude the discussions that keep being
> started around the "definition of done" before determining what sufficient
> quality assurance looks like for this feature.

Looking forward to the Test/QA guideline. Thanks for bringing this up.


> the CEP process suggest a wiki page

Added CEP-7 SAI cwiki:
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-7%3A+Storage+Attached+Index

On Sat, 22 Aug 2020 at 01:01, Jason Rutherglen 
wrote:

> > About space efficiency, one of the biggest drawback of SASI was the huge
> space required for index structure when using CONTAINS logic because of the
> decomposition of text columns into n-grams. Will SAI suffer from the same
> issue in future iterations ?
>
> SAI does not have specific ngram support atm, though that may be added
> with tokenizers.  Ngrams do indeed grow the index, that's a user
> decision for faster queries or more disk space.
>
> On Tue, Aug 18, 2020 at 6:05 AM DuyHai Doan  wrote:
> >
> > Thank you Zhao Yang for starting this topic
> >
> > After reading the short design doc, I have a few questions
> >
> > 1) SASI was pretty inefficient indexing wide partitions because the index
> > structure only retains the partition token, not the clustering columns. As
> > per design doc SAI has row id mapping to partition offset, can we hope that
> > indexing wide partition will be more efficient with SAI? One detail that
> > worries me is that in the beginning of the design doc, it is said that the
> > matching rows are post filtered while scanning the partition. Can you
> > confirm or refute that SAI is efficient with wide partitions and provides
> > the partition offsets to the matching rows?
> >
> > 2) About space efficiency, one of the biggest drawbacks of SASI was the huge
> > space required for index structure when using CONTAINS logic because of the
> > decomposition of text columns into n-grams. Will SAI suffer from the same
> > issue in future iterations? I'm anticipating a bit
> >
> > 3) If I'm querying using SAI and providing complete partition key, will it
> > be more efficient than querying without partition key? In other words, does
> > SAI provide any optimisation when partition key is specified?
> >
> > Regards
> >
> > Duy Hai DOAN
> >
> > On Tue, 18 Aug 2020 at 11:39, Mick Semb Wever  wrote:
> >
> > > >
> > > > We are looking forward to the community's feedback and suggestions.
> > > >
> > >
> > >
> > > What comes immediately to mind is testing requirements. It has been
> > > mentioned already that the project's testability and QA guidelines are
> > > inadequate to successfully introduce new features and refactorings to
> the
> > > codebase. During the 4.0 beta phase this was intended to be addressed,
> i.e.
> > > defining more specific QA guidelines for 4.0-rc. This would be an
> important
> > > step towards QA guidelines for all changes and CEPs post-4.0.
> > >
> > > Questions from me
> > >  - How will this be tested, how will its QA status and lifecycle be
> > > defined? (per above)
> > >  - With existing C* code needing to be changed, what is the proposed
> plan
> > > for making those changes ensuring maintained QA, e.g. is there
> separate QA
> > > cycles planned for altering the SPI before adding a new SPI
> implementation?
> > >  - Despite being out of scope, it would be nice to have some idea from
> the
> > > CEP author of when users might still choose afresh 2i or SASI over SAI,
> > >  - Who fills the roles involved? Who are the contributors in this
> DataStax
> > > team? Who is the shepherd? Are there other stakeholders willing to be
> > > involved?
> > >  - Is there a preference to use gdoc instead of the project's wiki, and
> > > why? (the CEP process suggest a wiki page, and feedback on why another
> > > approach is considered better helps evolve the CEP process itself)
> > >
> > > cheers,
> > > Mick
> > >
>
>
>


Re: [DISCUSS] CEP-7 Storage Attached Index

2020-08-19 Thread Jasonstack Zhao Yang
Hi Duy, great questions.

> 1) SASI was pretty inefficient indexing wide partitions because the index
> structure only retains the partition token, not the clustering columns. As
> per design doc SAI has row id mapping to partition offset, can we hope that
> indexing wide partition will be more efficient with SAI? One detail that
> worries me is that in the beginning of the design doc, it is said that the
> matching rows are post filtered while scanning the partition. Can you
> confirm or refute that SAI is efficient with wide partitions and provides
> the partition offsets to the matching rows?

As of now, SAI indexes the partition offset, same as SASI. But during design
we took row-level indexing into consideration, and row-awareness is being
prototyped.

For the record, partition-level indexing works nicely when most rows in the
wide partition match the indexed value. After switching to a row-level index,
when a query matches most rows in a wide partition, the index engine needs to
fall back to partition-level index behavior (scanning the entire partition +
post-filter) instead of fetching single rows many times.
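
The fallback described above could be sketched roughly as follows. This is
only an illustration of the idea, not SAI's actual planner: the function name,
the strategy labels, and the 0.5 threshold are all hypothetical.

```python
def choose_read_strategy(matching_rows: int, partition_rows: int,
                         threshold: float = 0.5) -> str:
    """Pick between fetching single rows and scanning the whole partition.

    `threshold` is a made-up tuning knob for this sketch, not an SAI setting.
    """
    if partition_rows <= 0:
        raise ValueError("partition_rows must be positive")
    selectivity = matching_rows / partition_rows
    if selectivity >= threshold:
        # Most rows match: one sequential scan plus a post-filter is cheaper
        # than many individual row fetches.
        return "scan_partition_and_post_filter"
    # Few rows match: fetch only the matching rows.
    return "fetch_single_rows"


print(choose_read_strategy(matching_rows=9_000, partition_rows=10_000))
print(choose_read_strategy(matching_rows=12, partition_rows=10_000))
```

A real implementation would presumably estimate the match count from index
statistics rather than know it up front; that part is left out here.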

> 2) About space efficiency, one of the biggest drawback of SASI was the huge
> space required for index structure when using CONTAINS logic because of the
> decomposition of text columns into n-grams. Will SAI suffer from the same
> issue in future iterations ? I'm anticipating a bit

Tokenization wasn't part of the CEP scope.

Off the top of my head, I think tokenization does require more space, as
both SAI and SASI need to store matches for every decomposed value. But with
frame-of-reference encoding on row ids, SAI should require less disk space
than SASI.
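
For readers unfamiliar with frame-of-reference encoding, here is a minimal
generic sketch in Python. It shows the general technique (store a block's
minimum once, then small deltas) and makes no claim about SAI's actual on-disk
layout; the function names and sample row ids are hypothetical.

```python
from typing import List, Tuple


def for_encode(row_ids: List[int]) -> Tuple[int, int, List[int]]:
    """Return (base, bits_per_value, deltas) for one block of row ids."""
    if not row_ids:
        raise ValueError("block must not be empty")
    base = min(row_ids)
    deltas = [row_id - base for row_id in row_ids]
    # Every delta fits in as many bits as the largest delta needs.
    bits = max(1, max(deltas).bit_length())
    return base, bits, deltas


def for_decode(base: int, deltas: List[int]) -> List[int]:
    """Rebuild the original row ids from the base and the deltas."""
    return [base + delta for delta in deltas]


block = [1_000_000, 1_000_003, 1_000_010, 1_000_042]
base, bits, deltas = for_encode(block)
raw_bits = max(block).bit_length()
print(f"{bits} bits per value instead of {raw_bits}")
```

Row ids that are close together within a block produce small deltas, which is
why this kind of encoding tends to shrink posting lists.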

> 3) If I'm querying using SAI and providing complete partition key, will it
> be more efficient than querying without partition key. In other words, does
> SAI provide any optimisation when partition key is specified ?

Yes.

* On coordinator, it will find replicas with PK.
* On replica side:
 - it will skip to given PK token
 - there is some pruning based on min/max key of index segments.
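
The min/max pruning mentioned above can be illustrated with a small sketch.
The `Segment` class, its field names, and the integer tokens are assumptions
made for this example; they are not taken from the SAI code base.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Hypothetical index segment summarised by its key-token range."""
    name: str
    min_token: int
    max_token: int


def prune_segments(segments: List[Segment], token: int) -> List[Segment]:
    """Keep only segments whose [min_token, max_token] range covers the token."""
    return [s for s in segments if s.min_token <= token <= s.max_token]


segments = [
    Segment("a", 0, 100),
    Segment("b", 101, 200),
    Segment("c", 50, 150),
]
candidates = prune_segments(segments, token=120)
print([s.name for s in candidates])
```

Segments whose range cannot contain the requested partition key are skipped
without being read, which is where the saving comes from.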

> 4) Are collections, static columns, composite partition key component and
> UDT indexings (at any depth) on the roadmap of SAI ? I strongly believe
> that those features are the bare minimum to make SAI an interesting
> replacement for the native 2nd index as well as SASI. SASI's limited support
> for those advanced data structures has hindered its wide adoption (among
> other issues and bugs)

Collections, static columns, composite partition key are supported.

I think "UDT indexings (at any depth)" can be added because there is no
architectural limitation on SAI or SASI.

I have invited you to slack #cassandra-sai, really appreciate your
participation.


On Tue, 18 Aug 2020 at 19:33, DuyHai Doan  wrote:

> Last but not least
>
> 4) Are collections, static columns, composite partition key component and
> UDT indexings (at any depth) on the roadmap of SAI ? I strongly believe
> that those features are the bare minimum to make SAI an interesting
> replacement for the native 2nd index as well as SASI. SASI's limited support
> for those advanced data structures has hindered its wide adoption (among
> other issues and bugs)
>
> Regards
>
> Duy Hai DOAN
>
> On Tue, 18 Aug 2020 at 13:02, Jasonstack Zhao Yang <
> jasonstack.z...@gmail.com> wrote:
>
> > Mick thanks for your questions.
> >
> > > During the 4.0 beta phase this was intended to be addressed, i.e.
> > > defining more specific QA guidelines for 4.0-rc. This would be an
> > > important step towards QA guidelines for all changes and CEPs post-4.0.
> >
> > Agreed, I think CASSANDRA-15536 (4.0 Quality: Components and Test Plans)
> > has set a good example of QA/Testing.
> >
> > >  - How will this be tested, how will its QA status and lifecycle be
> > > defined? (per above)
> >
> > SAI will follow the same QA/Testing guideline as in CASSANDRA-15536.
> >
> > >  - With existing C* code needing to be changed, what is the proposed
> > > plan for making those changes ensuring maintained QA, e.g. is there
> > > separate QA cycles planned for altering the SPI before adding a new SPI
> > > implementation?
> >
> > The plan is to have interface changes and their new implementations to be
> > reviewed/tested/merged at once to reduce overhead.
> >
> > But if having interface changes reviewed/tested/merged separately helps
> > quality, I don't think anyone will object.
> >
> > > - Despite being out of scope, it would be nice to have some idea from
> > > the CEP author of when users might still choose afresh 2i or SASI over
> > > SAI
> >
> > I'd like SAI to be the only index for users, but this is a decision to be
> > made by the community.
> >
> > > - Who fills the roles involved?
> >
> > Contributors that are still active on C* or related projects:
> >
> > Andres de la Peña
> > Caleb Rackliffe
> > Dan LaRocque
> > Jason Rutherglen
> > Mike Adamson
> > Rocco Varela
> > Zhao 

Re: [DISCUSS] CEP-7 Storage Attached Index

2020-08-18 Thread DuyHai Doan
Last but not least

4) Are collections, static columns, composite partition key component and
UDT indexings (at any depth) on the roadmap of SAI? I strongly believe
that those features are the bare minimum to make SAI an interesting
replacement for the native 2nd index as well as SASI. SASI's limited support
for those advanced data structures has hindered its wide adoption (among
other issues and bugs)

Regards

Duy Hai DOAN

On Tue, 18 Aug 2020 at 13:02, Jasonstack Zhao Yang <
jasonstack.z...@gmail.com> wrote:

> Mick thanks for your questions.
>
> > During the 4.0 beta phase this was intended to be addressed, i.e.
> > defining more specific QA guidelines for 4.0-rc. This would be an important
> > step towards QA guidelines for all changes and CEPs post-4.0.
>
> Agreed, I think CASSANDRA-15536 (4.0 Quality: Components and Test Plans)
> has set a good example of QA/Testing.
>
> >  - How will this be tested, how will its QA status and lifecycle be
> > defined? (per above)
>
> SAI will follow the same QA/Testing guideline as in CASSANDRA-15536.
>
> >  - With existing C* code needing to be changed, what is the proposed plan
> > for making those changes ensuring maintained QA, e.g. is there separate QA
> > cycles planned for altering the SPI before adding a new SPI implementation?
>
> The plan is to have interface changes and their new implementations to be
> reviewed/tested/merged at once to reduce overhead.
>
> But if having interface changes reviewed/tested/merged separately helps
> quality, I don't think anyone will object.
>
> > - Despite being out of scope, it would be nice to have some idea from the
> > CEP author of when users might still choose afresh 2i or SASI over SAI
>
> I'd like SAI to be the only index for users, but this is a decision to be
> made by the community.
>
> > - Who fills the roles involved?
>
> Contributors that are still active on C* or related projects:
>
> Andres de la Peña
> Caleb Rackliffe
> Dan LaRocque
> Jason Rutherglen
> Mike Adamson
> Rocco Varela
> Zhao Yang
>
> I will shepherd.
>
> Anyone that is interested in C* index, feel free to join us at slack
> #cassandra-sai.
>
> > - Is there a preference to use gdoc instead of the project's wiki, and
> > why? (the CEP process suggest a wiki page, and feedback on why another
> > approach is considered better helps evolve the CEP process itself)
>
> Didn't notice wiki is required. Will port CEP to wiki.
>
>
> On Tue, 18 Aug 2020 at 17:39, Mick Semb Wever  wrote:
>
> > >
> > > We are looking forward to the community's feedback and suggestions.
> > >
> >
> >
> > What comes immediately to mind is testing requirements. It has been
> > mentioned already that the project's testability and QA guidelines are
> > inadequate to successfully introduce new features and refactorings to the
> > codebase. During the 4.0 beta phase this was intended to be addressed,
> > i.e. defining more specific QA guidelines for 4.0-rc. This would be an
> > important step towards QA guidelines for all changes and CEPs post-4.0.
> >
> > Questions from me
> >  - How will this be tested, how will its QA status and lifecycle be
> > defined? (per above)
> >  - With existing C* code needing to be changed, what is the proposed plan
> > for making those changes ensuring maintained QA, e.g. is there separate
> > QA cycles planned for altering the SPI before adding a new SPI
> > implementation?
> >  - Despite being out of scope, it would be nice to have some idea from
> > the CEP author of when users might still choose afresh 2i or SASI over SAI,
> >  - Who fills the roles involved? Who are the contributors in this
> > DataStax team? Who is the shepherd? Are there other stakeholders willing
> > to be involved?
> >  - Is there a preference to use gdoc instead of the project's wiki, and
> > why? (the CEP process suggest a wiki page, and feedback on why another
> > approach is considered better helps evolve the CEP process itself)
> >
> > cheers,
> > Mick
> >
>


Re: [DISCUSS] CEP-7 Storage Attached Index

2020-08-18 Thread Benedict Elliott Smith
> SAI will follow the same QA/Testing guideline as in CASSANDRA-15536.

CASSANDRA-15536 might set some good examples for retrospectively shoring up our 
quality assurance, but offers no prescriptions for how we approach the testing 
of new work.  I think the project needs to conclude the discussions that keep 
being started around the "definition of done" before determining what 
sufficient quality assurance looks like for this feature.

I've briefly set out some of my views in an earlier email chain initiated by
Josh, which unfortunately received no response.  The project is generally very
busy right now as we approach the 4.0 release, which I assume is partly why
there has been no movement.  Assuming no further activity from others, as we
get closer to 4.0 (and I have more time) I will try to produce a more formal
proposal for quality assurance for the project, to be debated and agreed.



On 18/08/2020, 12:02, "Jasonstack Zhao Yang"  wrote:

Mick thanks for your questions.

> During the 4.0 beta phase this was intended to be addressed, i.e.
> defining more specific QA guidelines for 4.0-rc. This would be an important
> step towards QA guidelines for all changes and CEPs post-4.0.

Agreed, I think CASSANDRA-15536 (4.0 Quality: Components and Test Plans)
has set a good example of QA/Testing.

>  - How will this be tested, how will its QA status and lifecycle be
> defined? (per above)

SAI will follow the same QA/Testing guideline as in CASSANDRA-15536.

>  - With existing C* code needing to be changed, what is the proposed plan
> for making those changes ensuring maintained QA, e.g. is there separate QA
> cycles planned for altering the SPI before adding a new SPI implementation?

The plan is to have interface changes and their new implementations to be
reviewed/tested/merged at once to reduce overhead.

But if having interface changes reviewed/tested/merged separately helps
quality, I don't think anyone will object.

> - Despite being out of scope, it would be nice to have some idea from the
> CEP author of when users might still choose afresh 2i or SASI over SAI

I'd like SAI to be the only index for users, but this is a decision to be
made by the community.

> - Who fills the roles involved?

Contributors that are still active on C* or related projects:

Andres de la Peña
Caleb Rackliffe
Dan LaRocque
Jason Rutherglen
Mike Adamson
Rocco Varela
Zhao Yang

I will shepherd.

Anyone that is interested in C* index, feel free to join us at slack
#cassandra-sai.

> - Is there a preference to use gdoc instead of the project's wiki, and
> why? (the CEP process suggest a wiki page, and feedback on why another
> approach is considered better helps evolve the CEP process itself)

Didn't notice wiki is required. Will port CEP to wiki.


On Tue, 18 Aug 2020 at 17:39, Mick Semb Wever  wrote:

> >
> > We are looking forward to the community's feedback and suggestions.
> >
>
>
> What comes immediately to mind is testing requirements. It has been
> mentioned already that the project's testability and QA guidelines are
> inadequate to successfully introduce new features and refactorings to the
> codebase. During the 4.0 beta phase this was intended to be addressed, i.e.
> defining more specific QA guidelines for 4.0-rc. This would be an important
> step towards QA guidelines for all changes and CEPs post-4.0.
>
> Questions from me
>  - How will this be tested, how will its QA status and lifecycle be
> defined? (per above)
>  - With existing C* code needing to be changed, what is the proposed plan
> for making those changes ensuring maintained QA, e.g. is there separate QA
> cycles planned for altering the SPI before adding a new SPI implementation?
>  - Despite being out of scope, it would be nice to have some idea from the
> CEP author of when users might still choose afresh 2i or SASI over SAI,
>  - Who fills the roles involved? Who are the contributors in this DataStax
> team? Who is the shepherd? Are there other stakeholders willing to be
> involved?
>  - Is there a preference to use gdoc instead of the project's wiki, and
> why? (the CEP process suggest a wiki page, and feedback on why another
> approach is considered better helps evolve the CEP process itself)
>
> cheers,
> Mick
>






Re: [DISCUSS] CEP-7 Storage Attached Index

2020-08-18 Thread DuyHai Doan
Thank you Zhao Yang for starting this topic

After reading the short design doc, I have a few questions

1) SASI was pretty inefficient at indexing wide partitions because the index
structure only retains the partition token, not the clustering columns. As
per the design doc, SAI has a row id mapping to partition offset, so can we
hope that indexing wide partitions will be more efficient with SAI? One detail
that worries me is that at the beginning of the design doc, it is said that
the matching rows are post-filtered while scanning the partition. Can you
confirm or deny that SAI is efficient with wide partitions and provides
the partition offsets to the matching rows?

2) About space efficiency, one of the biggest drawbacks of SASI was the huge
space required for the index structure when using CONTAINS logic because of
the decomposition of text columns into n-grams. Will SAI suffer from the same
issue in future iterations? I'm anticipating a bit.

3) If I'm querying using SAI and providing the complete partition key, will it
be more efficient than querying without the partition key? In other words,
does SAI provide any optimisation when the partition key is specified?
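
To make the space concern in question 2 concrete, here is a toy Python sketch of how many index terms a naive substring decomposition produces for a single text value. It assumes every contiguous substring is indexed; `ngrams` and `all_ngrams` are hypothetical helpers written for this illustration and do not reproduce the actual SASI or SAI on-disk format.

```python
def ngrams(term, n):
    """Return every contiguous substring of length n (empty list if term is shorter)."""
    return [term[i:i + n] for i in range(len(term) - n + 1)]


def all_ngrams(term, min_len=1):
    """Return the distinct substrings of term with length >= min_len."""
    return {g for n in range(min_len, len(term) + 1) for g in ngrams(term, n)}


if __name__ == "__main__":
    for value in ("abcd", "abcdefgh"):
        # For a value with no repeated characters, the count is len * (len + 1) / 2,
        # so the number of indexed terms grows quadratically with the value length.
        print(value, len(value), len(all_ngrams(value)))
```

Under this assumption a single column value of length L can contribute up to L * (L + 1) / 2 terms, which is the kind of growth the question is worried about.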

Regards

Duy Hai DOAN

On Tue, 18 Aug 2020 at 11:39, Mick Semb Wever wrote:

> >
> > We are looking forward to the community's feedback and suggestions.
> >
>
>
> What comes immediately to mind is testing requirements. It has been
> mentioned already that the project's testability and QA guidelines are
> inadequate to successfully introduce new features and refactorings to the
> codebase. During the 4.0 beta phase this was intended to be addressed, i.e.
> defining more specific QA guidelines for 4.0-rc. This would be an important
> step towards QA guidelines for all changes and CEPs post-4.0.
>
> Questions from me
>  - How will this be tested, how will its QA status and lifecycle be
> defined? (per above)
>  - With existing C* code needing to be changed, what is the proposed plan
> for making those changes ensuring maintained QA, e.g. is there separate QA
> cycles planned for altering the SPI before adding a new SPI implementation?
>  - Despite being out of scope, it would be nice to have some idea from the
> CEP author of when users might still choose afresh 2i or SASI over SAI,
>  - Who fills the roles involved? Who are the contributors in this DataStax
> team? Who is the shepherd? Are there other stakeholders willing to be
> involved?
>  - Is there a preference to use gdoc instead of the project's wiki, and
> why? (the CEP process suggest a wiki page, and feedback on why another
> approach is considered better helps evolve the CEP process itself)
>
> cheers,
> Mick
>


Re: [DISCUSS] CEP-7 Storage Attached Index

2020-08-18 Thread Jasonstack Zhao Yang
Mick thanks for your questions.

> During the 4.0 beta phase this was intended to be addressed, i.e.
> defining more specific QA guidelines for 4.0-rc. This would be an important
> step towards QA guidelines for all changes and CEPs post-4.0.

Agreed, I think CASSANDRA-15536 (4.0 Quality: Components and Test Plans)
has set a good example of QA/Testing.

>  - How will this be tested, how will its QA status and lifecycle be
> defined? (per above)

SAI will follow the same QA/Testing guideline as in CASSANDRA-15536.

>  - With existing C* code needing to be changed, what is the proposed plan
> for making those changes ensuring maintained QA, e.g. is there separate QA
> cycles planned for altering the SPI before adding a new SPI implementation?

The plan is to have the interface changes and their new implementations
reviewed, tested, and merged together to reduce overhead.

But if having interface changes reviewed/tested/merged separately helps
quality, I don't think anyone will object.

> - Despite being out of scope, it would be nice to have some idea from the
> CEP author of when users might still choose afresh 2i or SASI over SAI

I'd like SAI to be the only index for users, but this is a decision to be
made by the community.

> - Who fills the roles involved?

Contributors that are still active on C* or related projects:

Andres de la Peña
Caleb Rackliffe
Dan LaRocque
Jason Rutherglen
Mike Adamson
Rocco Varela
Zhao Yang

I will shepherd.

Anyone interested in C* indexing should feel free to join us on Slack in
#cassandra-sai.

> - Is there a preference to use gdoc instead of the project's wiki, and
> why? (the CEP process suggest a wiki page, and feedback on why another
> approach is considered better helps evolve the CEP process itself)

I didn't notice that a wiki page is required. I will port the CEP to the wiki.


On Tue, 18 Aug 2020 at 17:39, Mick Semb Wever  wrote:

> >
> > We are looking forward to the community's feedback and suggestions.
> >
>
>
> What comes immediately to mind is testing requirements. It has been
> mentioned already that the project's testability and QA guidelines are
> inadequate to successfully introduce new features and refactorings to the
> codebase. During the 4.0 beta phase this was intended to be addressed, i.e.
> defining more specific QA guidelines for 4.0-rc. This would be an important
> step towards QA guidelines for all changes and CEPs post-4.0.
>
> Questions from me
>  - How will this be tested, how will its QA status and lifecycle be
> defined? (per above)
>  - With existing C* code needing to be changed, what is the proposed plan
> for making those changes ensuring maintained QA, e.g. is there separate QA
> cycles planned for altering the SPI before adding a new SPI implementation?
>  - Despite being out of scope, it would be nice to have some idea from the
> CEP author of when users might still choose afresh 2i or SASI over SAI,
>  - Who fills the roles involved? Who are the contributors in this DataStax
> team? Who is the shepherd? Are there other stakeholders willing to be
> involved?
>  - Is there a preference to use gdoc instead of the project's wiki, and
> why? (the CEP process suggest a wiki page, and feedback on why another
> approach is considered better helps evolve the CEP process itself)
>
> cheers,
> Mick
>


Re: [DISCUSS] CEP-7 Storage Attached Index

2020-08-18 Thread Mick Semb Wever
>
> We are looking forward to the community's feedback and suggestions.
>


What comes immediately to mind is testing requirements. It has been
mentioned already that the project's testability and QA guidelines are
inadequate to successfully introduce new features and refactorings to the
codebase. During the 4.0 beta phase this was intended to be addressed, i.e.
defining more specific QA guidelines for 4.0-rc. This would be an important
step towards QA guidelines for all changes and CEPs post-4.0.

Questions from me
 - How will this be tested, how will its QA status and lifecycle be
defined? (per above)
 - With existing C* code needing to be changed, what is the proposed plan
for making those changes ensuring maintained QA, e.g. is there separate QA
cycles planned for altering the SPI before adding a new SPI implementation?
 - Despite being out of scope, it would be nice to have some idea from the
CEP author of when users might still choose afresh 2i or SASI over SAI,
 - Who fills the roles involved? Who are the contributors in this DataStax
team? Who is the shepherd? Are there other stakeholders willing to be
involved?
 - Is there a preference to use gdoc instead of the project's wiki, and
why? (the CEP process suggest a wiki page, and feedback on why another
approach is considered better helps evolve the CEP process itself)

cheers,
Mick


[DISCUSS] CEP-7 Storage Attached Index

2020-08-17 Thread Jasonstack Zhao Yang
Hi,

As per the CEP guideline, I am sending this email to start a discussion
about Storage-Attached-Index[1][2] for Apache Cassandra.

A team at DataStax has developed a new index implementation, called Storage
Attached Index (SAI), that builds on the advances made by SASI. SAI improves:

* disk usage, by sharing common data between multiple column indexes on
the same table and by better compression of on-disk structures.
* numeric range query performance with a modified KD-tree, and collection type
support.
* compaction performance and stability for larger data sets.
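
As a rough illustration of the first point, the Python sketch below keeps a single row-id-to-primary-key list per table and lets each column index store only integer row ids. This is a toy model under assumed names (`SharedRowIndex`, `add_row`, `lookup` are hypothetical); the actual SAI on-disk layout is described in the CEP document and is not reproduced here.

```python
from collections import defaultdict


class SharedRowIndex:
    """Toy model: primary keys are stored once and shared by every column index."""

    def __init__(self):
        # Shared structure: position in this list is the row id.
        self.primary_keys = []
        # Per-column structure: column -> value -> list of row ids.
        self.postings = defaultdict(lambda: defaultdict(list))

    def add_row(self, primary_key, indexed_values):
        """Register a row and index each (column, value) pair by row id."""
        row_id = len(self.primary_keys)
        self.primary_keys.append(primary_key)
        for column, value in indexed_values.items():
            self.postings[column][value].append(row_id)

    def lookup(self, column, value):
        """Resolve matching row ids back to primary keys via the shared list."""
        return [self.primary_keys[r] for r in self.postings[column].get(value, [])]


if __name__ == "__main__":
    index = SharedRowIndex()
    index.add_row(("p1", 1), {"age": 30, "city": "Oslo"})
    index.add_row(("p1", 2), {"age": 30, "city": "Rome"})
    print(index.lookup("age", 30))
    print(index.lookup("city", "Rome"))
```

In this model, adding another indexed column costs only a new postings map of small integers; the primary keys themselves are never duplicated per column.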

There is a more detailed explanation of the SAI design in the CEP document.
To make the technical discussion simpler, we created a Slack channel,
#cassandra-sai.

We are looking forward to the community's feedback and suggestions.


Regards,

Zhao Yang


[1]
https://docs.google.com/document/d/1V830eAMmQAspjJdjviVZIaSolVGvZ1hVsqOLWyV0DS4/edit#heading=h.cgm22puztagk

[2] https://issues.apache.org/jira/browse/CASSANDRA-16052