Re: Efficient model for a sorting

2016-10-04 Thread Benjamin Roth
I started off with 3.0.6 and for my personal use case(s) they had the same
bugs as tick tock.

2016-10-04 19:03 GMT+02:00 Jonathan Haddad :

> I strongly recommend avoiding tick tock. You'll be one of the only people
> putting it in prod and will likely hit a number of weird issues nobody will
> be able to help you with.
> On Tue, Oct 4, 2016 at 12:40 PM Benjamin Roth 
> wrote:
>
>> I have the impression, that not the tick-tock is the real problem but MVs
>> are not really battle-tested yet.
>> Depending on the model, they put much more complexity on a cluster and
>> it's behaviour under heavy load. Especially if you are going to create an
>> MV with a different partition key than the base table this might be a shot
>> in the head.
>> At least I was able to bring my cluster down many times just by throwing
>> a few queries too much at it or by running some big repairs with reaper.
>> Only since some days, things seem to go smoothly after having struggled
>> about 2 months with very different kind of issues.
>>
>> We'll see ... most probably I will stick with the latest version. After
>> all it seems to work ok, I gained a lot of experience in running and
>> troubleshooting and to deal with bugs and maybe I am so able to contribute
>> a bit to further development.
>>
>> 2016-10-04 18:26 GMT+02:00 Vladimir Yudovin :
>>
>> >Would you consider 3.0.x to be more stable than 3.x?
>> I guess yes, but there are some discussion on this list:
>>
>> (C)* stable version after 3.5
>> 
>> Upgrade from 3.0.6 to 3.7.
>> 
>>
>> It seems to be eternal topic till tick-tock approach stabilizes.
>>
>>
>> Best regards, Vladimir Yudovin,
>>
>>
>> *Winguzone Inc  - Hosted Cloud Cassandra
>> on Azure and SoftLayer.Launch your cluster in minutes.*
>>
>>
>>  On Tue, 04 Oct 2016 12:19:13 -0400 *Benjamin
>> Roth>* wrote 
>>
>> I use the self-compiled master (3.10, ticktock). I had to fix a severe
>> bug on my own and decided to go with the latest code.
>> Would you consider 3.0.x to be more stable than 3.x?
>>
>> 2016-10-04 18:14 GMT+02:00 Vladimir Yudovin :
>>
>> Hi Benjamin!
>>
>> >we now use CS 3.x and have been advised that 3.x is still not considered
>> really production ready.
>>
>> Did you consider using of 3.0.9? Actually it's 3.0 with almost an year
>> fixes.
>>
>>
>> Best regards, Vladimir Yudovin,
>>
>>
>> *Winguzone Inc  - Hosted Cloud Cassandra
>> on Azure and SoftLayer.Launch your cluster in minutes.*
>>
>>
>>  On Tue, 04 Oct 2016 07:27:54 -0400 *Benjamin Roth
>> >* wrote 
>>
>> Hi!
>>
>> I have a frequently used pattern which seems to be quite costly in CS.
>> The pattern is always the same: I have a unique key and a sorting by a
>> different field.
>>
>> To give an example, here a real life example from our model:
>> CREATE TABLE visits.visits_in (
>> user_id int,
>> user_id_visitor int,
>> created timestamp,
>> PRIMARY KEY (user_id, user_id_visitor)
>> ) WITH CLUSTERING ORDER BY (user_id_visitor ASC)
>>
>> CREATE MATERIALIZED VIEW visits.visits_in_sorted_mv AS
>> SELECT user_id, created, user_id_visitor
>> FROM visits.visits_in
>> WHERE user_id IS NOT NULL AND created IS NOT NULL AND user_id_visitor
>> IS NOT NULL
>> PRIMARY KEY (user_id, created, user_id_visitor)
>> WITH CLUSTERING ORDER BY (created DESC, user_id_visitor DESC)
>>
>> This simply represents people, that visited my profile sorted by date
>> desc but only one entry per visitor.
>> Other examples with the same pattern could be a whats-app-like inbox
>> where the last message of each sender is shown by date desc. There are lots
>> of examples for that pattern.
>>
>> E.g. in redis I'd just use a sorted set, where the key could be like
>> "visits_${user_id}", set key would be user_id_visitor and score
>> the created timestamp.
>> In MySQL I'd create the table with PK on user_id + user_id_visitor and
>> create an index on user_id + created
>> In C* i use an MV.
>>
>> Is this the most efficient approach?
>> I also could have done this without an MV but then the situation in our
>> app would be far more complex.
>> I know that denormalization is a common pattern in C* and I don't
>> hesitate to use it but in this case, it is not as simple as it's not an
>> append-only case but updates have to be handled correctly.
>> If it is the first visit of a user, it's that simple, just 2 inserts in
>> base table + denormalized table. But on a 2nd or 3rd visit, the 1st or 2nd
>> visit has to be deleted from the 

Re: Efficient model for a sorting

2016-10-04 Thread Jonathan Haddad
I strongly recommend avoiding tick tock. You'll be one of the only people
putting it in prod and will likely hit a number of weird issues nobody will
be able to help you with.
On Tue, Oct 4, 2016 at 12:40 PM Benjamin Roth 
wrote:

> I have the impression, that not the tick-tock is the real problem but MVs
> are not really battle-tested yet.
> Depending on the model, they put much more complexity on a cluster and
> it's behaviour under heavy load. Especially if you are going to create an
> MV with a different partition key than the base table this might be a shot
> in the head.
> At least I was able to bring my cluster down many times just by throwing a
> few queries too much at it or by running some big repairs with reaper.
> Only since some days, things seem to go smoothly after having struggled
> about 2 months with very different kind of issues.
>
> We'll see ... most probably I will stick with the latest version. After
> all it seems to work ok, I gained a lot of experience in running and
> troubleshooting and to deal with bugs and maybe I am so able to contribute
> a bit to further development.
>
> 2016-10-04 18:26 GMT+02:00 Vladimir Yudovin :
>
> >Would you consider 3.0.x to be more stable than 3.x?
> I guess yes, but there are some discussion on this list:
>
> (C)* stable version after 3.5
> 
> Upgrade from 3.0.6 to 3.7.
> 
>
> It seems to be eternal topic till tick-tock approach stabilizes.
>
>
> Best regards, Vladimir Yudovin,
>
>
> *Winguzone Inc  - Hosted Cloud Cassandra
> on Azure and SoftLayer.Launch your cluster in minutes.*
>
>
>  On Tue, 04 Oct 2016 12:19:13 -0400 *Benjamin
> Roth>* wrote 
>
> I use the self-compiled master (3.10, ticktock). I had to fix a severe bug
> on my own and decided to go with the latest code.
> Would you consider 3.0.x to be more stable than 3.x?
>
> 2016-10-04 18:14 GMT+02:00 Vladimir Yudovin :
>
> Hi Benjamin!
>
> >we now use CS 3.x and have been advised that 3.x is still not considered
> really production ready.
>
> Did you consider using of 3.0.9? Actually it's 3.0 with almost an year
> fixes.
>
>
> Best regards, Vladimir Yudovin,
>
>
> *Winguzone Inc  - Hosted Cloud Cassandra
> on Azure and SoftLayer.Launch your cluster in minutes.*
>
>
>  On Tue, 04 Oct 2016 07:27:54 -0400 *Benjamin Roth
> >* wrote 
>
> Hi!
>
> I have a frequently used pattern which seems to be quite costly in CS. The
> pattern is always the same: I have a unique key and a sorting by a
> different field.
>
> To give an example, here a real life example from our model:
> CREATE TABLE visits.visits_in (
> user_id int,
> user_id_visitor int,
> created timestamp,
> PRIMARY KEY (user_id, user_id_visitor)
> ) WITH CLUSTERING ORDER BY (user_id_visitor ASC)
>
> CREATE MATERIALIZED VIEW visits.visits_in_sorted_mv AS
> SELECT user_id, created, user_id_visitor
> FROM visits.visits_in
> WHERE user_id IS NOT NULL AND created IS NOT NULL AND user_id_visitor
> IS NOT NULL
> PRIMARY KEY (user_id, created, user_id_visitor)
> WITH CLUSTERING ORDER BY (created DESC, user_id_visitor DESC)
>
> This simply represents people, that visited my profile sorted by date desc
> but only one entry per visitor.
> Other examples with the same pattern could be a whats-app-like inbox where
> the last message of each sender is shown by date desc. There are lots of
> examples for that pattern.
>
> E.g. in redis I'd just use a sorted set, where the key could be like
> "visits_${user_id}", set key would be user_id_visitor and score
> the created timestamp.
> In MySQL I'd create the table with PK on user_id + user_id_visitor and
> create an index on user_id + created
> In C* i use an MV.
>
> Is this the most efficient approach?
> I also could have done this without an MV but then the situation in our
> app would be far more complex.
> I know that denormalization is a common pattern in C* and I don't hesitate
> to use it but in this case, it is not as simple as it's not an append-only
> case but updates have to be handled correctly.
> If it is the first visit of a user, it's that simple, just 2 inserts in
> base table + denormalized table. But on a 2nd or 3rd visit, the 1st or 2nd
> visit has to be deleted from the denormalized table before. Otherwise the
> visit would not be unique any more.
> Handling this case without an MV requires a lot more effort, I guess even
> more effort than just using an MV.
> 1. You need kind of app-side locking to deal with race conditions
> 2. Read before write is 

Re: Efficient model for a sorting

2016-10-04 Thread Benjamin Roth
I have the impression, that not the tick-tock is the real problem but MVs
are not really battle-tested yet.
Depending on the model, they put much more complexity on a cluster and it's
behaviour under heavy load. Especially if you are going to create an MV
with a different partition key than the base table this might be a shot in
the head.
At least I was able to bring my cluster down many times just by throwing a
few queries too much at it or by running some big repairs with reaper.
Only since some days, things seem to go smoothly after having struggled
about 2 months with very different kind of issues.

We'll see ... most probably I will stick with the latest version. After all
it seems to work ok, I gained a lot of experience in running and
troubleshooting and to deal with bugs and maybe I am so able to contribute
a bit to further development.

2016-10-04 18:26 GMT+02:00 Vladimir Yudovin :

> >Would you consider 3.0.x to be more stable than 3.x?
> I guess yes, but there are some discussion on this list:
>
> (C)* stable version after 3.5
> 
> Upgrade from 3.0.6 to 3.7.
> 
>
> It seems to be eternal topic till tick-tock approach stabilizes.
>
>
> Best regards, Vladimir Yudovin,
>
>
> *Winguzone Inc  - Hosted Cloud Cassandra
> on Azure and SoftLayer.Launch your cluster in minutes.*
>
>
>  On Tue, 04 Oct 2016 12:19:13 -0400 *Benjamin
> Roth>* wrote 
>
> I use the self-compiled master (3.10, ticktock). I had to fix a severe bug
> on my own and decided to go with the latest code.
> Would you consider 3.0.x to be more stable than 3.x?
>
> 2016-10-04 18:14 GMT+02:00 Vladimir Yudovin :
>
> Hi Benjamin!
>
> >we now use CS 3.x and have been advised that 3.x is still not considered
> really production ready.
>
> Did you consider using of 3.0.9? Actually it's 3.0 with almost an year
> fixes.
>
>
> Best regards, Vladimir Yudovin,
>
>
> *Winguzone Inc  - Hosted Cloud Cassandra
> on Azure and SoftLayer.Launch your cluster in minutes.*
>
>
>  On Tue, 04 Oct 2016 07:27:54 -0400 *Benjamin Roth
> >* wrote 
>
> Hi!
>
> I have a frequently used pattern which seems to be quite costly in CS. The
> pattern is always the same: I have a unique key and a sorting by a
> different field.
>
> To give an example, here a real life example from our model:
> CREATE TABLE visits.visits_in (
> user_id int,
> user_id_visitor int,
> created timestamp,
> PRIMARY KEY (user_id, user_id_visitor)
> ) WITH CLUSTERING ORDER BY (user_id_visitor ASC)
>
> CREATE MATERIALIZED VIEW visits.visits_in_sorted_mv AS
> SELECT user_id, created, user_id_visitor
> FROM visits.visits_in
> WHERE user_id IS NOT NULL AND created IS NOT NULL AND user_id_visitor
> IS NOT NULL
> PRIMARY KEY (user_id, created, user_id_visitor)
> WITH CLUSTERING ORDER BY (created DESC, user_id_visitor DESC)
>
> This simply represents people, that visited my profile sorted by date desc
> but only one entry per visitor.
> Other examples with the same pattern could be a whats-app-like inbox where
> the last message of each sender is shown by date desc. There are lots of
> examples for that pattern.
>
> E.g. in redis I'd just use a sorted set, where the key could be like
> "visits_${user_id}", set key would be user_id_visitor and score
> the created timestamp.
> In MySQL I'd create the table with PK on user_id + user_id_visitor and
> create an index on user_id + created
> In C* i use an MV.
>
> Is this the most efficient approach?
> I also could have done this without an MV but then the situation in our
> app would be far more complex.
> I know that denormalization is a common pattern in C* and I don't hesitate
> to use it but in this case, it is not as simple as it's not an append-only
> case but updates have to be handled correctly.
> If it is the first visit of a user, it's that simple, just 2 inserts in
> base table + denormalized table. But on a 2nd or 3rd visit, the 1st or 2nd
> visit has to be deleted from the denormalized table before. Otherwise the
> visit would not be unique any more.
> Handling this case without an MV requires a lot more effort, I guess even
> more effort than just using an MV.
> 1. You need kind of app-side locking to deal with race conditions
> 2. Read before write is required to determine if an old record has to be
> deleted
> 3. At least CL_QUORUM is required to make sure that read before write is
> always consistent
> 4. Old record has to be deleted on update
>
> I guess, using an MV here is more efficient as there is less roundtrip
> between C* and the 

Re: Efficient model for a sorting

2016-10-04 Thread Vladimir Yudovin
Would you consider 3.0.x to be more stable than 3.x?
I guess yes, but there are some discussion on this list:

 (C)* stable version after 3.5
 Upgrade from 3.0.6 to 3.7.


It seems to be eternal topic till tick-tock approach stabilizes.


Best regards, Vladimir Yudovin, 
Winguzone Inc - Hosted Cloud Cassandra on Azure and SoftLayer.
Launch your cluster in minutes.




 On Tue, 04 Oct 2016 12:19:13 -0400 Benjamin 
Rothbenjamin.r...@jaumo.com wrote  

I use the self-compiled master (3.10, ticktock). I had to fix a severe bug on 
my own and decided to go with the latest code.Would you consider 3.0.x to be 
more stable than 3.x?


2016-10-04 18:14 GMT+02:00 Vladimir Yudovin vla...@winguzone.com:
Hi Benjamin!

we now use CS 3.x and have been advised that 3.x is still not considered 
really production ready.

Did you consider using of 3.0.9? Actually it's 3.0 with almost an year fixes.


Best regards, Vladimir Yudovin, 
Winguzone Inc - Hosted Cloud Cassandra on Azure and SoftLayer.
Launch your cluster in minutes.




 On Tue, 04 Oct 2016 07:27:54 -0400 Benjamin Roth 
benjamin.r...@jaumo.com wrote  

Hi!


I have a frequently used pattern which seems to be quite costly in CS. The 
pattern is always the same: I have a unique key and a sorting by a different 
field.


To give an example, here a real life example from our model:
CREATE TABLE visits.visits_in (
user_id int,
user_id_visitor int,
created timestamp,
PRIMARY KEY (user_id, user_id_visitor)
) WITH CLUSTERING ORDER BY (user_id_visitor ASC)



CREATE MATERIALIZED VIEW visits.visits_in_sorted_mv AS
SELECT user_id, created, user_id_visitor
FROM visits.visits_in
WHERE user_id IS NOT NULL AND created IS NOT NULL AND user_id_visitor IS 
NOT NULL
PRIMARY KEY (user_id, created, user_id_visitor)
WITH CLUSTERING ORDER BY (created DESC, user_id_visitor DESC)


This simply represents people, that visited my profile sorted by date desc but 
only one entry per visitor.
Other examples with the same pattern could be a whats-app-like inbox where the 
last message of each sender is shown by date desc. There are lots of examples 
for that pattern.




E.g. in redis I'd just use a sorted set, where the key could be like 
"visits_${user_id}", set key would be user_id_visitor and score the created 
timestamp.

In MySQL I'd create the table with PK on user_id + user_id_visitor and create 
an index on user_id + created
In C* i use an MV.


Is this the most efficient approach?
I also could have done this without an MV but then the situation in our app 
would be far more complex.
I know that denormalization is a common pattern in C* and I don't hesitate to 
use it but in this case, it is not as simple as it's not an append-only case 
but updates have to be handled correctly.
If it is the first visit of a user, it's that simple, just 2 inserts in base 
table + denormalized table. But on a 2nd or 3rd visit, the 1st or 2nd visit has 
to be deleted from the denormalized table before. Otherwise the visit would not 
be unique any more.
Handling this case without an MV requires a lot more effort, I guess even more 
effort than just using an MV. 
1. You need kind of app-side locking to deal with race conditions
2. Read before write is required to determine if an old record has to be deleted
3. At least CL_QUORUM is required to make sure that read before write is always 
consistent
4. Old record has to be deleted on update


I guess, using an MV here is more efficient as there is less roundtrip between 
C* and the app to do all that and the MV does not require strong consistency as 
MV updates are always local and are eventual consistent when the base table is. 
So there is also no need for distributed locks.


I ask all this as we now use CS 3.x and have been advised that 3.x is still not 
considered really production ready.


I guess in a perfect world, this wouldn't even require an MV if SASI indexes 
could be created over more than 1 column. E.g. in MySQL this case is nothing 
else than a BTree. AFAIK SASI indices are also BTrees, filtering by Partition 
Key (which should to be done anyway) and sorting by a field would perfectly do 
the trick. But from the docs, this is not possible right now.



Does anyone see a better solution or are all my assumptions correct?



-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

 
 












-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

 
 






Re: Efficient model for a sorting

2016-10-04 Thread Benjamin Roth
I use the self-compiled master (3.10, ticktock). I had to fix a severe bug
on my own and decided to go with the latest code.
Would you consider 3.0.x to be more stable than 3.x?

2016-10-04 18:14 GMT+02:00 Vladimir Yudovin :

> Hi Benjamin!
>
> >we now use CS 3.x and have been advised that 3.x is still not considered
> really production ready.
>
> Did you consider using of 3.0.9? Actually it's 3.0 with almost an year
> fixes.
>
>
> Best regards, Vladimir Yudovin,
>
>
> *Winguzone Inc  - Hosted Cloud Cassandra
> on Azure and SoftLayer.Launch your cluster in minutes.*
>
>
>  On Tue, 04 Oct 2016 07:27:54 -0400 *Benjamin Roth
> >* wrote 
>
> Hi!
>
> I have a frequently used pattern which seems to be quite costly in CS. The
> pattern is always the same: I have a unique key and a sorting by a
> different field.
>
> To give an example, here a real life example from our model:
> CREATE TABLE visits.visits_in (
> user_id int,
> user_id_visitor int,
> created timestamp,
> PRIMARY KEY (user_id, user_id_visitor)
> ) WITH CLUSTERING ORDER BY (user_id_visitor ASC)
>
> CREATE MATERIALIZED VIEW visits.visits_in_sorted_mv AS
> SELECT user_id, created, user_id_visitor
> FROM visits.visits_in
> WHERE user_id IS NOT NULL AND created IS NOT NULL AND user_id_visitor
> IS NOT NULL
> PRIMARY KEY (user_id, created, user_id_visitor)
> WITH CLUSTERING ORDER BY (created DESC, user_id_visitor DESC)
>
> This simply represents people, that visited my profile sorted by date desc
> but only one entry per visitor.
> Other examples with the same pattern could be a whats-app-like inbox where
> the last message of each sender is shown by date desc. There are lots of
> examples for that pattern.
>
> E.g. in redis I'd just use a sorted set, where the key could be like
> "visits_${user_id}", set key would be user_id_visitor and score
> the created timestamp.
> In MySQL I'd create the table with PK on user_id + user_id_visitor and
> create an index on user_id + created
> In C* i use an MV.
>
> Is this the most efficient approach?
> I also could have done this without an MV but then the situation in our
> app would be far more complex.
> I know that denormalization is a common pattern in C* and I don't hesitate
> to use it but in this case, it is not as simple as it's not an append-only
> case but updates have to be handled correctly.
> If it is the first visit of a user, it's that simple, just 2 inserts in
> base table + denormalized table. But on a 2nd or 3rd visit, the 1st or 2nd
> visit has to be deleted from the denormalized table before. Otherwise the
> visit would not be unique any more.
> Handling this case without an MV requires a lot more effort, I guess even
> more effort than just using an MV.
> 1. You need kind of app-side locking to deal with race conditions
> 2. Read before write is required to determine if an old record has to be
> deleted
> 3. At least CL_QUORUM is required to make sure that read before write is
> always consistent
> 4. Old record has to be deleted on update
>
> I guess, using an MV here is more efficient as there is less roundtrip
> between C* and the app to do all that and the MV does not require strong
> consistency as MV updates are always local and are eventual consistent when
> the base table is. So there is also no need for distributed locks.
>
> I ask all this as we now use CS 3.x and have been advised that 3.x is
> still not considered really production ready.
>
> I guess in a perfect world, this wouldn't even require an MV if SASI
> indexes could be created over more than 1 column. E.g. in MySQL this case
> is nothing else than a BTree. AFAIK SASI indices are also BTrees, filtering
> by Partition Key (which should to be done anyway) and sorting by a field
> would perfectly do the trick. But from the docs, this is not possible right
> now.
>
> Does anyone see a better solution or are all my assumptions correct?
>
> --
> Benjamin Roth
> Prokurist
>
> Jaumo GmbH · www.jaumo.com
> Wehrstraße 46 · 73035 Göppingen · Germany
> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>
>
>
>


-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer


Re: Efficient model for a sorting

2016-10-04 Thread Vladimir Yudovin
Hi Benjamin!

we now use CS 3.x and have been advised that 3.x is still not considered 
really production ready.

Did you consider using of 3.0.9? Actually it's 3.0 with almost an year fixes.


Best regards, Vladimir Yudovin, 
Winguzone Inc - Hosted Cloud Cassandra on Azure and SoftLayer.
Launch your cluster in minutes.




 On Tue, 04 Oct 2016 07:27:54 -0400 Benjamin Roth 
benjamin.r...@jaumo.com wrote  

Hi!


I have a frequently used pattern which seems to be quite costly in CS. The 
pattern is always the same: I have a unique key and a sorting by a different 
field.


To give an example, here a real life example from our model:
CREATE TABLE visits.visits_in (
user_id int,
user_id_visitor int,
created timestamp,
PRIMARY KEY (user_id, user_id_visitor)
) WITH CLUSTERING ORDER BY (user_id_visitor ASC)



CREATE MATERIALIZED VIEW visits.visits_in_sorted_mv AS
SELECT user_id, created, user_id_visitor
FROM visits.visits_in
WHERE user_id IS NOT NULL AND created IS NOT NULL AND user_id_visitor IS 
NOT NULL
PRIMARY KEY (user_id, created, user_id_visitor)
WITH CLUSTERING ORDER BY (created DESC, user_id_visitor DESC)


This simply represents people, that visited my profile sorted by date desc but 
only one entry per visitor.
Other examples with the same pattern could be a whats-app-like inbox where the 
last message of each sender is shown by date desc. There are lots of examples 
for that pattern.




E.g. in redis I'd just use a sorted set, where the key could be like 
"visits_${user_id}", set key would be user_id_visitor and score the created 
timestamp.

In MySQL I'd create the table with PK on user_id + user_id_visitor and create 
an index on user_id + created
In C* i use an MV.


Is this the most efficient approach?
I also could have done this without an MV but then the situation in our app 
would be far more complex.
I know that denormalization is a common pattern in C* and I don't hesitate to 
use it but in this case, it is not as simple as it's not an append-only case 
but updates have to be handled correctly.
If it is the first visit of a user, it's that simple, just 2 inserts in base 
table + denormalized table. But on a 2nd or 3rd visit, the 1st or 2nd visit has 
to be deleted from the denormalized table before. Otherwise the visit would not 
be unique any more.
Handling this case without an MV requires a lot more effort, I guess even more 
effort than just using an MV. 
1. You need kind of app-side locking to deal with race conditions
2. Read before write is required to determine if an old record has to be deleted
3. At least CL_QUORUM is required to make sure that read before write is always 
consistent
4. Old record has to be deleted on update


I guess, using an MV here is more efficient as there is less roundtrip between 
C* and the app to do all that and the MV does not require strong consistency as 
MV updates are always local and are eventual consistent when the base table is. 
So there is also no need for distributed locks.


I ask all this as we now use CS 3.x and have been advised that 3.x is still not 
considered really production ready.


I guess in a perfect world, this wouldn't even require an MV if SASI indexes 
could be created over more than 1 column. E.g. in MySQL this case is nothing 
else than a BTree. AFAIK SASI indices are also BTrees, filtering by Partition 
Key (which should to be done anyway) and sorting by a field would perfectly do 
the trick. But from the docs, this is not possible right now.



Does anyone see a better solution or are all my assumptions correct?



-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

 
 






Re: Efficient model for a sorting

2016-10-04 Thread DuyHai Doan
MV build is also async.

In the end it's MV maintenance cost vs Lucene index maintenance cost. I
don't have clear figure to judge which one is better. Maybe you should
benchmark yourself. Anyway I'll be interested by the results

On Tue, Oct 4, 2016 at 3:05 PM, Dorian Hoxha  wrote:

> On lucene you can query+filter+sort on a single shard, so it should be
> better than MV/sasi. The index building is a little async though.
>
> On Tue, Oct 4, 2016 at 2:29 PM, Benjamin Roth 
> wrote:
>
>> Thanks guys!
>>
>> Good to know, that my approach is basically right, but I will check that
>> lucene indices by time.
>>
>> 2016-10-04 14:22 GMT+02:00 DuyHai Doan :
>>
>>> "What scatter/gather? "
>>>
>>> http://www.slideshare.net/doanduyhai/sasi-cassandra-on-the-f
>>> ull-text-search-ride-voxxed-daybelgrade-2016/23
>>>
>>> "If you partition your data by user_id then you query only 1 shard to
>>> get sorted by time visitors for a user"
>>>
>>> Exact, but in this case, you're using a 2nd index only for sorting right
>>> ? For SASI it's not even possible. Maybe it can work with Statrio Lucene
>>> impl
>>>
>>> On Tue, Oct 4, 2016 at 2:15 PM, Dorian Hoxha 
>>> wrote:
>>>
 @DuyHai

 What scatter/gather? If you partition your data by user_id then you
 query only 1 shard to get sorted by time visitors for a user.

 On Tue, Oct 4, 2016 at 2:09 PM, DuyHai Doan 
 wrote:

> MV is right now your best choice for this kind of sorting behavior.
>
> Secondary index (whatever the impl, SASI or Lucene) has a cost of
> scatter-gather if your cluster scale out. With MV you're at least
> guaranteed to hit a single node everytime
>
> On Tue, Oct 4, 2016 at 1:56 PM, Dorian Hoxha 
> wrote:
>
>> Can you use the lucene index https://github.com/Stratio/cas
>> sandra-lucene-index ?
>>
>> On Tue, Oct 4, 2016 at 1:27 PM, Benjamin Roth <
>> benjamin.r...@jaumo.com> wrote:
>>
>>> Hi!
>>>
>>> I have a frequently used pattern which seems to be quite costly in
>>> CS. The pattern is always the same: I have a unique key and a sorting 
>>> by a
>>> different field.
>>>
>>> To give an example, here a real life example from our model:
>>> CREATE TABLE visits.visits_in (
>>> user_id int,
>>> user_id_visitor int,
>>> created timestamp,
>>> PRIMARY KEY (user_id, user_id_visitor)
>>> ) WITH CLUSTERING ORDER BY (user_id_visitor ASC)
>>>
>>> CREATE MATERIALIZED VIEW visits.visits_in_sorted_mv AS
>>> SELECT user_id, created, user_id_visitor
>>> FROM visits.visits_in
>>> WHERE user_id IS NOT NULL AND created IS NOT NULL AND
>>> user_id_visitor IS NOT NULL
>>> PRIMARY KEY (user_id, created, user_id_visitor)
>>> WITH CLUSTERING ORDER BY (created DESC, user_id_visitor DESC)
>>>
>>> This simply represents people, that visited my profile sorted by
>>> date desc but only one entry per visitor.
>>> Other examples with the same pattern could be a whats-app-like inbox
>>> where the last message of each sender is shown by date desc. There are 
>>> lots
>>> of examples for that pattern.
>>>
>>> E.g. in redis I'd just use a sorted set, where the key could be like
>>> "visits_${user_id}", set key would be user_id_visitor and score
>>> the created timestamp.
>>> In MySQL I'd create the table with PK on user_id + user_id_visitor
>>> and create an index on user_id + created
>>> In C* i use an MV.
>>>
>>> Is this the most efficient approach?
>>> I also could have done this without an MV but then the situation in
>>> our app would be far more complex.
>>> I know that denormalization is a common pattern in C* and I don't
>>> hesitate to use it but in this case, it is not as simple as it's not an
>>> append-only case but updates have to be handled correctly.
>>> If it is the first visit of a user, it's that simple, just 2 inserts
>>> in base table + denormalized table. But on a 2nd or 3rd visit, the 1st 
>>> or
>>> 2nd visit has to be deleted from the denormalized table before. 
>>> Otherwise
>>> the visit would not be unique any more.
>>> Handling this case without an MV requires a lot more effort, I guess
>>> even more effort than just using an MV.
>>> 1. You need kind of app-side locking to deal with race conditions
>>> 2. Read before write is required to determine if an old record has
>>> to be deleted
>>> 3. At least CL_QUORUM is required to make sure that read before
>>> write is always consistent
>>> 4. Old record has to be deleted on update
>>>
>>> I guess, using an MV here is more efficient as there is less
>>> roundtrip between C* and the app to do all that and the MV does not 

Re: Efficient model for a sorting

2016-10-04 Thread Dorian Hoxha
On lucene you can query+filter+sort on a single shard, so it should be
better than MV/sasi. The index building is a little async though.

On Tue, Oct 4, 2016 at 2:29 PM, Benjamin Roth 
wrote:

> Thanks guys!
>
> Good to know, that my approach is basically right, but I will check that
> lucene indices by time.
>
> 2016-10-04 14:22 GMT+02:00 DuyHai Doan :
>
>> "What scatter/gather? "
>>
>> http://www.slideshare.net/doanduyhai/sasi-cassandra-on-the-
>> full-text-search-ride-voxxed-daybelgrade-2016/23
>>
>> "If you partition your data by user_id then you query only 1 shard to
>> get sorted by time visitors for a user"
>>
>> Exact, but in this case, you're using a 2nd index only for sorting right
>> ? For SASI it's not even possible. Maybe it can work with Statrio Lucene
>> impl
>>
>> On Tue, Oct 4, 2016 at 2:15 PM, Dorian Hoxha 
>> wrote:
>>
>>> @DuyHai
>>>
>>> What scatter/gather? If you partition your data by user_id then you
>>> query only 1 shard to get sorted by time visitors for a user.
>>>
>>> On Tue, Oct 4, 2016 at 2:09 PM, DuyHai Doan 
>>> wrote:
>>>
 MV is right now your best choice for this kind of sorting behavior.

 Secondary index (whatever the impl, SASI or Lucene) has a cost of
 scatter-gather if your cluster scale out. With MV you're at least
 guaranteed to hit a single node everytime

 On Tue, Oct 4, 2016 at 1:56 PM, Dorian Hoxha 
 wrote:

> Can you use the lucene index https://github.com/Stratio/cas
> sandra-lucene-index ?
>
> On Tue, Oct 4, 2016 at 1:27 PM, Benjamin Roth  > wrote:
>
>> Hi!
>>
>> I have a frequently used pattern which seems to be quite costly in
>> CS. The pattern is always the same: I have a unique key and a sorting by 
>> a
>> different field.
>>
>> To give an example, here a real life example from our model:
>> CREATE TABLE visits.visits_in (
>> user_id int,
>> user_id_visitor int,
>> created timestamp,
>> PRIMARY KEY (user_id, user_id_visitor)
>> ) WITH CLUSTERING ORDER BY (user_id_visitor ASC)
>>
>> CREATE MATERIALIZED VIEW visits.visits_in_sorted_mv AS
>> SELECT user_id, created, user_id_visitor
>> FROM visits.visits_in
>> WHERE user_id IS NOT NULL AND created IS NOT NULL AND
>> user_id_visitor IS NOT NULL
>> PRIMARY KEY (user_id, created, user_id_visitor)
>> WITH CLUSTERING ORDER BY (created DESC, user_id_visitor DESC)
>>
>> This simply represents people, that visited my profile sorted by date
>> desc but only one entry per visitor.
>> Other examples with the same pattern could be a whats-app-like inbox
>> where the last message of each sender is shown by date desc. There are 
>> lots
>> of examples for that pattern.
>>
>> E.g. in redis I'd just use a sorted set, where the key could be like
>> "visits_${user_id}", set key would be user_id_visitor and score
>> the created timestamp.
>> In MySQL I'd create the table with PK on user_id + user_id_visitor
>> and create an index on user_id + created
>> In C* i use an MV.
>>
>> Is this the most efficient approach?
>> I also could have done this without an MV but then the situation in
>> our app would be far more complex.
>> I know that denormalization is a common pattern in C* and I don't
>> hesitate to use it but in this case, it is not as simple as it's not an
>> append-only case but updates have to be handled correctly.
>> If it is the first visit of a user, it's that simple, just 2 inserts
>> in base table + denormalized table. But on a 2nd or 3rd visit, the 1st or
>> 2nd visit has to be deleted from the denormalized table before. Otherwise
>> the visit would not be unique any more.
>> Handling this case without an MV requires a lot more effort, I guess
>> even more effort than just using an MV.
>> 1. You need kind of app-side locking to deal with race conditions
>> 2. Read before write is required to determine if an old record has to
>> be deleted
>> 3. At least CL_QUORUM is required to make sure that read before write
>> is always consistent
>> 4. Old record has to be deleted on update
>>
>> I guess, using an MV here is more efficient as there is less
>> roundtrip between C* and the app to do all that and the MV does not 
>> require
>> strong consistency as MV updates are always local and are eventual
>> consistent when the base table is. So there is also no need for 
>> distributed
>> locks.
>>
>> I ask all this as we now use CS 3.x and have been advised that 3.x is
>> still not considered really production ready.
>>
>> I guess in a perfect world, this wouldn't even require an MV if SASI
>> indexes could be 

Re: Efficient model for a sorting

2016-10-04 Thread Benjamin Roth
Thanks guys!

Good to know, that my approach is basically right, but I will check that
lucene indices by time.

2016-10-04 14:22 GMT+02:00 DuyHai Doan :

> "What scatter/gather? "
>
> http://www.slideshare.net/doanduyhai/sasi-cassandra-on-
> the-full-text-search-ride-voxxed-daybelgrade-2016/23
>
> "If you partition your data by user_id then you query only 1 shard to get
> sorted by time visitors for a user"
>
> Exact, but in this case, you're using a 2nd index only for sorting right ?
> For SASI it's not even possible. Maybe it can work with Statrio Lucene impl
>
> On Tue, Oct 4, 2016 at 2:15 PM, Dorian Hoxha 
> wrote:
>
>> @DuyHai
>>
>> What scatter/gather? If you partition your data by user_id then you query
>> only 1 shard to get sorted by time visitors for a user.
>>
>> On Tue, Oct 4, 2016 at 2:09 PM, DuyHai Doan  wrote:
>>
>>> MV is right now your best choice for this kind of sorting behavior.
>>>
>>> Secondary index (whatever the impl, SASI or Lucene) has a cost of
>>> scatter-gather if your cluster scale out. With MV you're at least
>>> guaranteed to hit a single node everytime
>>>
>>> On Tue, Oct 4, 2016 at 1:56 PM, Dorian Hoxha 
>>> wrote:
>>>
 Can you use the lucene index https://github.com/Stratio/cas
 sandra-lucene-index ?

 On Tue, Oct 4, 2016 at 1:27 PM, Benjamin Roth 
 wrote:

> Hi!
>
> I have a frequently used pattern which seems to be quite costly in CS.
> The pattern is always the same: I have a unique key and a sorting by a
> different field.
>
> To give an example, here a real life example from our model:
> CREATE TABLE visits.visits_in (
> user_id int,
> user_id_visitor int,
> created timestamp,
> PRIMARY KEY (user_id, user_id_visitor)
> ) WITH CLUSTERING ORDER BY (user_id_visitor ASC)
>
> CREATE MATERIALIZED VIEW visits.visits_in_sorted_mv AS
> SELECT user_id, created, user_id_visitor
> FROM visits.visits_in
> WHERE user_id IS NOT NULL AND created IS NOT NULL AND
> user_id_visitor IS NOT NULL
> PRIMARY KEY (user_id, created, user_id_visitor)
> WITH CLUSTERING ORDER BY (created DESC, user_id_visitor DESC)
>
> This simply represents people, that visited my profile sorted by date
> desc but only one entry per visitor.
> Other examples with the same pattern could be a whats-app-like inbox
> where the last message of each sender is shown by date desc. There are 
> lots
> of examples for that pattern.
>
> E.g. in redis I'd just use a sorted set, where the key could be like
> "visits_${user_id}", set key would be user_id_visitor and score
> the created timestamp.
> In MySQL I'd create the table with PK on user_id + user_id_visitor and
> create an index on user_id + created
> In C* i use an MV.
>
> Is this the most efficient approach?
> I also could have done this without an MV but then the situation in
> our app would be far more complex.
> I know that denormalization is a common pattern in C* and I don't
> hesitate to use it but in this case, it is not as simple as it's not an
> append-only case but updates have to be handled correctly.
> If it is the first visit of a user, it's that simple, just 2 inserts
> in base table + denormalized table. But on a 2nd or 3rd visit, the 1st or
> 2nd visit has to be deleted from the denormalized table before. Otherwise
> the visit would not be unique any more.
> Handling this case without an MV requires a lot more effort, I guess
> even more effort than just using an MV.
> 1. You need kind of app-side locking to deal with race conditions
> 2. Read before write is required to determine if an old record has to
> be deleted
> 3. At least CL_QUORUM is required to make sure that read before write
> is always consistent
> 4. Old record has to be deleted on update
>
> I guess, using an MV here is more efficient as there is less roundtrip
> between C* and the app to do all that and the MV does not require strong
> consistency as MV updates are always local and are eventual consistent 
> when
> the base table is. So there is also no need for distributed locks.
>
> I ask all this as we now use CS 3.x and have been advised that 3.x is
> still not considered really production ready.
>
> I guess in a perfect world, this wouldn't even require an MV if SASI
> indexes could be created over more than 1 column. E.g. in MySQL this case
> is nothing else than a BTree. AFAIK SASI indices are also BTrees, 
> filtering
> by Partition Key (which should to be done anyway) and sorting by a field
> would perfectly do the trick. But from the docs, this is not possible 
> right
> now.
>
> Does anyone see a better 

Re: Efficient model for a sorting

2016-10-04 Thread DuyHai Doan
"What scatter/gather? "

http://www.slideshare.net/doanduyhai/sasi-cassandra-on-the-full-text-search-ride-voxxed-daybelgrade-2016/23

"If you partition your data by user_id then you query only 1 shard to get
sorted by time visitors for a user"

Exact, but in this case, you're using a 2nd index only for sorting right ?
For SASI it's not even possible. Maybe it can work with Statrio Lucene impl

On Tue, Oct 4, 2016 at 2:15 PM, Dorian Hoxha  wrote:

> @DuyHai
>
> What scatter/gather? If you partition your data by user_id then you query
> only 1 shard to get sorted by time visitors for a user.
>
> On Tue, Oct 4, 2016 at 2:09 PM, DuyHai Doan  wrote:
>
>> MV is right now your best choice for this kind of sorting behavior.
>>
>> Secondary index (whatever the impl, SASI or Lucene) has a cost of
>> scatter-gather if your cluster scale out. With MV you're at least
>> guaranteed to hit a single node everytime
>>
>> On Tue, Oct 4, 2016 at 1:56 PM, Dorian Hoxha 
>> wrote:
>>
>>> Can you use the lucene index https://github.com/Stratio/cas
>>> sandra-lucene-index ?
>>>
>>> On Tue, Oct 4, 2016 at 1:27 PM, Benjamin Roth 
>>> wrote:
>>>
 Hi!

 I have a frequently used pattern which seems to be quite costly in CS.
 The pattern is always the same: I have a unique key and a sorting by a
 different field.

 To give an example, here a real life example from our model:
 CREATE TABLE visits.visits_in (
 user_id int,
 user_id_visitor int,
 created timestamp,
 PRIMARY KEY (user_id, user_id_visitor)
 ) WITH CLUSTERING ORDER BY (user_id_visitor ASC)

 CREATE MATERIALIZED VIEW visits.visits_in_sorted_mv AS
 SELECT user_id, created, user_id_visitor
 FROM visits.visits_in
 WHERE user_id IS NOT NULL AND created IS NOT NULL AND
 user_id_visitor IS NOT NULL
 PRIMARY KEY (user_id, created, user_id_visitor)
 WITH CLUSTERING ORDER BY (created DESC, user_id_visitor DESC)

 This simply represents people, that visited my profile sorted by date
 desc but only one entry per visitor.
 Other examples with the same pattern could be a whats-app-like inbox
 where the last message of each sender is shown by date desc. There are lots
 of examples for that pattern.

 E.g. in redis I'd just use a sorted set, where the key could be like
 "visits_${user_id}", set key would be user_id_visitor and score
 the created timestamp.
 In MySQL I'd create the table with PK on user_id + user_id_visitor and
 create an index on user_id + created
 In C* i use an MV.

 Is this the most efficient approach?
 I also could have done this without an MV but then the situation in our
 app would be far more complex.
 I know that denormalization is a common pattern in C* and I don't
 hesitate to use it but in this case, it is not as simple as it's not an
 append-only case but updates have to be handled correctly.
 If it is the first visit of a user, it's that simple, just 2 inserts in
 base table + denormalized table. But on a 2nd or 3rd visit, the 1st or 2nd
 visit has to be deleted from the denormalized table before. Otherwise the
 visit would not be unique any more.
 Handling this case without an MV requires a lot more effort, I guess
 even more effort than just using an MV.
 1. You need kind of app-side locking to deal with race conditions
 2. Read before write is required to determine if an old record has to
 be deleted
 3. At least CL_QUORUM is required to make sure that read before write
 is always consistent
 4. Old record has to be deleted on update

 I guess, using an MV here is more efficient as there is less roundtrip
 between C* and the app to do all that and the MV does not require strong
 consistency as MV updates are always local and are eventual consistent when
 the base table is. So there is also no need for distributed locks.

 I ask all this as we now use CS 3.x and have been advised that 3.x is
 still not considered really production ready.

 I guess in a perfect world, this wouldn't even require an MV if SASI
 indexes could be created over more than 1 column. E.g. in MySQL this case
 is nothing else than a BTree. AFAIK SASI indices are also BTrees, filtering
 by Partition Key (which should to be done anyway) and sorting by a field
 would perfectly do the trick. But from the docs, this is not possible right
 now.

 Does anyone see a better solution or are all my assumptions correct?

 --
 Benjamin Roth
 Prokurist

 Jaumo GmbH · www.jaumo.com
 Wehrstraße 46 · 73035 Göppingen · Germany
 Phone +49 7161 304880-6 · Fax +49 7161 304880-1
 AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

>>>
>>>
>>
>


Re: Efficient model for a sorting

2016-10-04 Thread Dorian Hoxha
@DuyHai

What scatter/gather? If you partition your data by user_id then you query
only 1 shard to get sorted by time visitors for a user.

On Tue, Oct 4, 2016 at 2:09 PM, DuyHai Doan  wrote:

> MV is right now your best choice for this kind of sorting behavior.
>
> Secondary index (whatever the impl, SASI or Lucene) has a cost of
> scatter-gather if your cluster scale out. With MV you're at least
> guaranteed to hit a single node everytime
>
> On Tue, Oct 4, 2016 at 1:56 PM, Dorian Hoxha 
> wrote:
>
>> Can you use the lucene index https://github.com/Stratio/cas
>> sandra-lucene-index ?
>>
>> On Tue, Oct 4, 2016 at 1:27 PM, Benjamin Roth 
>> wrote:
>>
>>> Hi!
>>>
>>> I have a frequently used pattern which seems to be quite costly in CS.
>>> The pattern is always the same: I have a unique key and a sorting by a
>>> different field.
>>>
>>> To give an example, here a real life example from our model:
>>> CREATE TABLE visits.visits_in (
>>> user_id int,
>>> user_id_visitor int,
>>> created timestamp,
>>> PRIMARY KEY (user_id, user_id_visitor)
>>> ) WITH CLUSTERING ORDER BY (user_id_visitor ASC)
>>>
>>> CREATE MATERIALIZED VIEW visits.visits_in_sorted_mv AS
>>> SELECT user_id, created, user_id_visitor
>>> FROM visits.visits_in
>>> WHERE user_id IS NOT NULL AND created IS NOT NULL AND
>>> user_id_visitor IS NOT NULL
>>> PRIMARY KEY (user_id, created, user_id_visitor)
>>> WITH CLUSTERING ORDER BY (created DESC, user_id_visitor DESC)
>>>
>>> This simply represents people, that visited my profile sorted by date
>>> desc but only one entry per visitor.
>>> Other examples with the same pattern could be a whats-app-like inbox
>>> where the last message of each sender is shown by date desc. There are lots
>>> of examples for that pattern.
>>>
>>> E.g. in redis I'd just use a sorted set, where the key could be like
>>> "visits_${user_id}", set key would be user_id_visitor and score
>>> the created timestamp.
>>> In MySQL I'd create the table with PK on user_id + user_id_visitor and
>>> create an index on user_id + created
>>> In C* i use an MV.
>>>
>>> Is this the most efficient approach?
>>> I also could have done this without an MV but then the situation in our
>>> app would be far more complex.
>>> I know that denormalization is a common pattern in C* and I don't
>>> hesitate to use it but in this case, it is not as simple as it's not an
>>> append-only case but updates have to be handled correctly.
>>> If it is the first visit of a user, it's that simple, just 2 inserts in
>>> base table + denormalized table. But on a 2nd or 3rd visit, the 1st or 2nd
>>> visit has to be deleted from the denormalized table before. Otherwise the
>>> visit would not be unique any more.
>>> Handling this case without an MV requires a lot more effort, I guess
>>> even more effort than just using an MV.
>>> 1. You need kind of app-side locking to deal with race conditions
>>> 2. Read before write is required to determine if an old record has to be
>>> deleted
>>> 3. At least CL_QUORUM is required to make sure that read before write is
>>> always consistent
>>> 4. Old record has to be deleted on update
>>>
>>> I guess, using an MV here is more efficient as there is less roundtrip
>>> between C* and the app to do all that and the MV does not require strong
>>> consistency as MV updates are always local and are eventual consistent when
>>> the base table is. So there is also no need for distributed locks.
>>>
>>> I ask all this as we now use CS 3.x and have been advised that 3.x is
>>> still not considered really production ready.
>>>
>>> I guess in a perfect world, this wouldn't even require an MV if SASI
>>> indexes could be created over more than 1 column. E.g. in MySQL this case
>>> is nothing else than a BTree. AFAIK SASI indices are also BTrees, filtering
>>> by Partition Key (which should to be done anyway) and sorting by a field
>>> would perfectly do the trick. But from the docs, this is not possible right
>>> now.
>>>
>>> Does anyone see a better solution or are all my assumptions correct?
>>>
>>> --
>>> Benjamin Roth
>>> Prokurist
>>>
>>> Jaumo GmbH · www.jaumo.com
>>> Wehrstraße 46 · 73035 Göppingen · Germany
>>> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
>>> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>>>
>>
>>
>


Re: Efficient model for a sorting

2016-10-04 Thread DuyHai Doan
MV is right now your best choice for this kind of sorting behavior.

Secondary index (whatever the impl, SASI or Lucene) has a cost of
scatter-gather if your cluster scale out. With MV you're at least
guaranteed to hit a single node everytime

On Tue, Oct 4, 2016 at 1:56 PM, Dorian Hoxha  wrote:

> Can you use the lucene index https://github.com/Stratio/
> cassandra-lucene-index ?
>
> On Tue, Oct 4, 2016 at 1:27 PM, Benjamin Roth 
> wrote:
>
>> Hi!
>>
>> I have a frequently used pattern which seems to be quite costly in CS.
>> The pattern is always the same: I have a unique key and a sorting by a
>> different field.
>>
>> To give an example, here a real life example from our model:
>> CREATE TABLE visits.visits_in (
>> user_id int,
>> user_id_visitor int,
>> created timestamp,
>> PRIMARY KEY (user_id, user_id_visitor)
>> ) WITH CLUSTERING ORDER BY (user_id_visitor ASC)
>>
>> CREATE MATERIALIZED VIEW visits.visits_in_sorted_mv AS
>> SELECT user_id, created, user_id_visitor
>> FROM visits.visits_in
>> WHERE user_id IS NOT NULL AND created IS NOT NULL AND user_id_visitor
>> IS NOT NULL
>> PRIMARY KEY (user_id, created, user_id_visitor)
>> WITH CLUSTERING ORDER BY (created DESC, user_id_visitor DESC)
>>
>> This simply represents people, that visited my profile sorted by date
>> desc but only one entry per visitor.
>> Other examples with the same pattern could be a whats-app-like inbox
>> where the last message of each sender is shown by date desc. There are lots
>> of examples for that pattern.
>>
>> E.g. in redis I'd just use a sorted set, where the key could be like
>> "visits_${user_id}", set key would be user_id_visitor and score
>> the created timestamp.
>> In MySQL I'd create the table with PK on user_id + user_id_visitor and
>> create an index on user_id + created
>> In C* i use an MV.
>>
>> Is this the most efficient approach?
>> I also could have done this without an MV but then the situation in our
>> app would be far more complex.
>> I know that denormalization is a common pattern in C* and I don't
>> hesitate to use it but in this case, it is not as simple as it's not an
>> append-only case but updates have to be handled correctly.
>> If it is the first visit of a user, it's that simple, just 2 inserts in
>> base table + denormalized table. But on a 2nd or 3rd visit, the 1st or 2nd
>> visit has to be deleted from the denormalized table before. Otherwise the
>> visit would not be unique any more.
>> Handling this case without an MV requires a lot more effort, I guess even
>> more effort than just using an MV.
>> 1. You need kind of app-side locking to deal with race conditions
>> 2. Read before write is required to determine if an old record has to be
>> deleted
>> 3. At least CL_QUORUM is required to make sure that read before write is
>> always consistent
>> 4. Old record has to be deleted on update
>>
>> I guess, using an MV here is more efficient as there is less roundtrip
>> between C* and the app to do all that and the MV does not require strong
>> consistency as MV updates are always local and are eventual consistent when
>> the base table is. So there is also no need for distributed locks.
>>
>> I ask all this as we now use CS 3.x and have been advised that 3.x is
>> still not considered really production ready.
>>
>> I guess in a perfect world, this wouldn't even require an MV if SASI
>> indexes could be created over more than 1 column. E.g. in MySQL this case
>> is nothing else than a BTree. AFAIK SASI indices are also BTrees, filtering
>> by Partition Key (which should to be done anyway) and sorting by a field
>> would perfectly do the trick. But from the docs, this is not possible right
>> now.
>>
>> Does anyone see a better solution or are all my assumptions correct?
>>
>> --
>> Benjamin Roth
>> Prokurist
>>
>> Jaumo GmbH · www.jaumo.com
>> Wehrstraße 46 · 73035 Göppingen · Germany
>> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
>> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>>
>
>


Re: Efficient model for a sorting

2016-10-04 Thread Dorian Hoxha
Can you use the lucene index
https://github.com/Stratio/cassandra-lucene-index ?

On Tue, Oct 4, 2016 at 1:27 PM, Benjamin Roth 
wrote:

> Hi!
>
> I have a frequently used pattern which seems to be quite costly in CS. The
> pattern is always the same: I have a unique key and a sorting by a
> different field.
>
> To give an example, here a real life example from our model:
> CREATE TABLE visits.visits_in (
> user_id int,
> user_id_visitor int,
> created timestamp,
> PRIMARY KEY (user_id, user_id_visitor)
> ) WITH CLUSTERING ORDER BY (user_id_visitor ASC)
>
> CREATE MATERIALIZED VIEW visits.visits_in_sorted_mv AS
> SELECT user_id, created, user_id_visitor
> FROM visits.visits_in
> WHERE user_id IS NOT NULL AND created IS NOT NULL AND user_id_visitor
> IS NOT NULL
> PRIMARY KEY (user_id, created, user_id_visitor)
> WITH CLUSTERING ORDER BY (created DESC, user_id_visitor DESC)
>
> This simply represents people, that visited my profile sorted by date desc
> but only one entry per visitor.
> Other examples with the same pattern could be a whats-app-like inbox where
> the last message of each sender is shown by date desc. There are lots of
> examples for that pattern.
>
> E.g. in redis I'd just use a sorted set, where the key could be like
> "visits_${user_id}", set key would be user_id_visitor and score
> the created timestamp.
> In MySQL I'd create the table with PK on user_id + user_id_visitor and
> create an index on user_id + created
> In C* i use an MV.
>
> Is this the most efficient approach?
> I also could have done this without an MV but then the situation in our
> app would be far more complex.
> I know that denormalization is a common pattern in C* and I don't hesitate
> to use it but in this case, it is not as simple as it's not an append-only
> case but updates have to be handled correctly.
> If it is the first visit of a user, it's that simple, just 2 inserts in
> base table + denormalized table. But on a 2nd or 3rd visit, the 1st or 2nd
> visit has to be deleted from the denormalized table before. Otherwise the
> visit would not be unique any more.
> Handling this case without an MV requires a lot more effort, I guess even
> more effort than just using an MV.
> 1. You need kind of app-side locking to deal with race conditions
> 2. Read before write is required to determine if an old record has to be
> deleted
> 3. At least CL_QUORUM is required to make sure that read before write is
> always consistent
> 4. Old record has to be deleted on update
>
> I guess, using an MV here is more efficient as there is less roundtrip
> between C* and the app to do all that and the MV does not require strong
> consistency as MV updates are always local and are eventual consistent when
> the base table is. So there is also no need for distributed locks.
>
> I ask all this as we now use CS 3.x and have been advised that 3.x is
> still not considered really production ready.
>
> I guess in a perfect world, this wouldn't even require an MV if SASI
> indexes could be created over more than 1 column. E.g. in MySQL this case
> is nothing else than a BTree. AFAIK SASI indices are also BTrees, filtering
> by Partition Key (which should to be done anyway) and sorting by a field
> would perfectly do the trick. But from the docs, this is not possible right
> now.
>
> Does anyone see a better solution or are all my assumptions correct?
>
> --
> Benjamin Roth
> Prokurist
>
> Jaumo GmbH · www.jaumo.com
> Wehrstraße 46 · 73035 Göppingen · Germany
> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>