RE: Big Data Question

2023-08-18 Thread Durity, Sean R via user
Cost of availability is a fair question at some level of the discussion. In my 
experience, high availability is one of the top 2 or 3 reasons why Cassandra is 
chosen as the data solution. So, if I am given a Cassandra use case to build 
out, I would normally assume high availability is needed, even in a single data 
center scenario. Otherwise, there are other data options.


Sean R. Durity
DB Solutions
Staff Systems Engineer – Cassandra



From: daemeon reiydelle 
Sent: Thursday, August 17, 2023 7:38 PM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Big Data Question

I started to respond, then realized I and the other OP posters are not thinking 
the same: What is the business case for availability and data 
loss/reload/recoverability? You all argue for higher availability and damn the 
cost. But no one asked "can you lose access, for 20 minutes, to a portion of 
the data, 10 times a year, on a 250 node cluster in AWS, if it is not lost?" 
Or would you rather lose access only 1-2 times a year, for the cost of a 500 
node cluster holding the same data?

Then we can discuss 32/64G JVM heaps and SSDs.

Arthur C. Clarke famously said that "any sufficiently advanced technology is 
indistinguishable from magic." Magic is coming, and it's coming for all of us.

Daemeon Reiydelle
email: daeme...@gmail.com
LI: https://www.linkedin.com/in/daemeonreiydelle/
San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle


On Thu, Aug 17, 2023 at 1:53 PM Joe Obernberger 
<joseph.obernber...@gmail.com> wrote:
Was assuming Reaper did incremental?  That was probably a bad assumption.

nodetool repair -pr
I know it well now!

:)

-Joe
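For reference, the repair variants under discussion can be sketched as shell 
commands. This is a sketch, not a runbook: `my_keyspace` is a placeholder, 
defaults vary by Cassandra version, and in recent versions a plain `nodetool 
repair` runs an incremental repair.

```shell
# Incremental repair: only touches data not yet marked as repaired, so
# routine runs are much faster (default mode in recent Cassandra versions).
nodetool repair my_keyspace

# Occasional full repair, limited to each node's primary ranges (-pr),
# so each range is repaired once when the command is run on every node.
nodetool repair --full -pr my_keyspace
```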

On 8/17/2023 4:47 PM, Bowen Song via user wrote:
> I don't have experience with Cassandra on Kubernetes, so I can't
> comment on that.
>
> For repairs, may I interest you with incremental repairs? It will make
> repairs hell of a lot faster. Of course, occasional full repair is
> still needed, but that's another story.
>
>
> On 17/08/2023 21:36, Joe Obernberger wrote:
>> Thank you.  Enjoying this conversation.
>> Agree on blade servers, where each blade has a small number of SSDs.
>> Yeh/Nah to a kubernetes approach assuming fast persistent storage?  I
>> think that might be easier to manage.
>>
>> In my current benchmarks, the performance is excellent, but the
>> repairs are painful.  I come from the Hadoop world where it was all
>> about large servers with lots of disk.
>> Relatively small number of tables, but some have a high number of
>> rows, 10bil + - we use spark to run across all the data.
>>
>> -Joe
>>
>> On 8/17/2023 12:13 PM, Bowen Song via user wrote:
>>> The optimal node size largely depends on the table schema and
>>> read/write pattern. In some cases 500 GB per node is too large, but
>>> in some other cases 10TB per node works totally fine. It's hard to
>>> estimate that without benchmarking.
>>>
>>> Again, just pointing out the obvious, you did not count the off-heap
>>> memory and page cache. 1TB of RAM for 24GB heap * 40 instances is
>>> definitely not enough. You'll most likely need between 1.5 and 2 TB
>>> memory for 40x 24GB heap nodes. You may be better off with blade
>>> servers than single server with gigantic memory and disk sizes.
>>>
>>>
>>> On 17/08/2023 15:46, Joe Obernberger wrote:
 Thanks for this - yeah - duh - forgot about replication in my example!
 So - is 2TBytes per Cassandra instance advisable?  Better to use
 more/less?  Modern 2U servers can be had with 24 3.8TByte SSDs; so
 assume 80TBytes per server, you could do:
 (1024*3)/80 = 39 servers, but you'd have to run 40 instances of
 Cassandra on each server; maybe 24G of heap per instance, so a
 server with 1TByte of RAM would work.
 Is this what folks would do?

 -Joe

 On 8/17/2023 9:13 AM, Bowen Song via user wrote:
> Just pointing out the obvious, for 1PB of data on nodes with 2TB
> disk each, you will need far more than 500 nodes.
>
> 1, it is unwise to run Cassandra with replication factor 1. It
> usually makes sense to use RF=3, so 1PB of data will cost 3PB of
> storage space, a minimum of 1500 such nodes.
>
> 2, depending on the compaction strategy you use and the write
> access pattern, there's a disk space amplification to consider.
> For example, with STCS, the disk usage can be many times of the
> actual live data size.
>
> 3, you will need some extra free disk space as temporary space for
> running compactions.
>
> 4, the data 
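The sizing arithmetic from this exchange can be put in one place. A 
back-of-the-envelope sketch; the 50% free-space headroom for compaction is an 
assumed illustrative figure, not a measured one.

```python
import math

raw_data_tb = 1024        # 1 PB of unique data, in TB
rf = 3                    # replication factor
disk_per_node_tb = 2      # disk per node (or per Cassandra instance)
usable_fraction = 0.5     # assumed: keep ~50% free for compaction temp space

total_stored_tb = raw_data_tb * rf                 # 3072 TB on disk with RF=3
min_nodes = math.ceil(total_stored_tb / disk_per_node_tb)
nodes_with_headroom = math.ceil(
    total_stored_tb / (disk_per_node_tb * usable_fraction))
print(min_nodes, nodes_with_headroom)    # 1536 3072

# Joe's dense-server variant: 24 x 3.8 TB SSDs, call it 80 TB per server.
servers = math.ceil(total_stored_tb / 80)          # 39 servers
instances_per_server = 80 // disk_per_node_tb      # 40 Cassandra instances
print(servers, instances_per_server)     # 39 40
```

This matches the numbers in the thread: "minimum of 1500 such nodes" (1536 at 
100% disk usage, before any headroom) and (1024*3)/80 = 39 dense servers.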

Re: Materialized View inconsistency issue

2023-08-18 Thread Miklosovic, Stefan
Well you could always do it like this

cqlsh> CREATE TABLE dating.visits2 (user_id int, visitor_id int, visit_month 
int, visit_date int, primary key (user_id, visitor_id, visit_month)) WITH 
CLUSTERING ORDER BY (visitor_id ASC, visit_month DESC );

This means that, with 6 months, you have at most 6 entries per user and 
visitor pair. If your primary key is user_id, visitor_id and visit_month, then 
the clustering columns are visitor_id and visit_month, and visit_month is in 
descending order.

// user 300 visits user 100 in august (8) on some specific timestamp
cqlsh> insert into dating.visits2 (user_id , visitor_id , visit_month , 
visit_date ) VALUES ( 100, 300, 8, 123);

// user 200 visits 100 in July and June on some timestamps.
cqlsh> insert into dating.visits2 (user_id , visitor_id , visit_month , 
visit_date ) VALUES ( 100, 200, 7, 456);
cqlsh> insert into dating.visits2 (user_id , visitor_id , visit_month , 
visit_date ) VALUES ( 100, 200, 6, 456);

cqlsh> select * from dating.visits2 WHERE user_id = 100 and visitor_id = 200;

 user_id | visitor_id | visit_month | visit_date
---------+------------+-------------+------------
     100 |        200 |           7 |        456
     100 |        200 |           6 |        456

(2 rows)


This is the most important query. You always get it sorted by month, latest 
month on top, with the visit date.

cqlsh> select * from dating.visits2 WHERE user_id = 100 and visitor_id = 200 
limit 1;

 user_id | visitor_id | visit_month | visit_date
---------+------------+-------------+------------
     100 |        200 |           7 |        456

The trick is that if somebody visits that user again later in July 
(visit_month 7), the row gets overwritten because the whole primary key is the 
same:

cqlsh> insert into dating.visits2 (user_id , visitor_id , visit_month , 
visit_date ) VALUES ( 100, 200, 7, 12345);
cqlsh> select * from dating.visits2 WHERE user_id = 100 and visitor_id = 200 
limit 1;

 user_id | visitor_id | visit_month | visit_date
---------+------------+-------------+------------
     100 |        200 |           7 |      12345

So you will have at most one entry per month, 6 entries for 6 months, and each 
entry always tells you the most recent visit in that month.



From: Regis Le Bretonnic 
Sent: Friday, August 18, 2023 11:30
To: user@cassandra.apache.org
Subject: Re: Materialized View inconsistency issue




What you propose is another debate.

Most of the time there is a product department and a tech department (I'm sure 
it is your case at netapp)... I'd like to have a voice loud enough to influence 
product requirements, but that is not the way it works. I'm paid to make 
miracles, not to explain to the Director of Product that he cannot do whatever 
he wants...
I know that "6 months" is arbitrary and a shorter period could simplify 
things... but basically it is a compromise I cannot challenge.

- 1 month is not enough, for different reasons:
 - long enough for a "jet fighter" who receives many visits per month... 
but not long enough for people who receive 4 visits per month (because they 
live in a low density area, or for other reasons). This has a psychological 
impact that directly influences the experience (and revenue).
 - you can suspend an account, for instance because you are on holiday... 
and when you come back, the list of visits received would be empty. This also 
has a psychological impact (also impacting revenue).
- 1 year is probably too long...

The compromise with the product team is 6 months and I cannot change that even 
if it is stupid.

I am sure that most readers of this forum are technical folks who are in the 
same situation as me.
Let's stay on the technical point of view...


On Fri, Aug 18, 2023 at 10:48, Miklosovic, Stefan 
<stefan.mikloso...@netapp.com> wrote:
The 2 tables you propose Stefan can not natively order rows by time (they will 
be ordered by visitor_id), except if you sort rows after the select.

So what? I think this is way better than dealing with MVs, which will 
eventually become inconsistent. Do you want a broken MV, or do you want to 
sort on the client? Which is better?

The table will be like this

cqlsh> select * from dating.visits_by_visitor_id ;

 user_id | visitor_id
---------+------------
     100 |        200
     100 |        300

(2 rows)
cqlsh> select * from dating.visits;

 user_id | visitor_id | visit_date
---------+------------+------------
     100 |        300 |          3
     100 |        200 |          5
     100 |        200 |          2
     100 |        200 |          1

(4 rows)
cqlsh>

Now if you iterate over 100 and 200 and 

Re: Materialized View inconsistency issue

2023-08-18 Thread Regis Le Bretonnic
What you propose is another debate.

Most of the time there is a product department and a tech department (I'm
sure it is your case at netapp)... I'd like to have a voice loud enough to
influence product requirements, but that is not the way it works. I'm paid to
make miracles, not to explain to the Director of Product that he cannot do
whatever he wants...
I know that "6 months" is arbitrary and a shorter period could simplify
things... but basically it is a compromise I cannot challenge.

- 1 month is not enough, for different reasons:
 - long enough for a "jet fighter" who receives many visits per
month... but not long enough for people who receive 4 visits per month
(because they live in a low density area, or for other reasons). This has a
psychological impact that directly influences the experience (and revenue).
 - you can suspend an account, for instance because you are on
holiday... and when you come back, the list of visits received would be
empty. This also has a psychological impact (also impacting revenue).
- 1 year is probably too long...


*The compromise with the product team is 6 months and I cannot change that
even if it is stupid.*
I am sure that most readers of this forum are technical folks who are in
the same situation as me.
Let's stay on the technical point of view...


On Fri, Aug 18, 2023 at 10:48, Miklosovic, Stefan 
<stefan.mikloso...@netapp.com> wrote:

> The 2 tables you propose Stefan can not natively order rows by time (they
> will be ordered by visitor_id), excepted if you sort rows after the select.
>
> So what? I think this is way better than dealing with MV which you will
> get inconsistent eventually. Do you want to have broken MV or you want to
> sort on the client? Which is better?
>
> The table will be like this
>
> cqlsh> select * from dating.visits_by_visitor_id ;
>
>  user_id | visitor_id
> ---------+------------
>  100 |200
>  100 |300
>
> (2 rows)
> cqlsh> select * from dating.visits;
>
>  user_id | visitor_id | visit_date
> ---------+------------+------------
>  100 |300 |  3
>  100 |200 |  5
>  100 |200 |  2
>  100 |200 |  1
>
> (4 rows)
> cqlsh>
>
> Now if you iterate over 100 and 200 and you get limits by 1, you get
> latest results.
>
> Now it might be true that you get the result which is not sorted on
> timestamp but does that really matter? You can always sort it on the client.
>
> The advantage of this approach is that you know all visitors of somebody
> on one query if that ever mattered.
> You also know when somebody was visited by somebody in some period of time
>
> select visit_date from dating.visits where user_id = 100 and visitor_id =
> 200 and visit_date > 3 and visit_date < 8;
>
> Also, I dont know what business logic you have in detail, but why would
> somebody be interested who visited him 6 months ago? What is that
> information good for in practice? Why dont you do it like this?
>
> INSERT INTO dating.visits (user_id , visitor_id, visit_date ) VALUES (
> 100, 300, 60) USING TTL 10;
>
> Use TTL of e.g. 1 month? So rows would start to disappear automatically.
> If somebody visited me 2 months ago and then it disappears next I would not
> care at all. A user who visited me 2 months ago is basically equal to a
> user who has never visited me.
>
>
> 
> From: Regis Le Bretonnic 
> Sent: Friday, August 18, 2023 9:47
> To: user@cassandra.apache.org
> Subject: Re: Materialized View inconsistency issue
>
>
>
>
> Hi Stefan
>
> Happy to see that our use case interest you :-)
> I'm not sure that I explained well what we want.
>
> Imagine that sequence of events :
> - Julia visits Joe at t1
> - Julia visits Joe at t2
> - Karen visits Joe at t3
> - Silvia visits Joe at t4
> - Karen visits Joe at t5
> - Karen visits Joe at t6
> - Julia visits Joe at t7
>
> We want to provide to Joe a webpage listing visits he received in that
> order :
> - Juiia at t7  (the more recent)
> - Karen at t6
> - Silvia at t4
>
> The 2 tables you propose Stefan can not natively order rows by time (they
> will be ordered by visitor_id), excepted if you sort rows after the select.
>
> Keep in mind that some people can received a loot on visits in 6
> months (200,000 or 300,000 deduplicated visits, and much more if you keep
> duplicate visits) and ordering such volume of rows by code is not easy (in
> fact impossible because we use PHP and we can't do that in a FPM memory...)
> ... and of course, because we can not provide in a single page of 200,000
> or 300,000 members stickers in one shot, the webpage requires pagination
> (with lot 

Re: Materialized View inconsistency issue

2023-08-18 Thread Miklosovic, Stefan
The 2 tables you propose Stefan can not natively order rows by time (they will 
be ordered by visitor_id), except if you sort rows after the select.

So what? I think this is way better than dealing with MVs, which will 
eventually become inconsistent. Do you want a broken MV, or do you want to 
sort on the client? Which is better?

The table will be like this

cqlsh> select * from dating.visits_by_visitor_id ;

 user_id | visitor_id
---------+------------
     100 |        200
     100 |        300

(2 rows)
cqlsh> select * from dating.visits;

 user_id | visitor_id | visit_date
---------+------------+------------
     100 |        300 |          3
     100 |        200 |          5
     100 |        200 |          2
     100 |        200 |          1

(4 rows)
cqlsh>

Now if you iterate over visitors 200 and 300 and query each with LIMIT 1, you 
get the latest results.

Now it might be true that the result is not sorted on timestamp, but does that 
really matter? You can always sort it on the client.

The advantage of this approach is that you know all visitors of somebody in 
one query, if that ever matters.
You also know when somebody was visited by somebody in some period of time:

select visit_date from dating.visits where user_id = 100 and visitor_id = 200 
and visit_date > 3 and visit_date < 8;

Also, I don't know what business logic you have in detail, but why would 
somebody be interested in who visited them 6 months ago? What is that 
information good for in practice? Why don't you do it like this?

INSERT INTO dating.visits (user_id , visitor_id, visit_date ) VALUES ( 100, 
300, 60) USING TTL 10;

Use a TTL of e.g. 1 month, so rows would start to disappear automatically. If 
somebody visited me 2 months ago and that row then disappears, I would not 
care at all. A user who visited me 2 months ago is basically equal to a user 
who has never visited me.



From: Regis Le Bretonnic 
Sent: Friday, August 18, 2023 9:47
To: user@cassandra.apache.org
Subject: Re: Materialized View inconsistency issue




Hi Stefan

Happy to see that our use case interests you :-)
I'm not sure that I explained well what we want.

Imagine this sequence of events:
- Julia visits Joe at t1
- Julia visits Joe at t2
- Karen visits Joe at t3
- Silvia visits Joe at t4
- Karen visits Joe at t5
- Karen visits Joe at t6
- Julia visits Joe at t7

We want to provide to Joe a webpage listing the visits he received, in this 
order:
- Julia at t7 (the most recent)
- Karen at t6
- Silvia at t4

The 2 tables you propose Stefan can not natively order rows by time (they will 
be ordered by visitor_id), except if you sort rows after the select.

Keep in mind that some people can receive a lot of visits in 6 months 
(200,000 or 300,000 deduplicated visits, and many more if you keep duplicate 
visits), and ordering such a volume of rows in code is not easy (in fact 
impossible, because we use PHP and we can't do that in FPM memory...)
... and of course, because we can not render 200,000 or 300,000 member 
stickers in a single page in one shot, the webpage requires pagination (with 
lots of 100 profiles per page). If you decide that sorting should be done on 
the code side, the pagination becomes awful to manage.
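For illustration, the date-ordered pagination being discussed can be sketched 
in Python. An in-memory list stands in for the visits_received_by_date view 
(already sorted by visit_date DESC, as the clustering order guarantees), and 
the "remember the last seen visit_date" cursor is one common way to page 
through a date-ordered table; names and sizes are illustrative.

```python
# Stand-in for the visits_received_by_date view for one receiver:
# (sender, visit_date) rows, already ordered by visit_date DESC.
rows = [("u%d" % i, 1000 - i) for i in range(250)]

def page(rows, before_date=None, limit=100):
    """Keyset pagination: WHERE visit_date < before_date LIMIT limit."""
    matching = [r for r in rows if before_date is None or r[1] < before_date]
    return matching[:limit]

page1 = page(rows)                              # first page of 100
page2 = page(rows, before_date=page1[-1][1])    # resume after last seen date
print(len(page1), len(page2))   # 100 100
print(page1[-1][1], page2[0][1])  # 901 900
```

Because each page only needs the last visit_date of the previous page, no page 
ever has to hold or sort the full 200,000+ rows in memory.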

PS 1 : when we decided to do this, MVs were not yet back to experimental
PS 2 : the code to manage a visit received is very easy... we just do an 
insert in the master table without doing any select before... we just don't 
care about what happened in the past...
PS 3 : the pagination is very easy... we just do a
- select * from visits_received_by_date where receiver_id=111 and 
visit_date < ...

Miklosovic, Stefan <stefan.mikloso...@netapp.com> wrote:
Why can't you do it like this? You would have two tables:

create table visits (user_id bigint, visitor_id bigint, visit_date timestamp, 
primary key ((user_id, visitor_id), visit_date)) with clustering order by 
(visit_date desc)

create table visitors_by_user_id (user_id bigint, visitor_id bigint, primary 
key ((user_id), visitor_id))

The logic behind the second table, visitors_by_user_id, is that you do not 
care if a user visited you twice: visitor_id is a clustering column, so if the 
same user visits you twice, the second insert basically does nothing, because 
such an entry is already there.

For example:

user_id | visitor_id
joe | karen
joe | julia

If Karen visits me again, nothing happens as that entry is already there.

Then, for each visit, I also put a row into the visits table:

joe | karen | tuesday
joe | karen | monday
joe | karen | last friday
joe | julia | today

So to know who visited me recently, I do

select visitor_id from visitors_by_user_id where user_id = Joe;

So I get Karen and 
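Stefan's two-table approach can be simulated end to end. A Python sketch where 
dicts stand in for the two tables; the names and timestamps are illustrative, 
and the final client-side sort is the step he proposes doing outside Cassandra.

```python
# Stand-ins for the two tables in the proposal above.
visitors_by_user_id = {}   # user_id -> set of visitor_ids (deduplicated)
visits = {}                # (user_id, visitor_id) -> list of visit timestamps

def record_visit(user_id, visitor_id, ts):
    visitors_by_user_id.setdefault(user_id, set()).add(visitor_id)
    visits.setdefault((user_id, visitor_id), []).append(ts)

for visitor, ts in [("julia", 1), ("julia", 2), ("karen", 3),
                    ("silvia", 4), ("karen", 5), ("karen", 6), ("julia", 7)]:
    record_visit("joe", visitor, ts)

# One query per visitor ("... LIMIT 1" against the DESC-ordered visits
# table), then a client-side sort by latest timestamp.
latest = [(v, max(visits[("joe", v)])) for v in visitors_by_user_id["joe"]]
latest.sort(key=lambda r: -r[1])
print(latest)  # [('julia', 7), ('karen', 6), ('silvia', 4)]
```

Note the output matches exactly the ordering Regis wants for Joe's page: 
Julia at t7, Karen at t6, Silvia at t4, with duplicates collapsed.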

Re: Materialized View inconsistency issue

2023-08-18 Thread Regis Le Bretonnic
Hi Stefan

Happy to see that our use case interests you :-)
I'm not sure that I explained well what we want.

Imagine this sequence of events:
- Julia visits Joe at t1
- Julia visits Joe at t2
- Karen visits Joe at t3
- Silvia visits Joe at t4
- Karen visits Joe at t5
- Karen visits Joe at t6
- Julia visits Joe at t7

We want to provide to Joe a webpage listing the visits he received, in this
order:
- Julia at t7 (the most recent)
- Karen at t6
- Silvia at t4

The 2 tables you propose Stefan can not natively order rows by time (they
will be ordered by visitor_id), except if you sort rows after the select.

Keep in mind that some people can receive a lot of visits in 6 months
(200,000 or 300,000 deduplicated visits, and many more if you keep
duplicate visits), and ordering such a volume of rows in code is not easy (in
fact impossible, because we use PHP and we can't do that in FPM memory...)
... and of course, because we can not render 200,000 or 300,000 member
stickers in a single page in one shot, the webpage requires pagination
(with lots of 100 profiles per page). If you decide that sorting should be
done on the code side, the pagination becomes awful to manage.

PS 1 : when we decided to do this, MVs were not yet back to experimental
PS 2 : the code to manage a visit received is very easy... we just do an
insert in the master table without doing any select before... we just don't
care about what happened in the past...
PS 3 : the pagination is very easy... we just do a
- select * from visits_received_by_date where receiver_id=111 and
visit_date < ...

Miklosovic, Stefan wrote:

> Why can't you do it like this? You would have two tables:
>
> create table visits (user_id bigint, visitor_id bigint, visit_date
> timestamp, primary key ((user_id, visitor_id), visit_date)) order by
> visit_date desc
>
> create table visitors_by_user_id (user_id bigint, visitor_id bigint,
> primary key ((user_id), visitor_id))
>
> The logic behind the second table, visitors_by_user_id, is that you do not
> care if a user visited you twice, because it is primary key + clustering
> column, if same user visits you twice, the second time it would basically
> do nothing, because such entry is already there.
>
> For example:
>
> user_id | visitor_id
> joe | karen
> joe | julia
>
> If Karen visits me again, nothing happens as that entry is already there.
>
> Then if Karen visits me, I put into the second table
>
> joe | karen | tuesday
> joe | karen | monday
> joe | karen | last friday
> joe | julia | today
>
> So to know who visited me recently, I do
>
> select visitor_id from visitors_by_user_id where user_id = Joe;
>
> So I get Karen and Julia
>
> And then for each such visitor I do
>
> select visit_date from visits where user_id = Joe and visitor_id = Julia
> limit 1
>
> 
> From: Regis Le Bretonnic 
> Sent: Tuesday, August 15, 2023 17:49
> To: user@cassandra.apache.org
> Subject: Re: Materialized View inconsistency issue
>
>
>
>
> Hi Josh...
>
> A long (and almost private) message to explain how we fix materialized
> views.
>
> Let me first explain our use case... I work for a European dating website.
> Users can receive visits from other users (typically when someone looks
> at a member profile page), and we want to inform them of each visit
> received (sorted from the most recent to the oldest).
> But imagine that Karen visits my profile page several times... I don't
> want to see all her visits but only the last one. So, we want to
> deduplicate rows (see Karen only once), and order the rows (showing Julia
> who visited me 1 minute ago, Sophia who visited me 3 minutes ago, Karen
> who visited me 10 minutes ago, and so on).
>
> You can not do that in Cassandra. If you want to deduplicate rows by pairs
> of users, the "visit timestamp" can not be in the primary key... and if you
> want to order rows by the "visit timestamp", this field must be in the
> clustering columns and consequently in the primary key. That is just not
> possible!
>
> What we do is:
> - a master table like this :
>
> CREATE TABLE visits_received (
> receiver_id bigint,
> sender_id bigint,
> visit_date timestamp,
> PRIMARY KEY ((receiver_id), sender_id)
> ) WITH CLUSTERING ORDER BY (sender_id ASC);
>
> - and a materialized view like this :
>
> CREATE MATERIALIZED VIEW visits_received_by_date as
> SELECT receiver_id, sender_id, visit_date
> FROM visits_received
> WHERE receiver_id IS NOT NULL AND sender_id IS NOT NULL AND visit_date
> IS NOT NULL
> PRIMARY KEY ((receiver_id), visit_date, sender_id)
> WITH CLUSTERING ORDER BY (visit_date DESC, sender_id ASC);
>
> With this the master table