Re: Big Data Question

2023-08-17 Thread daemeon reiydelle
I started to respond, then realized the other posters and I are not
thinking about the same thing: what is the business case for availability and
data loss/reload/recoverability? You all argue for higher availability and damn
the cost. But no one asked "can you lose access, for 20 minutes, to a
portion of the data, 10 times a year, on a 250 node cluster in AWS, if the
data itself is not lost?" Or would you rather lose access only 1-2 times a year,
at the cost of a 500 node cluster holding the same data?

Then we can discuss 32/64 GB JVM heaps and SSDs.
Arthur C. Clarke famously said that "technology sufficiently advanced is
indistinguishable from magic." Magic is coming, and it's coming for all of us.

Daemeon Reiydelle
email: daeme...@gmail.com
LI: https://www.linkedin.com/in/daemeonreiydelle/
San Francisco 1.415.501.0198 / Skype daemeon.c.m.reiydelle


On Thu, Aug 17, 2023 at 1:53 PM Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> Was assuming reaper did incremental?  That was probably a bad assumption.
>
> nodetool repair -pr
> I know it well now!
>
> :)
>
> -Joe
>
> On 8/17/2023 4:47 PM, Bowen Song via user wrote:
> > I don't have experience with Cassandra on Kubernetes, so I can't
> > comment on that.
> >
> > For repairs, may I interest you in incremental repairs? They will make
> > repairs a hell of a lot faster. Of course, an occasional full repair is
> > still needed, but that's another story.
> >
> >
> > On 17/08/2023 21:36, Joe Obernberger wrote:
> >> Thank you.  Enjoying this conversation.
> >> Agree on blade servers, where each blade has a small number of SSDs.
> >> Yeh/Nah to a kubernetes approach assuming fast persistent storage?  I
> >> think that might be easier to manage.
> >>
> >> In my current benchmarks, the performance is excellent, but the
> >> repairs are painful.  I come from the Hadoop world where it was all
> >> about large servers with lots of disk.
> >> Relatively small number of tables, but some have a high number of
> >> rows, 10bil + - we use spark to run across all the data.
> >>
> >> -Joe
> >>
> >> On 8/17/2023 12:13 PM, Bowen Song via user wrote:
> >>> The optimal node size largely depends on the table schema and
> >>> read/write pattern. In some cases 500 GB per node is too large, but
> >>> in some other cases 10TB per node works totally fine. It's hard to
> >>> estimate that without benchmarking.
> >>>
> >>> Again, just pointing out the obvious, you did not count the off-heap
> >>> memory and page cache. 1TB of RAM for 24GB heap * 40 instances is
> >>> definitely not enough. You'll most likely need between 1.5 and 2 TB
> >>> memory for 40x 24GB heap nodes. You may be better off with blade
> >>> servers than single server with gigantic memory and disk sizes.
> >>>
> >>>
> >>> On 17/08/2023 15:46, Joe Obernberger wrote:
>  Thanks for this - yeah - duh - forgot about replication in my example!
>  So - is 2TBytes per Cassandra instance advisable?  Better to use
>  more/less?  Modern 2U servers can be had with 24 3.8TByte SSDs; so
>  assume 80TBytes per server, you could do:
>  (1024*3)/80 = 39 servers, but you'd have to run 40 instances of
>  Cassandra on each server; maybe 24GB of heap per instance, so a
>  server with 1TByte of RAM would work.
>  Is this what folks would do?
> 
>  -Joe
> 
>  On 8/17/2023 9:13 AM, Bowen Song via user wrote:
> > Just pointing out the obvious, for 1PB of data on nodes with 2TB
> > disk each, you will need far more than 500 nodes.
> >
> > 1, it is unwise to run Cassandra with replication factor 1. It
> > usually makes sense to use RF=3, so 1PB data will cost 3PB of
> > storage space, minimal of 1500 such nodes.
> >
> > 2, depending on the compaction strategy you use and the write
> > access pattern, there's a disk space amplification to consider.
> > For example, with STCS, the disk usage can be many times of the
> > actual live data size.
> >
> > 3, you will need some extra free disk space as temporary space for
> > running compactions.
> >
> > 4, the data is rarely going to be perfectly evenly distributed
> > among all nodes, and you need to take that into consideration and
> > size the nodes based on the node with the most data.
> >
> > 5, enough of bad news, here's a good one. Compression will save
> > you (a lot) of disk space!
> >
> > With all the above considered, you probably will end up with a lot
> > more than the 500 nodes you initially thought. Your choice of
> > compaction strategy and compression ratio can dramatically affect
> > this calculation.
> >
> >
> > On 16/08/2023 16:33, Joe Obernberger wrote:
> >> General question on how to configure Cassandra.  Say I have
> >> 1PByte of data to store.  The general rule of thumb is that each
> >> node (or at least instance of Cassandra) shouldn't handle more
> >> than 2TBytes of disk.  That 

Re: Big Data Question

2023-08-17 Thread Joe Obernberger

I was assuming Reaper did incremental repairs?  That was probably a bad assumption.

nodetool repair -pr
I know it well now!

:)

-Joe

On 8/17/2023 4:47 PM, Bowen Song via user wrote:
I don't have experience with Cassandra on Kubernetes, so I can't 
comment on that.


For repairs, may I interest you in incremental repairs? They will make 
repairs a hell of a lot faster. Of course, an occasional full repair is 
still needed, but that's another story.



On 17/08/2023 21:36, Joe Obernberger wrote:

Thank you.  Enjoying this conversation.
Agree on blade servers, where each blade has a small number of SSDs.  
Yeh/Nah to a kubernetes approach assuming fast persistent storage?  I 
think that might be easier to manage.


In my current benchmarks, the performance is excellent, but the 
repairs are painful.  I come from the Hadoop world where it was all 
about large servers with lots of disk.
Relatively small number of tables, but some have a high number of 
rows, 10bil + - we use spark to run across all the data.


-Joe

On 8/17/2023 12:13 PM, Bowen Song via user wrote:
The optimal node size largely depends on the table schema and 
read/write pattern. In some cases 500 GB per node is too large, but 
in some other cases 10TB per node works totally fine. It's hard to 
estimate that without benchmarking.


Again, just pointing out the obvious, you did not count the off-heap 
memory and page cache. 1TB of RAM for 24GB heap * 40 instances is 
definitely not enough. You'll most likely need between 1.5 and 2 TB 
memory for 40x 24GB heap nodes. You may be better off with blade 
servers than single server with gigantic memory and disk sizes.



On 17/08/2023 15:46, Joe Obernberger wrote:

Thanks for this - yeah - duh - forgot about replication in my example!
So - is 2TBytes per Cassandra instance advisable?  Better to use 
more/less?  Modern 2U servers can be had with 24 3.8TByte SSDs; so 
assume 80TBytes per server, you could do:
(1024*3)/80 = 39 servers, but you'd have to run 40 instances of 
Cassandra on each server; maybe 24GB of heap per instance, so a 
server with 1TByte of RAM would work.

Is this what folks would do?

-Joe

On 8/17/2023 9:13 AM, Bowen Song via user wrote:
Just pointing out the obvious, for 1PB of data on nodes with 2TB 
disk each, you will need far more than 500 nodes.


1, it is unwise to run Cassandra with replication factor 1. It 
usually makes sense to use RF=3, so 1PB data will cost 3PB of 
storage space, minimal of 1500 such nodes.


2, depending on the compaction strategy you use and the write 
access pattern, there's a disk space amplification to consider. 
For example, with STCS, the disk usage can be many times of the 
actual live data size.


3, you will need some extra free disk space as temporary space for 
running compactions.


4, the data is rarely going to be perfectly evenly distributed 
among all nodes, and you need to take that into consideration and 
size the nodes based on the node with the most data.


5, enough of bad news, here's a good one. Compression will save 
you (a lot) of disk space!


With all the above considered, you probably will end up with a lot 
more than the 500 nodes you initially thought. Your choice of 
compaction strategy and compression ratio can dramatically affect 
this calculation.



On 16/08/2023 16:33, Joe Obernberger wrote:
General question on how to configure Cassandra.  Say I have 
1PByte of data to store.  The general rule of thumb is that each 
node (or at least instance of Cassandra) shouldn't handle more 
than 2TBytes of disk.  That means 500 instances of Cassandra.


Assuming you have very fast persistent storage (such as a NetApp, 
PorterWorx etc.), would using Kubernetes or some orchestration 
layer to handle those nodes be a viable approach? Perhaps the 
worker nodes would have enough RAM to run 4 instances (pods) of 
Cassandra, you would need 125 servers.
Another approach is to build your servers with 5 (or more) SSD 
devices - one for OS, four for each instance of Cassandra running 
on that server.  Then build some scripts/ansible/puppet that 
would manage Cassandra start/stops, and other maintenance items.


Where I think this runs into problems is with repairs, or 
sstablescrubs that can take days to run on a single instance. How 
is that handled 'in the real world'?  With seed nodes, how many 
would you have in such a configuration?

Thanks for any thoughts!

-Joe










Re: Big Data Question

2023-08-17 Thread Bowen Song via user
I don't have experience with Cassandra on Kubernetes, so I can't comment 
on that.


For repairs, may I interest you in incremental repairs? They will make 
repairs a hell of a lot faster. Of course, an occasional full repair is 
still needed, but that's another story.
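
A rough sketch of what that looks like on the command line, assuming
Cassandra 4.x (where incremental repair is the default) and a placeholder
keyspace name my_ks:

  # incremental repair: only touches data not yet marked repaired, so it is much faster
  nodetool repair my_ks

  # occasional full repair, as noted above
  nodetool repair -full my_ks

  # full repair restricted to this node's primary token ranges (run it on every node)
  nodetool repair -full -pr my_ks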



On 17/08/2023 21:36, Joe Obernberger wrote:

Thank you.  Enjoying this conversation.
Agree on blade servers, where each blade has a small number of SSDs.  
Yeh/Nah to a kubernetes approach assuming fast persistent storage?  I 
think that might be easier to manage.


In my current benchmarks, the performance is excellent, but the 
repairs are painful.  I come from the Hadoop world where it was all 
about large servers with lots of disk.
Relatively small number of tables, but some have a high number of 
rows, 10bil + - we use spark to run across all the data.


-Joe

On 8/17/2023 12:13 PM, Bowen Song via user wrote:
The optimal node size largely depends on the table schema and 
read/write pattern. In some cases 500 GB per node is too large, but 
in some other cases 10TB per node works totally fine. It's hard to 
estimate that without benchmarking.


Again, just pointing out the obvious, you did not count the off-heap 
memory and page cache. 1TB of RAM for 24GB heap * 40 instances is 
definitely not enough. You'll most likely need between 1.5 and 2 TB 
memory for 40x 24GB heap nodes. You may be better off with blade 
servers than single server with gigantic memory and disk sizes.



On 17/08/2023 15:46, Joe Obernberger wrote:

Thanks for this - yeah - duh - forgot about replication in my example!
So - is 2TBytes per Cassandra instance advisable?  Better to use 
more/less?  Modern 2U servers can be had with 24 3.8TByte SSDs; so 
assume 80TBytes per server, you could do:
(1024*3)/80 = 39 servers, but you'd have to run 40 instances of 
Cassandra on each server; maybe 24GB of heap per instance, so a 
server with 1TByte of RAM would work.

Is this what folks would do?

-Joe

On 8/17/2023 9:13 AM, Bowen Song via user wrote:
Just pointing out the obvious, for 1PB of data on nodes with 2TB 
disk each, you will need far more than 500 nodes.


1, it is unwise to run Cassandra with replication factor 1. It 
usually makes sense to use RF=3, so 1PB data will cost 3PB of 
storage space, minimal of 1500 such nodes.


2, depending on the compaction strategy you use and the write 
access pattern, there's a disk space amplification to consider. For 
example, with STCS, the disk usage can be many times of the actual 
live data size.


3, you will need some extra free disk space as temporary space for 
running compactions.


4, the data is rarely going to be perfectly evenly distributed 
among all nodes, and you need to take that into consideration and 
size the nodes based on the node with the most data.


5, enough of bad news, here's a good one. Compression will save you 
(a lot) of disk space!


With all the above considered, you probably will end up with a lot 
more than the 500 nodes you initially thought. Your choice of 
compaction strategy and compression ratio can dramatically affect 
this calculation.



On 16/08/2023 16:33, Joe Obernberger wrote:
General question on how to configure Cassandra.  Say I have 1PByte 
of data to store.  The general rule of thumb is that each node (or 
at least instance of Cassandra) shouldn't handle more than 2TBytes 
of disk.  That means 500 instances of Cassandra.


Assuming you have very fast persistent storage (such as a NetApp, 
PorterWorx etc.), would using Kubernetes or some orchestration 
layer to handle those nodes be a viable approach? Perhaps the 
worker nodes would have enough RAM to run 4 instances (pods) of 
Cassandra, you would need 125 servers.
Another approach is to build your servers with 5 (or more) SSD 
devices - one for OS, four for each instance of Cassandra running 
on that server.  Then build some scripts/ansible/puppet that would 
manage Cassandra start/stops, and other maintenance items.


Where I think this runs into problems is with repairs, or 
sstablescrubs that can take days to run on a single instance. How 
is that handled 'in the real world'?  With seed nodes, how many 
would you have in such a configuration?

Thanks for any thoughts!

-Joe








Re: Big Data Question

2023-08-17 Thread Joe Obernberger

Thank you.  Enjoying this conversation.
Agree on blade servers, where each blade has a small number of SSDs.
Yeh/Nah to a Kubernetes approach, assuming fast persistent storage?  I 
think that might be easier to manage.


In my current benchmarks, the performance is excellent, but the repairs 
are painful.  I come from the Hadoop world where it was all about large 
servers with lots of disk.
Relatively small number of tables, but some have a high number of rows, 
10 billion+ - we use Spark to run across all the data.


-Joe

On 8/17/2023 12:13 PM, Bowen Song via user wrote:
The optimal node size largely depends on the table schema and 
read/write pattern. In some cases 500 GB per node is too large, but in 
some other cases 10TB per node works totally fine. It's hard to 
estimate that without benchmarking.


Again, just pointing out the obvious, you did not count the off-heap 
memory and page cache. 1TB of RAM for 24GB heap * 40 instances is 
definitely not enough. You'll most likely need between 1.5 and 2 TB 
memory for 40x 24GB heap nodes. You may be better off with blade 
servers than single server with gigantic memory and disk sizes.



On 17/08/2023 15:46, Joe Obernberger wrote:

Thanks for this - yeah - duh - forgot about replication in my example!
So - is 2TBytes per Cassandra instance advisable?  Better to use 
more/less?  Modern 2U servers can be had with 24 3.8TByte SSDs; so 
assume 80TBytes per server, you could do:
(1024*3)/80 = 39 servers, but you'd have to run 40 instances of 
Cassandra on each server; maybe 24GB of heap per instance, so a server 
with 1TByte of RAM would work.

Is this what folks would do?

-Joe

On 8/17/2023 9:13 AM, Bowen Song via user wrote:
Just pointing out the obvious, for 1PB of data on nodes with 2TB 
disk each, you will need far more than 500 nodes.


1, it is unwise to run Cassandra with replication factor 1. It 
usually makes sense to use RF=3, so 1PB data will cost 3PB of 
storage space, minimal of 1500 such nodes.


2, depending on the compaction strategy you use and the write access 
pattern, there's a disk space amplification to consider. For 
example, with STCS, the disk usage can be many times of the actual 
live data size.


3, you will need some extra free disk space as temporary space for 
running compactions.


4, the data is rarely going to be perfectly evenly distributed among 
all nodes, and you need to take that into consideration and size the 
nodes based on the node with the most data.


5, enough of bad news, here's a good one. Compression will save you 
(a lot) of disk space!


With all the above considered, you probably will end up with a lot 
more than the 500 nodes you initially thought. Your choice of 
compaction strategy and compression ratio can dramatically affect 
this calculation.



On 16/08/2023 16:33, Joe Obernberger wrote:
General question on how to configure Cassandra.  Say I have 1PByte 
of data to store.  The general rule of thumb is that each node (or 
at least instance of Cassandra) shouldn't handle more than 2TBytes 
of disk.  That means 500 instances of Cassandra.


Assuming you have very fast persistent storage (such as a NetApp, 
PorterWorx etc.), would using Kubernetes or some orchestration 
layer to handle those nodes be a viable approach? Perhaps the 
worker nodes would have enough RAM to run 4 instances (pods) of 
Cassandra, you would need 125 servers.
Another approach is to build your servers with 5 (or more) SSD 
devices - one for OS, four for each instance of Cassandra running 
on that server.  Then build some scripts/ansible/puppet that would 
manage Cassandra start/stops, and other maintenance items.


Where I think this runs into problems is with repairs, or 
sstablescrubs that can take days to run on a single instance. How 
is that handled 'in the real world'?  With seed nodes, how many 
would you have in such a configuration?

Thanks for any thoughts!

-Joe








Re: Big Data Question

2023-08-17 Thread Bowen Song via user
From my experience, that's not entirely true. For large nodes, the 
bottleneck is usually the JVM garbage collector. The GC pauses can 
easily get out of control on very large heaps, and long STW pauses may 
also result in nodes flapping up and down from other nodes' perspective, 
which often renders the entire cluster unstable.
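
If you want to check whether GC is the culprit on an existing node, a quick
sketch (stock tooling; the log path below is the default for package installs):

  nodetool gcstats                                  # cumulative and max GC pause times since the last call
  grep GCInspector /var/log/cassandra/system.log    # Cassandra logs individual long GC pauses here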


Using RF=1 is also strongly discouraged, even with reliable and durable 
storage. By going with RF=1, you not only lose data replication, but 
also high availability. If any node becomes unavailable in the cluster, 
it will render the entire token range(s) owned by that node 
inaccessible, causing (some or all) CQL queries to fail. This means many 
routine maintenance tasks, such as upgrading and restarting nodes, are 
going to introduce downtime for the cluster. To ensure strong 
consistency and HA, RF=3 is recommended.



On 17/08/2023 20:40, daemeon reiydelle wrote:
A lot of (actually all) seem to be based on local nodes with 1gb 
networks of spinning rust. Much of what is mentioned below is TOTALLY 
wrong for cloud. So clarify whether you are "real world" or rusty slow 
data center world (definitely not modern DC either).


E.g. should not handle more than 2tb of ACTIVE disk, and that was for 
spinning rust with maybe 1gb networks. 10tb of modern high speed SSD 
is more typical with 10 or 40gb networks. If data is persisted to 
cloud storage, replication should be 1, vm's fail over to new 
hardware. Obviously if your storage is ephemeral, you have a different 
discussion. More of a monologue with an idiot in Finance, but 

Arthur C. Clarke famously said that "technology sufficiently advanced 
is indistinguishable from magic." Magic is coming, and it's coming for 
all of us.

Daemeon Reiydelle
email: daeme...@gmail.com
LI: https://www.linkedin.com/in/daemeonreiydelle/
San Francisco 1.415.501.0198 / Skype daemeon.c.m.reiydelle


On Thu, Aug 17, 2023 at 6:13 AM Bowen Song via user 
 wrote:


Just pointing out the obvious, for 1PB of data on nodes with 2TB disk
each, you will need far more than 500 nodes.

1, it is unwise to run Cassandra with replication factor 1. It
usually
makes sense to use RF=3, so 1PB data will cost 3PB of storage space,
minimal of 1500 such nodes.

2, depending on the compaction strategy you use and the write access
pattern, there's a disk space amplification to consider. For example,
with STCS, the disk usage can be many times of the actual live
data size.

3, you will need some extra free disk space as temporary space for
running compactions.

4, the data is rarely going to be perfectly evenly distributed
among all
nodes, and you need to take that into consideration and size the
nodes
based on the node with the most data.

5, enough of bad news, here's a good one. Compression will save
you (a
lot) of disk space!

With all the above considered, you probably will end up with a lot
more
than the 500 nodes you initially thought. Your choice of compaction
strategy and compression ratio can dramatically affect this
calculation.


On 16/08/2023 16:33, Joe Obernberger wrote:
> General question on how to configure Cassandra.  Say I have
1PByte of
> data to store.  The general rule of thumb is that each node (or at
> least instance of Cassandra) shouldn't handle more than 2TBytes of
> disk.  That means 500 instances of Cassandra.
>
> Assuming you have very fast persistent storage (such as a NetApp,
> PorterWorx etc.), would using Kubernetes or some orchestration
layer
> to handle those nodes be a viable approach?  Perhaps the worker
nodes
> would have enough RAM to run 4 instances (pods) of Cassandra, you
> would need 125 servers.
> Another approach is to build your servers with 5 (or more) SSD
devices
> - one for OS, four for each instance of Cassandra running on that
> server.  Then build some scripts/ansible/puppet that would manage
> Cassandra start/stops, and other maintenance items.
>
> Where I think this runs into problems is with repairs, or
> sstablescrubs that can take days to run on a single instance. 
How is
> that handled 'in the real world'?  With seed nodes, how many
would you
> have in such a configuration?
> Thanks for any thoughts!
>
> -Joe
>
>


Re: Big Data Question

2023-08-17 Thread daemeon reiydelle
A lot of (actually all) of these replies seem to be based on local nodes with
1Gb networks of spinning rust. Much of what is mentioned below is TOTALLY wrong
for cloud. So clarify whether you are "real world" or rusty slow data center
world (definitely not a modern DC either).

E.g. a node should not handle more than 2TB of ACTIVE disk, and that was for
spinning rust with maybe 1Gb networks. 10TB of modern high speed SSD is
more typical with 10 or 40Gb networks. If data is persisted to cloud
storage, replication should be 1; VMs fail over to new hardware. Obviously
if your storage is ephemeral, you have a different discussion. More of a
monologue with an idiot in Finance, but 
Arthur C. Clarke famously said that "technology sufficiently advanced is
indistinguishable from magic." Magic is coming, and it's coming for all of us.

Daemeon Reiydelle
email: daeme...@gmail.com
LI: https://www.linkedin.com/in/daemeonreiydelle/
San Francisco 1.415.501.0198 / Skype daemeon.c.m.reiydelle


On Thu, Aug 17, 2023 at 6:13 AM Bowen Song via user <
user@cassandra.apache.org> wrote:

> Just pointing out the obvious, for 1PB of data on nodes with 2TB disk
> each, you will need far more than 500 nodes.
>
> 1, it is unwise to run Cassandra with replication factor 1. It usually
> makes sense to use RF=3, so 1PB data will cost 3PB of storage space,
> minimal of 1500 such nodes.
>
> 2, depending on the compaction strategy you use and the write access
> pattern, there's a disk space amplification to consider. For example,
> with STCS, the disk usage can be many times of the actual live data size.
>
> 3, you will need some extra free disk space as temporary space for
> running compactions.
>
> 4, the data is rarely going to be perfectly evenly distributed among all
> nodes, and you need to take that into consideration and size the nodes
> based on the node with the most data.
>
> 5, enough of bad news, here's a good one. Compression will save you (a
> lot) of disk space!
>
> With all the above considered, you probably will end up with a lot more
> than the 500 nodes you initially thought. Your choice of compaction
> strategy and compression ratio can dramatically affect this calculation.
>
>
> On 16/08/2023 16:33, Joe Obernberger wrote:
> > General question on how to configure Cassandra.  Say I have 1PByte of
> > data to store.  The general rule of thumb is that each node (or at
> > least instance of Cassandra) shouldn't handle more than 2TBytes of
> > disk.  That means 500 instances of Cassandra.
> >
> > Assuming you have very fast persistent storage (such as a NetApp,
> > PorterWorx etc.), would using Kubernetes or some orchestration layer
> > to handle those nodes be a viable approach?  Perhaps the worker nodes
> > would have enough RAM to run 4 instances (pods) of Cassandra, you
> > would need 125 servers.
> > Another approach is to build your servers with 5 (or more) SSD devices
> > - one for OS, four for each instance of Cassandra running on that
> > server.  Then build some scripts/ansible/puppet that would manage
> > Cassandra start/stops, and other maintenance items.
> >
> > Where I think this runs into problems is with repairs, or
> > sstablescrubs that can take days to run on a single instance.  How is
> > that handled 'in the real world'?  With seed nodes, how many would you
> > have in such a configuration?
> > Thanks for any thoughts!
> >
> > -Joe
> >
> >
>


Re: Materialized View inconsistency issue

2023-08-17 Thread Miklosovic, Stefan
Why can't you do it like this? You would have two tables:

create table visits (user_id bigint, visitor_id bigint, visit_date timestamp, 
primary key ((user_id, visitor_id), visit_date)) with clustering order by (visit_date desc)

create table visitors_by_user_id (user_id bigint, visitor_id bigint, primary 
key ((user_id), visitor_id))

The logic behind the second table, visitors_by_user_id, is that you do not care 
whether a user visited you twice: because visitor_id is just a clustering column 
under the user_id partition key, if the same user visits you twice, the second 
insert basically does nothing, because such an entry is already there.

For example:

user_id | visitor_id
joe | karen
joe | julia

If Karen visits me again, nothing happens as that entry is already there.

Then, each time Karen visits me, I also put a row into the visits table:

joe | karen | tuesday
joe | karen | monday
joe | karen | last friday
joe | julia | today

So to know who visited me recently, I do

select visitor_id from visitors_by_user_id where user_id = Joe;

So I get Karen and Julia

And then for each such visitor I do

select visit_date from visits where user_id = Joe and visitor_id = Julia limit 1
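
For completeness, a sketch of the corresponding writes (using text ids to mirror 
the joe/karen example rather than the bigint columns above); the point is that the 
second statement is an upsert on the same primary key, so duplicates collapse by 
themselves:

insert into visits (user_id, visitor_id, visit_date) values ('joe', 'karen', '2023-08-14 10:00:00');
insert into visitors_by_user_id (user_id, visitor_id) values ('joe', 'karen');

-- Karen visits again: visits gets a new row (new clustering value visit_date),
-- while the visitors_by_user_id write just overwrites the existing (joe, karen) row
insert into visits (user_id, visitor_id, visit_date) values ('joe', 'karen', '2023-08-15 09:30:00');
insert into visitors_by_user_id (user_id, visitor_id) values ('joe', 'karen');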


From: Regis Le Bretonnic 
Sent: Tuesday, August 15, 2023 17:49
To: user@cassandra.apache.org
Subject: Re: Materialized View inconsistency issue




Hi Josh...

A long (and almost private) message to explain how we fix materialized views.

Let me first explain our use case... I work for a European dating website.
Users can receive visits from other users (typically when someone looks at a 
member's profile page), and we want to show them each visit received 
(sorted from the most recent one to the oldest one).
But imagine that Karen visits my profile page several times... I don't want to 
see all her visits, only the last one. So we want to deduplicate rows (see 
Karen only once) and order the rows (showing Julia who visited me 1 minute 
ago, Sophia who visited me 3 minutes ago, Karen who visited me 10 minutes ago, 
and so on).

You cannot do that in Cassandra. If you want to deduplicate rows by pairs of 
users, the "visit timestamp" cannot be in the primary key... and if you want 
to order rows by the "visit timestamp", this field must be in the clustering 
columns and consequently in the primary key. That is just not possible!

What we do is:
- a master table like this:

CREATE TABLE visits_received (
receiver_id bigint,
sender_id bigint,
visit_date timestamp,
PRIMARY KEY ((receiver_id), sender_id)
) WITH CLUSTERING ORDER BY (sender_id ASC);

- and a materialized view like this :

CREATE MATERIALIZED VIEW visits_received_by_date as
SELECT receiver_id, sender_id, visit_date
FROM visits_received
WHERE receiver_id IS NOT NULL AND sender_id IS NOT NULL AND visit_date IS 
NOT NULL
PRIMARY KEY ((receiver_id), visit_date, sender_id)
WITH CLUSTERING ORDER BY (visit_date DESC, sender_id ASC);

With this the master table deduplicates, and the MV sorts rows the way we want.


The problems we have are, most of the time, rows that should not exist in the 
MV...
Let's say that I have this row in the master table:
- 111, 222, t3
and that, because of materialized view inconsistency, I have 3 rows in the MV:
- 111, 222, t3
- 111, 222, t2
- 111, 222, t1

Then, to remove the 2 wrong rows in the MV, we do a double insert on the master 
table:
insert (111, 222, t1) + insert (111, 222, t3) -> this removes the row "111, 222, t1"
insert (111, 222, t2) + insert (111, 222, t3) -> this removes the row "111, 222, t2"

We can very, very rarely have other cases (rows in master and not in MV), but 
these are also very easy to fix by just re-inserting the master rows.
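
Spelled out as CQL against the visits_received schema above (a sketch; t1, t2 and 
t3 stand for the actual timestamp values from the example):

-- the base table holds (111, 222, t3); the MV wrongly also has rows for t1 and t2
INSERT INTO visits_received (receiver_id, sender_id, visit_date) VALUES (111, 222, t1);
INSERT INTO visits_received (receiver_id, sender_id, visit_date) VALUES (111, 222, t3);
-- the second insert makes the view delete its (111, t1, 222) row and re-create (111, t3, 222)

INSERT INTO visits_received (receiver_id, sender_id, visit_date) VALUES (111, 222, t2);
INSERT INTO visits_received (receiver_id, sender_id, visit_date) VALUES (111, 222, t3);
-- same trick again to get rid of the (111, t2, 222) row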


Now about our Spark script:
- we download the master table and the MV sequentially
- we compare them to find... "potential inconsistencies" (because the tables 
are not downloaded at the same time and the data can have changed, we can find 
false positive errors)
- we loop over all the "potential inconsistencies" and force a new read on the 
table and the MV, to check whether there is truly an inconsistency when both 
reads are made within a few milliseconds
- if it is a true inconsistency, we force inserts on the master table to fix 
the MV as described above


Now, about the volume of inconsistency:
- on a master table with 1.7 B-rows
- we have ~ 12.5 K-rows that are inconsistent (0.0007%) after 2 years... 
clearly better than what our developers would do by managing inserts and 
deletes themselves (and acceptable for our use case)


On Mon, Aug 14, 2023 at 16:36, Josh McKenzie  wrote:
When it comes to denormalization in Cassandra today your options are to either 
do it 

Re: Big Data Question

2023-08-17 Thread Bowen Song via user
The optimal node size largely depends on the table schema and read/write 
pattern. In some cases 500 GB per node is too large, but in some other 
cases 10TB per node works totally fine. It's hard to estimate that 
without benchmarking.


Again, just pointing out the obvious, you did not count the off-heap 
memory and page cache. 1TB of RAM for a 24GB heap x 40 instances is 
definitely not enough. You'll most likely need between 1.5 and 2 TB of 
memory for 40x 24GB-heap nodes. You may be better off with blade servers 
than a single server with gigantic memory and disk sizes.
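
As a rough back-of-the-envelope (the per-instance off-heap figure below is an 
assumption for illustration, not a measurement):

  40 instances x 24GB heap                        =  960GB of heap alone
  + 40 x ~10GB off-heap per instance (bloom
    filters, index summaries, compression
    metadata, off-heap memtables)                 = ~400GB
  = ~1.4TB before leaving anything for the OS page cache,
    which Cassandra relies on heavily for reads - hence the 1.5 to 2 TB estimate.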



On 17/08/2023 15:46, Joe Obernberger wrote:

Thanks for this - yeah - duh - forgot about replication in my example!
So - is 2TBytes per Cassandra instance advisable?  Better to use 
more/less?  Modern 2U servers can be had with 24 3.8TByte SSDs; so 
assume 80TBytes per server, you could do:
(1024*3)/80 = 39 servers, but you'd have to run 40 instances of 
Cassandra on each server; maybe 24GB of heap per instance, so a server 
with 1TByte of RAM would work.

Is this what folks would do?

-Joe

On 8/17/2023 9:13 AM, Bowen Song via user wrote:
Just pointing out the obvious, for 1PB of data on nodes with 2TB disk 
each, you will need far more than 500 nodes.


1, it is unwise to run Cassandra with replication factor 1. It 
usually makes sense to use RF=3, so 1PB data will cost 3PB of storage 
space, minimal of 1500 such nodes.


2, depending on the compaction strategy you use and the write access 
pattern, there's a disk space amplification to consider. For example, 
with STCS, the disk usage can be many times of the actual live data 
size.


3, you will need some extra free disk space as temporary space for 
running compactions.


4, the data is rarely going to be perfectly evenly distributed among 
all nodes, and you need to take that into consideration and size the 
nodes based on the node with the most data.


5, enough of bad news, here's a good one. Compression will save you 
(a lot) of disk space!


With all the above considered, you probably will end up with a lot 
more than the 500 nodes you initially thought. Your choice of 
compaction strategy and compression ratio can dramatically affect 
this calculation.



On 16/08/2023 16:33, Joe Obernberger wrote:
General question on how to configure Cassandra.  Say I have 1PByte 
of data to store.  The general rule of thumb is that each node (or 
at least instance of Cassandra) shouldn't handle more than 2TBytes 
of disk.  That means 500 instances of Cassandra.


Assuming you have very fast persistent storage (such as a NetApp, 
PorterWorx etc.), would using Kubernetes or some orchestration layer 
to handle those nodes be a viable approach? Perhaps the worker nodes 
would have enough RAM to run 4 instances (pods) of Cassandra, you 
would need 125 servers.
Another approach is to build your servers with 5 (or more) SSD 
devices - one for OS, four for each instance of Cassandra running on 
that server.  Then build some scripts/ansible/puppet that would 
manage Cassandra start/stops, and other maintenance items.


Where I think this runs into problems is with repairs, or 
sstablescrubs that can take days to run on a single instance. How is 
that handled 'in the real world'?  With seed nodes, how many would 
you have in such a configuration?

Thanks for any thoughts!

-Joe






RE: Big Data Question

2023-08-17 Thread Durity, Sean R via user
For a variety of reasons, we have clusters with 5 TB of disk per host as a 
“standard.” In our larger data clusters, it does take longer to add/remove 
nodes or do things like upgradesstables after an upgrade. These nodes have 3+ TB 
of actual data on the drive. But we were able to shrink the node count from 
our days of using 1 or 2 TB of disk. Lots of potential cost tradeoffs to 
consider – licensing/support, server cost, maintenance time, more or fewer 
servers to have failures, number of (expensive?!) switch ports used, etc.

NOTE: this is 3.x experience, not 4.x with faster streaming.

Sean R. Durity



From: Joe Obernberger 
Sent: Thursday, August 17, 2023 10:46 AM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Big Data Question



Thanks for this - yeah - duh - forgot about replication in my example!
So - is 2TBytes per Cassandra instance advisable?  Better to use
more/less?  Modern 2U servers can be had with 24 3.8TByte SSDs; so
assume 80TBytes per server, you could do:
(1024*3)/80 = 39 servers, but you'd have to run 40 instances of
Cassandra on each server; maybe 24GB of heap per instance, so a server
with 1TByte of RAM would work.

Is this what folks would do?

-Joe



On 8/17/2023 9:13 AM, Bowen Song via user wrote:

> Just pointing out the obvious, for 1PB of data on nodes with 2TB disk
> each, you will need far more than 500 nodes.
>
> 1, it is unwise to run Cassandra with replication factor 1. It usually
> makes sense to use RF=3, so 1PB data will cost 3PB of storage space,
> minimal of 1500 such nodes.
>
> 2, depending on the compaction strategy you use and the write access
> pattern, there's a disk space amplification to consider. For example,
> with STCS, the disk usage can be many times of the actual live data size.
>
> 3, you will need some extra free disk space as temporary space for
> running compactions.
>
> 4, the data is rarely going to be perfectly evenly distributed among
> all nodes, and you need to take that into consideration and size the
> nodes based on the node with the most data.
>
> 5, enough of bad news, here's a good one. Compression will save you (a
> lot) of disk space!
>
> With all the above considered, you probably will end up with a lot
> more than the 500 nodes you initially thought. Your choice of
> compaction strategy and compression ratio can dramatically affect this
> calculation.
>
>
> On 16/08/2023 16:33, Joe Obernberger wrote:
>> General question on how to configure Cassandra.  Say I have 1PByte of
>> data to store.  The general rule of thumb is that each node (or at
>> least instance of Cassandra) shouldn't handle more than 2TBytes of
>> disk.  That means 500 instances of Cassandra.
>>
>> Assuming you have very fast persistent storage (such as a NetApp,
>> PorterWorx etc.), would using Kubernetes or some orchestration layer
>> to handle those nodes be a viable approach? Perhaps the worker nodes
>> would have enough RAM to run 4 instances (pods) of Cassandra, you
>> would need 125 servers.
>> Another approach is to build your servers with 5 (or more) SSD
>> devices - one for OS, four for each instance of Cassandra running on
>> that server.  Then build some scripts/ansible/puppet that would
>> manage Cassandra start/stops, and other maintenance items.
>>
>> Where I think this runs into problems is with repairs, or
>> sstablescrubs that can take days to run on a single instance. How is
>> that handled 'in the real world'?  With seed nodes, how many would
>> you have in such a configuration?
>> Thanks for any thoughts!
>>
>> -Joe







Re: Big Data Question

2023-08-17 Thread C. Scott Andreas

A few thoughts on this:

– 80TB per machine is pretty dense. Consider the amount of data you'd need to
re-replicate in the event of a hardware failure that takes down all 80TB (DIMM
failure requiring replacement, non-redundant PSU failure, NIC, etc).

– 24GB of heap is also pretty generous. Depending on how you're using Cassandra,
you may be able to get by with ~half of this (though keep in mind the additional
direct memory/offheap space required if you're using off-heap merkle trees).

– 40 instances per machine can be a lot to manage. You can reduce this and
address multiple physical drives per instance by either RAID-0'ing them together,
or by using Cassandra in a JBOD configuration (multiple data dirs per instance).

– Remember to consider the ratio of available CPU vs. amount of storage you're
addressing per machine in your configuration. It's easy to spec a box that maxes
out on disk without enough oomph to serve user queries and compaction over that
amount of storage.

– You'll want to run some smaller-scale perf testing to determine this ratio. The
good news is that you mostly need to stress the throughput of a replica set
rather than an entire cluster. Small-scale POCs will generally map well to larger
clusters, so long as the total count of Cassandra processes isn't more than a
couple thousand.

– At this scale, small improvements can go a very long way. If your data is
compressible (i.e., not pre-compressed/encrypted prior to being stored in
Cassandra), you'll likely want to use ZStandard rather than LZ4 - and possibly at
a higher ratio than the default. Test a set of input data with different
ZStandard compression levels. You may save > 10% of storage relative to LZ4 by
doing so without sacrificing much in terms of CPU.

On Aug 17, 2023, at 7:46 AM, Joe Obernberger  wrote:

> Thanks for this - yeah - duh - forgot about replication in my example!
> So - is 2TBytes per Cassandra instance advisable?  Better to use
> more/less?  Modern 2U servers can be had with 24 3.8TByte SSDs; so
> assume 80TBytes per server, you could do:
> (1024*3)/80 = 39 servers, but you'd have to run 40 instances of
> Cassandra on each server; maybe 24GB of heap per instance, so a server
> with 1TByte of RAM would work.
> Is this what folks would do?
> -Joe
>
> On 8/17/2023 9:13 AM, Bowen Song via user wrote:
>> Just pointing out the obvious, for 1PB of data on nodes with 2TB disk
>> each, you will need far more than 500 nodes.
>> 1, it is unwise to run Cassandra with replication factor 1. It usually
>> makes sense to use RF=3, so 1PB data will cost 3PB of storage space,
>> minimal of 1500 such nodes.
>> 2, depending on the compaction strategy you use and the write access
>> pattern, there's a disk space amplification to consider. For example,
>> with STCS, the disk usage can be many times of the actual live data size.
>> 3, you will need some extra free disk space as temporary space for
>> running compactions.
>> 4, the data is rarely going to be perfectly evenly distributed among all
>> nodes, and you need to take that into consideration and size the nodes
>> based on the node with the most data.
>> 5, enough of bad news, here's a good one. Compression will save you (a
>> lot) of disk space!
>> With all the above considered, you probably will end up with a lot more
>> than the 500 nodes you initially thought. Your choice of compaction
>> strategy and compression ratio can dramatically affect this calculation.
>>
>> On 16/08/2023 16:33, Joe Obernberger wrote:
>>> General question on how to configure Cassandra.  Say I have 1PByte of
>>> data to store.  The general rule of thumb is that each node (or at
>>> least instance of Cassandra) shouldn't handle more than 2TBytes of
>>> disk.  That means 500 instances of Cassandra.
>>> Assuming you have very fast persistent storage (such as a NetApp,
>>> PorterWorx etc.), would using Kubernetes or some orchestration layer
>>> to handle those nodes be a viable approach? Perhaps the worker nodes
>>> would have enough RAM to run 4 instances (pods) of Cassandra, you
>>> would need 125 servers.
>>> Another approach is to build your servers with 5 (or more) SSD
>>> devices - one for OS, four for each instance of Cassandra running on
>>> that server.  Then build some scripts/ansible/puppet that would
>>> manage Cassandra start/stops, and other maintenance items.
>>> Where I think this runs into problems is with repairs, or
>>> sstablescrubs that can take days to run on a single instance. How is
>>> that handled 'in the real world'?  With seed nodes, how many would
>>> you have in such a configuration?
>>> Thanks for any thoughts!
>>> -Joe
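
Following up on the ZStandard point above, a hedged example of what switching a
table over looks like in CQL (assuming Cassandra 4.x, where ZstdCompressor is
bundled; keyspace, table and level are placeholders):

ALTER TABLE my_ks.my_table
  WITH compression = {'class': 'ZstdCompressor', 'compression_level': 3};

-- existing SSTables keep their old compressor until they are rewritten, e.g. via:
-- nodetool upgradesstables -a my_ks my_table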

Re: Big Data Question

2023-08-17 Thread Joe Obernberger

Thanks for this - yeah - duh - forgot about replication in my example!
So - is 2TBytes per Cassandra instance advisable?  Better to use 
more/less?  Modern 2U servers can be had with 24 3.8TByte SSDs; so 
assume 80TBytes per server, you could do:
(1024*3)/80 = 39 servers, but you'd have to run 40 instances of 
Cassandra on each server; maybe 24GB of heap per instance, so a server 
with 1TByte of RAM would work.

Is this what folks would do?

-Joe

On 8/17/2023 9:13 AM, Bowen Song via user wrote:
Just pointing out the obvious, for 1PB of data on nodes with 2TB disk 
each, you will need far more than 500 nodes.


1, it is unwise to run Cassandra with replication factor 1. It usually 
makes sense to use RF=3, so 1PB data will cost 3PB of storage space, 
minimal of 1500 such nodes.


2, depending on the compaction strategy you use and the write access 
pattern, there's a disk space amplification to consider. For example, 
with STCS, the disk usage can be many times of the actual live data size.


3, you will need some extra free disk space as temporary space for 
running compactions.


4, the data is rarely going to be perfectly evenly distributed among 
all nodes, and you need to take that into consideration and size the 
nodes based on the node with the most data.


5, enough of bad news, here's a good one. Compression will save you (a 
lot) of disk space!


With all the above considered, you probably will end up with a lot 
more than the 500 nodes you initially thought. Your choice of 
compaction strategy and compression ratio can dramatically affect this 
calculation.



On 16/08/2023 16:33, Joe Obernberger wrote:
General question on how to configure Cassandra.  Say I have 1PByte of 
data to store.  The general rule of thumb is that each node (or at 
least instance of Cassandra) shouldn't handle more than 2TBytes of 
disk.  That means 500 instances of Cassandra.


Assuming you have very fast persistent storage (such as a NetApp, 
PorterWorx etc.), would using Kubernetes or some orchestration layer 
to handle those nodes be a viable approach? Perhaps the worker nodes 
would have enough RAM to run 4 instances (pods) of Cassandra, you 
would need 125 servers.
Another approach is to build your servers with 5 (or more) SSD 
devices - one for OS, four for each instance of Cassandra running on 
that server.  Then build some scripts/ansible/puppet that would 
manage Cassandra start/stops, and other maintenance items.


Where I think this runs into problems is with repairs, or 
sstablescrubs that can take days to run on a single instance. How is 
that handled 'in the real world'?  With seed nodes, how many would 
you have in such a configuration?

Thanks for any thoughts!

-Joe






Re: Big Data Question

2023-08-17 Thread Bowen Song via user
Just pointing out the obvious, for 1PB of data on nodes with 2TB disk 
each, you will need far more than 500 nodes.


1, it is unwise to run Cassandra with replication factor 1. It usually 
makes sense to use RF=3, so 1PB of data will cost 3PB of storage space, 
a minimum of 1,500 such nodes.


2, depending on the compaction strategy you use and the write access 
pattern, there's disk space amplification to consider. For example, 
with STCS, the disk usage can be many times the actual live data size.


3, you will need some extra free disk space as temporary space for 
running compactions.


4, the data is rarely going to be perfectly evenly distributed among all 
nodes, and you need to take that into consideration and size the nodes 
based on the node with the most data.


5, enough of the bad news, here's a good one: compression will save you (a 
lot of) disk space!


With all the above considered, you probably will end up with a lot more 
than the 500 nodes you initially thought. Your choice of compaction 
strategy and compression ratio can dramatically affect this calculation.
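
To make that concrete, a rough worked example (the compression ratio and the 
free-space headroom below are illustrative assumptions only):

  1PB live data x RF=3                                = 3PB replicated
  x ~0.5, assuming roughly 2:1 on-disk compression    = ~1.5PB of SSTables
  / ~0.5 usable disk, leaving STCS-style headroom
    for compaction and for imbalance between nodes    = ~3PB of raw disk to provision
  at 2TB of disk per node                             = ~1,500 nodes

With LCS-style headroom (disks ~80% full) the same data lands closer to 
~900-1,000 nodes, which is why the compaction strategy matters so much here.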



On 16/08/2023 16:33, Joe Obernberger wrote:
General question on how to configure Cassandra.  Say I have 1PByte of 
data to store.  The general rule of thumb is that each node (or at 
least instance of Cassandra) shouldn't handle more than 2TBytes of 
disk.  That means 500 instances of Cassandra.


Assuming you have very fast persistent storage (such as a NetApp, 
PorterWorx etc.), would using Kubernetes or some orchestration layer 
to handle those nodes be a viable approach?  Perhaps the worker nodes 
would have enough RAM to run 4 instances (pods) of Cassandra, you 
would need 125 servers.
Another approach is to build your servers with 5 (or more) SSD devices 
- one for OS, four for each instance of Cassandra running on that 
server.  Then build some scripts/ansible/puppet that would manage 
Cassandra start/stops, and other maintenance items.


Where I think this runs into problems is with repairs, or 
sstablescrubs that can take days to run on a single instance.  How is 
that handled 'in the real world'?  With seed nodes, how many would you 
have in such a configuration?

Thanks for any thoughts!

-Joe




Re: 2 nodes marked as '?N' in 5 node cluster

2023-08-17 Thread Bowen Song via user
The first thing to look is the logs, specifically, the 
/var/log/cassandra/system.log file on each node.


5 seconds of time drift is enough to cause Cassandra to fail. You should 
ensure the time difference between Cassandra nodes is very low by making 
sure time sync is working correctly, otherwise cross-node timeouts may 
happen, and a node whose clock is slightly behind may think everything is 
fine, while a node whose clock is slightly ahead will think the other 
nodes are down.
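
A few commands that may help narrow this down (a sketch using stock tooling; run 
them on each of the affected nodes):

  nodetool describecluster    # shows schema versions and which nodes disagree or are unreachable
  nodetool gossipinfo         # what this node currently believes about every other node
  timedatectl                 # confirm NTP sync status (or chronyc tracking / ntpq -p, depending on the distro)
  grep -i gossip /var/log/cassandra/system.log    # gossip errors around the time the '?N' state appeared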



On 08/08/2023 03:54, vishnu vanced wrote:

Hi All,

I am very new to Cassandra. I have a 5-node cluster set up on CentOS 
servers for our internal team's testing. A couple of days ago our network 
team asked us to stop 3 of the nodes, let's say C1, C2 and C3, for an OS 
patching activity. After the activity I started the nodes again, but now, 
interestingly, the C1 node shows C2 as down and the C2 node shows C1 as 
down. On the remaining three nodes everything is UN. I have tried 
disabling gossip and enabling it, and restarting all the nodes; nothing 
changed. So I stopped this cluster and tried to rebuild it from scratch. 
But C1 and C2 only join the cluster if the other node is not present. So 
I first added C1 to the cluster, and C2 only joins when I mention it as a 
seed node. Now in C1's nodetool status C2 is showing as '?N', and 
vice-versa. But the other nodes show everything as 'UN'. I have checked 
connectivity between all the servers and everything is fine. NTP on the 
three stopped servers differs by 5 secs, could that be the problem? But 
the C3 node is not showing any issues.

Because of this, while creating schemas we are getting errors like schema 
version mismatch, and repairs are failing. Can anyone give any solution as 
to how this can be fixed? Thanks!


P.S. Are there any Telegram/WhatsApp groups for Cassandra?

Regards
Vishnu