Re: Send large blobs

2022-05-31 Thread Dor Laor
On Tue, May 31, 2022 at 4:40 PM Andria Trigeorgi 
wrote:

> Hi,
>
> I want to write large blobs in Cassandra. However, when I tried to write
> more than a 256MB blob, I got the message:
> "Error from server: code=2200 [Invalid query] message=\"Request is too
> big: length 268435580 exceeds maximum allowed length 268435456.\"".
>
> I tried to change the variables "max_value_size_in_mb" and "
> native_transport_max_frame_size_in_mb" of the file "
> /etc/cassandra/cassandra.yaml" to 512, but I got a ConnectionRefusedError
> error. What am I doing wrong?
>

You sent a large blob ;)

This limitation exists to protect you as a user.
The DB can store such blobs, but they incur a large and unexpected latency,
not just for the query itself but also for under-the-hood operations like
backup and repair.

It's best either not to store such large blobs in Cassandra at all, or to
chop them into smaller units, say 10MB pieces, and reassemble them in the
app, as in the sketch below.
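
For illustration, a minimal sketch of that chunking approach with the Python
driver; the keyspace, table layout and chunk size here are assumptions, not
anything Cassandra mandates:

    from uuid import uuid4
    from cassandra.cluster import Cluster

    CHUNK_SIZE = 10 * 1024 * 1024   # ~10MB pieces, well under the frame limit

    session = Cluster(["127.0.0.1"]).connect("myks")   # hypothetical keyspace
    session.execute("""
        CREATE TABLE IF NOT EXISTS blobs (
            blob_id  uuid,
            chunk_id int,
            data     blob,
            PRIMARY KEY (blob_id, chunk_id)
        )""")
    insert = session.prepare(
        "INSERT INTO blobs (blob_id, chunk_id, data) VALUES (?, ?, ?)")

    def write_blob(payload):
        # One row per ~10MB slice of the payload.
        blob_id = uuid4()
        for i in range(0, len(payload), CHUNK_SIZE):
            session.execute(insert,
                            (blob_id, i // CHUNK_SIZE, payload[i:i + CHUNK_SIZE]))
        return blob_id

    def read_blob(blob_id):
        # Chunks come back ordered by chunk_id, so joining them restores the blob.
        rows = session.execute("SELECT data FROM blobs WHERE blob_id = %s", (blob_id,))
        return b"".join(row.data for row in rows)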


>
> Thank you in advance,
>
> Andria
>


Re: about the performance of select * from tbl

2022-04-26 Thread Dor Laor
select * reads all of the data from the cluster, so obviously a single query
that is expected to return 'fast' will disappoint. The best approach is to
divide the data set into chunks selected by the token range ownership per
node, so you can query the entire cluster in parallel and maximize the
parallelism.

A rough sketch of this approach is below.
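
A minimal sketch with the Python driver, splitting the Murmur3 token ring
into even chunks rather than by each node's exact ownership; the keyspace,
table and partition key column names are illustrative:

    from concurrent.futures import ThreadPoolExecutor
    from cassandra.cluster import Cluster

    MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1   # full Murmur3 token range
    SPLITS = 256      # more splits -> smaller, cheaper sub-queries
    WORKERS = 16      # tune to what the cluster can absorb

    session = Cluster(["127.0.0.1"]).connect("myks")
    query = session.prepare(
        "SELECT * FROM tbl WHERE token(pk) >= ? AND token(pk) <= ?")

    def token_ranges(splits):
        step = (MAX_TOKEN - MIN_TOKEN) // splits
        for i in range(splits):
            lo = MIN_TOKEN + i * step
            hi = MAX_TOKEN if i == splits - 1 else lo + step - 1
            yield lo, hi

    def scan(bounds):
        lo, hi = bounds
        return list(session.execute(query, (lo, hi)))

    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        for rows in pool.map(scan, token_ranges(SPLITS)):
            for row in rows:
                pass   # process each row here

Aligning the splits with the actual per-node token ownership (available from
the driver's cluster metadata) is better still, since each sub-query then
lands on a single replica, but even a plain even split already spreads the
work across the whole cluster.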

On Tue, Apr 26, 2022 at 3:48 PM 18624049226 <18624049...@163.com> wrote:

> We have a business scenario. We must execute the following statement:
>
> select * from tbl;
>
> This CQL has no WHERE condition.
>
> What I want to ask is that if the data in this table is more than one
> million or more, what methods or parameters can improve the performance of
> this CQL?
>


Re: CDC Tools

2020-05-27 Thread Dor Laor
If it's helpful, IMO the approach Cassandra needs to take isn't to track the
individual node commit logs and put the burden on the client. At Scylla, we
had the 'opportunity' to be a latecomer and see which approach Cassandra
took and which DynamoDB Streams took.

We've implemented CDC as a regular CQL table [1].
Not only is it super easy to consume, it's also consistent, and you can
choose to read the older values.
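
As a rough illustration of what "a regular CQL table" means, here is a
hedged sketch with the Python driver; it assumes Scylla's convention of a
companion <table>_scylla_cdc_log table, and the keyspace/table names are
made up:

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("ks")

    # CDC is just a table property; every change to ks.orders then lands in a
    # companion log table that is read with plain CQL.
    session.execute("ALTER TABLE ks.orders WITH cdc = {'enabled': true}")

    for change in session.execute(
            "SELECT * FROM ks.orders_scylla_cdc_log LIMIT 100"):
        print(change)   # change metadata plus pre/post images, per the cdc options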

I recommend that Cassandra pick up our design, a small contribution back.
We're implementing an OSS Kafka CDC connector too.

Dor

[1]
https://www.scylladb.com/2020/03/31/observing-data-changes-with-change-data-capture-cdc/

On Wed, May 27, 2020 at 5:41 PM Erick Ramirez 
wrote:

> I have looked at DataStax CDC but I think it works only for DSE !
>>
>
> Yes, thanks for the correction.  I just got confirmation myself -- the
> Kafka-Cassandra connector works with OSS C* but the CDC connector relies on
> a DSE feature that's not yet available in OSS C*. Cheers!
>


Re: What does "PER PARTITION LIMIT" means in cql query in cassandra?

2020-05-07 Thread Dor Laor
In your schema's case, each client_id (partition) will return a single row:
the latest 'when', because of the DESC clustering order. Just one, even when
the partition contains many rows (clustering keys). A plain LIMIT caps the
whole result set, while PER PARTITION LIMIT caps how many rows are taken
from each partition.
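
A small sketch with the Python driver against the schema quoted below, just
to make it concrete (the client ids are made up):

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("sampleks")

    rows = session.execute(
        "SELECT client_id, when, md FROM test "
        "WHERE client_id IN (100, 200) PER PARTITION LIMIT 1")
    for row in rows:
        print(row)
    # Even if clients 100 and 200 each have thousands of 'when' rows, this
    # prints at most two rows: the latest 'when' for each client_id.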

On Thu, May 7, 2020 at 12:14 AM Check Peck  wrote:
>
> I have a scylla table as shown below:
>
>
> cqlsh:sampleks> describe table test;
>
>
> CREATE TABLE test (
>
> client_id int,
>
> when timestamp,
>
> process_ids list,
>
> md text,
>
> PRIMARY KEY (client_id, when) ) WITH CLUSTERING ORDER BY (when DESC)
>
> AND bloom_filter_fp_chance = 0.01
>
> AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
>
> AND comment = ''
>
> AND compaction = {'class': 'TimeWindowCompactionStrategy', 
> 'compaction_window_size': '1', 'compaction_window_unit': 'DAYS'}
>
> AND compression = {'sstable_compression': 
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>
> AND crc_check_chance = 1.0
>
> AND dclocal_read_repair_chance = 0.1
>
> AND default_time_to_live = 0
>
> AND gc_grace_seconds = 172800
>
> AND max_index_interval = 1024
>
> AND memtable_flush_period_in_ms = 0
>
> AND min_index_interval = 128
>
> AND read_repair_chance = 0.0
>
> AND speculative_retry = '99.0PERCENTILE';
>
>
> And I see this is how we are querying it. It's been a long time I worked on 
> cassandra so this “PER PARTITION LIMIT” is new thing to me (looks like 
> recently added). Can someone explain what does this do with some example in a 
> layman language? I couldn't find any good doc on that which explains easily.
>
>
> SELECT * FROM test WHERE client_id IN ? PER PARTITION LIMIT 1;

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Disabling Swap for Cassandra

2020-04-16 Thread Dor Laor
On Thu, Apr 16, 2020 at 5:09 PM Kunal  wrote:
>
> Thanks for the responses. Appreciae it.
>
> @Dor, so you are saying if we add "memlock unlimited" in limits.conf, the 
> entire heap (Xms=Xmx) can be locked at startup ? Will this be applied to all 
> Java processes ?  We have couple of Java programs running with the same owner.

Each process is responsible for calling mlock on its own (in the code
itself). I only see mlock in C* under JNA; my knowledge is mostly in Scylla,
so I'm not sure about this. limits.conf just makes sure the limits are high
enough.

You should still configure swap for safety: better to be slow than to crash.
The memory locking is another safety measure and isn't a must. You can also
run your daemons in a separate cgroup and cap their memory usage, as
explained in one of the answers here:
https://stackoverflow.com/questions/12520499/linux-how-to-lock-the-pages-of-a-process-in-memory

>
>
> Thanks
> Kunal
>
> On Thu, Apr 16, 2020 at 4:31 PM Dor Laor  wrote:
>>
>> It is good to configure swap for the OS but exempt Cassandra
>> from swapping. Why is it good? Since you never know the
>> memory utilization of additional agents and processes you or
>> other admins will run on your server.
>>
>> So do configure a swap partition.
>> You can control the eagerness of the kernel by the swappiness
>> sysctl parameter. You can even control it per cgroup:
>> https://askubuntu.com/questions/967588/how-can-i-prevent-certain-process-from-being-swapped
>>
>> You should make sure Cassandra locks its memory so the kernel
>> won't choose its memory to be swapped out (since it will kill
>> your latency). You do it by mlock. Read more on:
>> https://stackoverflow.com/questions/578137/can-i-tell-linux-not-to-swap-out-a-particular-processes-memory
>>
>> The scylla /dist/common/limits.d/scylladb.com looks like this:
>> scylla  -  core unlimited
>> scylla  -  memlock  unlimited
>> scylla  -  nofile   20
>> scylla  -  as   unlimited
>> scylla  -  nproc8096
>>
>> On Thu, Apr 16, 2020 at 3:57 PM Nitan Kainth  wrote:
>> >
>> > Swap is controlled by OS and will use it when running short of memory. I 
>> > don’t think you can disable at Cassandra level
>> >
>> >
>> > Regards,
>> >
>> > Nitan
>> >
>> > Cell: 510 449 9629
>> >
>> >
>> > On Apr 16, 2020, at 5:50 PM, Kunal  wrote:
>> >
>> > 
>> >
>> > Hello,
>> >
>> >
>> >
>> > I need some suggestion from you all. I am new to Cassandra and was reading 
>> > Cassandra best practices. On one document, it was mentioned that Cassandra 
>> > should not be using swap, it degrades the performance.
>> >
>> > My question is instead of disabling swap system wide, can we force 
>> > Cassandra not to use swap? Some documentation suggests to use 
>> > memory_locking_policy in cassandra.yaml.
>> >
>> >
>> > How do I check if our Cassandra already has this parameter and still uses 
>> > swap ? Is there any way i can check this. I already checked cassandra.yaml 
>> > and dont see this parameter. Is there any other place i can check and 
>> > confirm?
>> >
>> >
>> > Also, Can I set memlock parameter to unlimited (64kB default), so entire 
>> > Heap (Xms = Xmx) can be locked at node startup ? Will that help?
>> >
>> >
>> > Or if you have any other suggestions, please let me know.
>> >
>> >
>> >
>> >
>> >
>> > Regards,
>> >
>> > Kunal
>> >
>> >
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>
>
> --
>
>
>
> Regards,
> Kunal Vaid

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Disabling Swap for Cassandra

2020-04-16 Thread Dor Laor
It is good to configure swap for the OS but exempt Cassandra
from swapping. Why is it good? Since you never know the
memory utilization of additional agents and processes you or
other admins will run on your server.

So do configure a swap partition.
You can control the eagerness of the kernel by the swappiness
sysctl parameter. You can even control it per cgroup:
https://askubuntu.com/questions/967588/how-can-i-prevent-certain-process-from-being-swapped

You should make sure Cassandra locks its memory so the kernel
won't choose its memory to be swapped out (since it will kill
your latency). You do it by mlock. Read more on:
https://stackoverflow.com/questions/578137/can-i-tell-linux-not-to-swap-out-a-particular-processes-memory

The scylla /dist/common/limits.d/scylladb.com looks like this:
scylla  -  core unlimited
scylla  -  memlock  unlimited
scylla  -  nofile   20
scylla  -  as   unlimited
scylla  -  nproc8096

On Thu, Apr 16, 2020 at 3:57 PM Nitan Kainth  wrote:
>
> Swap is controlled by OS and will use it when running short of memory. I 
> don’t think you can disable at Cassandra level
>
>
> Regards,
>
> Nitan
>
> Cell: 510 449 9629
>
>
> On Apr 16, 2020, at 5:50 PM, Kunal  wrote:
>
> 
>
> Hello,
>
>
>
> I need some suggestion from you all. I am new to Cassandra and was reading 
> Cassandra best practices. On one document, it was mentioned that Cassandra 
> should not be using swap, it degrades the performance.
>
> My question is instead of disabling swap system wide, can we force Cassandra 
> not to use swap? Some documentation suggests to use memory_locking_policy in 
> cassandra.yaml.
>
>
> How do I check if our Cassandra already has this parameter and still uses 
> swap ? Is there any way i can check this. I already checked cassandra.yaml 
> and dont see this parameter. Is there any other place i can check and confirm?
>
>
> Also, Can I set memlock parameter to unlimited (64kB default), so entire Heap 
> (Xms = Xmx) can be locked at node startup ? Will that help?
>
>
> Or if you have any other suggestions, please let me know.
>
>
>
>
>
> Regards,
>
> Kunal
>
>

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: sstableloader: How much does it actually need?

2020-02-05 Thread Dor Laor
Another option is to use the Spark migrator: it reads a source CQL cluster
and writes to another. It has a validation stage that compares a full scan
of both and reports the diff:
https://github.com/scylladb/scylla-migrator

There are many more ways to clone a cluster. My main recommendation is to
'optimize' for correctness and simplicity first and only optimize for
performance/time last. In the end, machine time for such a rare operation is
cheap, engineering time is expensive, and data inconsistency is priceless...

On Wed, Feb 5, 2020 at 5:24 PM Sergio  wrote:
>
> Another option is the DSE-bulk loader but it will require to convert to 
> csv/json (good option if you don't like to play with sstableloader and deal 
> to get all the sstables from all the nodes)
> https://docs.datastax.com/en/dsbulk/doc/index.html
>
> Cheers
>
> Sergio
>
> Il giorno mer 5 feb 2020 alle ore 16:56 Erick Ramirez  
> ha scritto:
>>
>> Unfortunately, there isn't a guarantee that 2 nodes alone will have the full 
>> copy of data. I'd rather not say "it depends".
>>
>> TIP: If the nodes in the target cluster have identical tokens allocated, you 
>> can just do a straight copy of the sstables node-for-node then do nodetool 
>> refresh. If the target cluster is already built and you can't assign the 
>> same tokens then sstableloader is your only option. Cheers!
>>
>> P.S. No need to apologise for asking questions. That's what we're all here 
>> for. Just keep them coming.

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: [EXTERNAL] Re: *URGENT* Migration across different Cassandra cluster few having same keyspace/table names

2020-01-17 Thread Dor Laor
Another option instead of raw sstables is to use the Spark Migrator [1].
It reads a source cluster, can apply some transformations (like table/column
renaming), and writes to a target cluster. It's a very convenient tool, OSS
and free of charge.

[1] https://github.com/scylladb/scylla-migrator

On Fri, Jan 17, 2020 at 5:31 PM Erick Ramirez  wrote:
>>
>> In terms of speed, the sstableloader should be faster correct?
>> Maybe the DSE BulkLoader finds application when you want a slice of the data 
>> and not the entire cake. Is it correct?
>
>
> There's no real direct comparison because DSBulk is designed for operating on 
> data in CSV or JSON as a replacement for the COPY command. Cheers!
>
> On Sat, Jan 18, 2020 at 6:29 AM Sergio  wrote:
>>
>> Hi everyone,
>>
>> Is the DSE BulkLoader faster than the sstableloader?
>>
>> Sometimes I need to make a cluster snapshot and replicate a Cluster A to a 
>> Cluster B  with fewer performance capabilities but the same data size.
>>
>> In terms of speed, the sstableloader should be faster correct?
>>
>> Maybe the DSE BulkLoader finds application when you want a slice of the data 
>> and not the entire cake. Is it correct?
>>
>> Thanks,
>>
>> Sergio
>>
>>

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Dynamo autoscaling: does it beat cassandra?

2019-12-10 Thread Dor Laor
Compression in 3.x is much better than in 2.y (see the attached graph from
one of our customers (Scylla)).
However, it's not related to Dynamo's hot-partition and caching behavior. In
Dynamo, every tablet has its own limits and caching isn't taken into
account. Once the throughput goes beyond the tablet reservation/maximum, the
tablet is either split or gets capped. So Zipfian workloads get penalized a
lot.

Both Cassandra and Scylla would rather have uniform workloads, but they will
handle Zipfian ones better.

On Tue, Dec 10, 2019 at 12:19 PM Carl Mueller
 wrote:

> Dor and Reid: thanks, that was very helpful.
>
> Is the large amount of compression an artifact of pre-cass3.11 where the
> column names were per-cell (combined with the cluster key for extreme
> verbosity, I think), so compression would at least be effective against
> those portions of the sstable data? IIRC the cass commiters figured as long
> as you can shrink the data, the reduced size drops the time to read off of
> the disk, maybe even the time to get into CPU cache from memory and the CPU
> to decompress is somewhat "free" at that point since everything else is
> stalled for I/O or memory reads?
>
> But I don't know how the 3.11.x format works to avoid spamming of those
> column names, I haven't torn into that part of the code.
>
> On Tue, Dec 10, 2019 at 10:15 AM Reid Pinchback <
> rpinchb...@tripadvisor.com> wrote:
>
>> Note that DynamoDB I/O throughput scaling doesn’t work well with brief
>> spikes.  Unless you write your own machinery to manage the provisioning, by
>> the time AWS scales the I/O bandwidth your incident has long since passed.
>> It’s not a thing to rely on if you have a latency SLA.  It really only
>> works for situations like a sustained alteration in load, e.g. if you have
>> a sinusoidal daily traffic pattern, or periodic large batch operations that
>> run for an hour or two, and you need the I/O adjustment while that takes
>> place.
>>
>>
>>
>> Also note that DynamoDB routinely chokes on write contention, which C*
>> would rarely do.  About the only benefit DynamoDB has over C* is that more
>> of its operations function as atomic mutations of an existing row.
>>
>>
>>
>> One thing to also factor into the comparison is developer effort.  The
>> DynamoDB API isn’t exactly tuned to making developers productive.  Most of
>> the AWS APIs aren’t, really, once you use them for non-toy projects. AWS
>> scales in many dimensions, but total developer effort is not one of them
>> when you are talking about high-volume tier one production systems.
>>
>>
>>
>> To respond to one of the other original points/questions, yes key and row
>> caches don’t seem to be a win, but that would vary with your specific usage
>> pattern.  Caches need a good enough hit rate to offset the GC impact.  Even
>> when C* lets you move things off heap, you’ll see a fair number of GC-able
>> artifacts associated with data in caches.  Chunk cache somewhat wins with
>> being off-heap, because it isn’t just I/O avoidance with that cache, you’re
>> also benefitting from the decompression.  However I’ve started to wonder
>> how often sstable compression is worth the performance drag and internal C*
>> complexity.  If you compare to where a more traditional RDBMS would use
>> compression, e.g. Postgres, use of compression is more selective; you only
>> bear the cost in the places already determined to win from the tradeoff.
>>
>>
>>
>> *From: *Dor Laor 
>> *Reply-To: *"user@cassandra.apache.org" 
>> *Date: *Monday, December 9, 2019 at 5:58 PM
>> *To: *"user@cassandra.apache.org" 
>> *Subject: *Re: Dynamo autoscaling: does it beat cassandra?
>>
>>
>>
>> *Message from External Sender*
>>
>> The DynamoDB model has several key benefits over Cassandra's.
>>
>> The most notable one is the tablet concept - data is partitioned into 10GB
>>
>> chunks. So scaling happens where such a tablet reaches maximum capacity
>>
>> and it is automatically divided to two. It can happen in parallel across
>> the entire
>>
>> data set, thus there is no concept of growing the amount of nodes or
>> vnodes.
>>
>> As the actual hardware is multi-tenant, the average server should have
>> plenty
>>
>> of capacity to receive these streams.
>>
>>
>>
>> That said, when we benchmarked DynamoDB and just hit it with ingest
>> workload,
>>
>> even when it was reserved, we had to slow down the pace since we received
>> many
>>
>> 'error 500' which means internal server errors. Their hot partitions do
>> not behave great as well.

Re: Dynamo autoscaling: does it beat cassandra?

2019-12-09 Thread Dor Laor
The DynamoDB model has several key benefits over Cassandra's.
The most notable one is the tablet concept: data is partitioned into 10GB
chunks, so scaling happens when such a tablet reaches maximum capacity and
it is automatically divided in two. This can happen in parallel across the
entire data set, thus there is no concept of growing the number of nodes or
vnodes. As the actual hardware is multi-tenant, the average server should
have plenty of capacity to receive these streams.

That said, when we benchmarked DynamoDB and just hit it with an ingest
workload, even when it was reserved, we had to slow down the pace since we
received many 'error 500' responses, which mean internal server errors.
Their hot partitions do not behave great either.

So I believe a growth of 10% in capacity with good key distribution can be
handled well, but a growth of 2x in a short time will fail. That's something
you'd expect from any database, but Dynamo has an advantage with tablets and
multi-tenancy, while it has issues with hot partitions and accounting of hot
keys, which Cassandra will cache better.

Dynamo also allows you to detach compute from storage, which is a key
benefit in a serverless, spiky deployment.

On Mon, Dec 9, 2019 at 1:02 PM Jeff Jirsa  wrote:

> Expansion probably much faster in 4.0 with complete sstable streaming
> (skips ser/deser), though that may have diminishing returns with vnodes
> unless you're using LCS.
>
> Dynamo on demand / autoscaling isn't magic - they're overprovisioning to
> give you the burst, then expanding on demand. That overprovisioning comes
> with a cost. Unless you're actively and regularly scaling, you're probably
> going to pay more for it.
>
> It'd be cool if someone focused on this - I think the faster streaming
> goes a long way. The way vnodes work today make it difficult to add more
> than one at a time without violating consistency, and thats unlikely to
> change, but if each individual node is much faster, that may mask it a bit.
>
>
>
> On Mon, Dec 9, 2019 at 12:35 PM Carl Mueller
>  wrote:
>
>> Dynamo salespeople have been pushing autoscaling abilities that have been
>> one of the key temptations to our management to switch off of cassandra.
>>
>> Has anyone done any numbers on how well dynamo will autoscale demand
>> spikes, and how we could architect cassandra to compete with such abilities?
>>
>> We probably could overprovision and with the presumably higher cost of
>> dynamo beat it, although the sales engineers claim they are closing the
>> cost factor too. We could vertically scale to some degree, but node
>> expansion seems close.
>>
>> VNode expansion is still limited to one at a time?
>>
>> We use VNodes so we can't do netflix's cluster doubling, correct? With
>> cass 4.0's alleged segregation of the data by token we could though and
>> possibly also "prep" the node by having the necessary sstables already
>> present ahead of time?
>>
>> There's always "caching" too, but there isn't a lot of data on general
>> fronting of cassandra with caches, and the row cache continues to be mostly
>> useless?
>>
>


Re: [EXTERNAL] Re: Good way of configuring Apache spark with Apache Cassandra

2019-01-09 Thread Dor Laor
On Wed, Jan 9, 2019 at 7:28 AM Durity, Sean R 
wrote:

> I think you could consider option C: Create a (new) analytics DC in
> Cassandra and run your spark nodes there. Then you can address the scaling
> just on that DC. You can also use less vnodes, only replicate certain
> keyspaces, etc. in order to perform the analytics more efficiently.
>

But this way you duplicate the entire dataset another RF times over. It's
very expensive.
It is common practice to run Spark on a separate Cassandra (virtual)
datacenter, but that is done to isolate the analytic workload from the
realtime workload and to keep low-latency guarantees for the latter.
We addressed this problem elsewhere, beyond this scope.


>
>
>
> Sean Durity
>
>
>
> *From:* Dor Laor 
> *Sent:* Friday, January 04, 2019 4:21 PM
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] Re: Good way of configuring Apache spark with
> Apache Cassandra
>
>
>
> I strongly recommend option B, separate clusters. Reasons:
>
>  - Networking of node-node is negligible compared to networking within the
> node
>
>  - Different scaling considerations
>
>Your workload may require 10 Spark nodes and 20 database nodes, so why
> bundle them?
>
>This ratio may also change over time as your application evolves and
> amount of data changes.
>
>  - Isolation - If Spark has a spike in cpu/IO utilization, you wouldn't
> want it to affect Cassandra and the opposite.
>
>If you isolate it with cgroups, you may have too much idle time when
> the above doesn't happen.
>
>
>
>
>
> On Fri, Jan 4, 2019 at 12:47 PM Goutham reddy 
> wrote:
>
> Hi,
>
> We have requirement of heavy data lifting and analytics requirement and
> decided to go with Apache Spark. In the process we have come up with two
> patterns
>
> a. Apache Spark and Apache Cassandra co-located and shared on same nodes.
>
> b. Apache Spark on one independent cluster and Apache Cassandra as one
> independent cluster.
>
>
>
> Need good pattern how to use the analytic engine for Cassandra. Thanks in
> advance.
>
>
>
> Regards
>
> Goutham.
>
>
> --
>
> The information in this Internet Email is confidential and may be legally
> privileged. It is intended solely for the addressee. Access to this Email
> by anyone else is unauthorized. If you are not the intended recipient, any
> disclosure, copying, distribution or any action taken or omitted to be
> taken in reliance on it, is prohibited and may be unlawful. When addressed
> to our clients any opinions or advice contained in this Email are subject
> to the terms and conditions expressed in any applicable governing The Home
> Depot terms of business or client engagement letter. The Home Depot
> disclaims all responsibility and liability for the accuracy and content of
> this attachment and for any damages or losses arising from any
> inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other
> items of a destructive nature, which may be contained in this attachment
> and shall not be liable for direct, indirect, consequential or special
> damages in connection with this e-mail message or its attachment.
>


Re: Cassandra Splitting databases

2019-01-06 Thread Dor Laor
Basically it's all variations of the ALTER KEYSPACE command, depending on
your actual starting point, along the lines of (the data center names are
placeholders):

ALTER KEYSPACE mykeyspace WITH replication = { 'class' :
'NetworkTopologyStrategy', '<onprem_dc>' : 3, '<azure_dc>' : 4 };

The best approach is to play a bit with a multi-DC setup using docker images
and repeat it with a test dataset until you are confident about the commands
and their outcome. This example should work with Cassandra:
https://www.scylladb.com/2018/03/28/mms-day7-multidatacenter-consistency/


On Sat, Jan 5, 2019 at 5:57 AM R1 J1  wrote:

> Dor Laor,
> I like your approach. If I restrict the replication factor of a keyspace
> to on premise  data center and another to azure and attempt to split the
> cluster?
> Do you have some documentation I can refer to ?
>
> Regards
> R1J1
>
> On Fri, Jan 4, 2019 at 5:32 PM Dor Laor  wrote:
>
>> Not sure I understand correctly but if you have one cluster with 2
>> separate datacenters
>> you can define keyspace A to be on premise with a single DC and keyspace
>> B only on Azure.
>>
>>
>> On Fri, Jan 4, 2019 at 2:23 PM R1 J1  wrote:
>>
>>> We currently  have  2 databases (A and B ) on a 6 node cluster.
>>> 3 nodes are on premise and 3 in azure.   I want  database A to live on
>>> onpremise cluster and  I want Database B to stay in the Azure.  I want to
>>> then split the cluster into 2 clusters one onpremise (3 node )  having
>>> Database A and other in Azure (3 node ) having Database B.
>>>
>>> How do we accomplish such a split ?
>>>
>>>
>>> Regards
>>> R1J1
>>>
>>


Re: Cassandra Splitting databases

2019-01-04 Thread Dor Laor
Not sure I understand correctly, but if you have one cluster with 2 separate
datacenters, you can define keyspace A to live only on the on-premises DC
and keyspace B only on Azure.


On Fri, Jan 4, 2019 at 2:23 PM R1 J1  wrote:

> We currently  have  2 databases (A and B ) on a 6 node cluster.
> 3 nodes are on premise and 3 in azure.   I want  database A to live on
> onpremise cluster and  I want Database B to stay in the Azure.  I want to
> then split the cluster into 2 clusters one onpremise (3 node )  having
> Database A and other in Azure (3 node ) having Database B.
>
> How do we accomplish such a split ?
>
>
> Regards
> R1J1
>


Re: Good way of configuring Apache spark with Apache Cassandra

2019-01-04 Thread Dor Laor
I strongly recommend option B, separate clusters. Reasons:
 - Networking node-to-node is negligible compared to networking within the
   node.
 - Different scaling considerations: your workload may require 10 Spark
   nodes and 20 database nodes, so why bundle them? This ratio may also
   change over time as your application evolves and the amount of data
   changes.
 - Isolation: if Spark has a spike in CPU/IO utilization, you wouldn't want
   it to affect Cassandra, and vice versa. If you isolate them with cgroups,
   you may have too much idle time when the above doesn't happen.


On Fri, Jan 4, 2019 at 12:47 PM Goutham reddy 
wrote:

> Hi,
> We have requirement of heavy data lifting and analytics requirement and
> decided to go with Apache Spark. In the process we have come up with two
> patterns
> a. Apache Spark and Apache Cassandra co-located and shared on same nodes.
> b. Apache Spark on one independent cluster and Apache Cassandra as one
> independent cluster.
>
> Need good pattern how to use the analytic engine for Cassandra. Thanks in
> advance.
>
> Regards
> Goutham.
>


Re: Migrating from DSE5.1.2 to Opensource cassandra

2018-12-05 Thread Dor Laor
An alternative approach is to form a new cluster and leave the original
cluster alive (many times that's a must, since it needs to be online 24x7).
Double-write to the two clusters and later migrate the historical data to
the new one, either by taking a snapshot and passing those files to the new
cluster or with sstableloader. With this procedure, you'll need the same
token range ownership on both clusters.
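
A minimal sketch of the double-write stage with the Python driver; cluster
addresses, keyspace and table are made-up placeholders:

    from cassandra.cluster import Cluster

    old_session = Cluster(["10.0.0.10"]).connect("ks")   # original cluster
    new_session = Cluster(["10.1.0.10"]).connect("ks")   # new cluster, same schema

    insert_cql = "INSERT INTO events (id, payload) VALUES (%s, %s)"

    def write(event_id, payload):
        # Application writes go to both clusters; once the historical backfill
        # is verified, reads (and finally writes) are cut over to the new one.
        old_session.execute(insert_cql, (event_id, payload))
        new_session.execute(insert_cql, (event_id, payload))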

Another solution is to migrate using Spark, which will do a full table scan.
We have generic code that does it and we can open source it. This way the
new cluster can be of any size, and speed is also good with large amounts of
data (100s of TB). This process is also restartable, which matters since it
takes days to transfer such an amount of data.

Good luck

On Tue, Dec 4, 2018 at 9:04 PM dinesh.jo...@yahoo.com.INVALID
 wrote:

> Thanks, nice summary of the overall process.
>
> Dinesh
>
>
> On Tuesday, December 4, 2018, 9:38:47 PM EST, Jonathan Koppenhofer <
> j...@koppedomain.com> wrote:
>
>
> Unfortunately, we found this to be a little tricky. We did migrations from
> DSE 4.8 and 5.0 to OSS 3.0.x, so you may run into additional issues. I will
> also say your best option may be to install a fresh cluster and stream the
> data. This wasn't feasible for us at the size and scale in the time frames
> and infrastructure restrictions we had. I will have to review my notes for
> more detail, but off the top of my head, for an in place migration...
>
> Pre-upgrade
> * Be sure you are not using any Enterprise features like Search or Graph.
> Not only are there not equivalent features in open source, but theses
> features require proprietary classes to be in the classpath, or Cassandra
> will not even start up.
> * By default, I think DSE uses their own custom authenticators,
> authorizors, and such. Make sure what you are doing has an open source
> equivalent.
> * The DSE system keyapaces use custom replication strategies. Convert
> these to NTS before upgrade.
> * Otherwise, follow the same processes you would do before an upgrade
> (repair, snapshot, etc)
>
> Upgrade
> * The easy part is just replacing the binaries as you would in normal
> upgrade. Drain and stop the existing node first. You can also do this same
> process in a rolling fashion to maintain availability. In our case, we were
> doing an in-place upgrade and reusing the same IPs
> * DSE unfortunately creates a custom column in a system table that
> requires you to remove one (or more) system tables (peers?) to be able to
> start the node. You delete these system tables by  removing the sstbles on
> disk while the node is down. This is a bit of a headache if using vnodes.
> As we are using vnodes, it required us to manually specify num tokens, and
> the specific tokens the node was responsible for in Cassandra.yaml. You
> have to do this before you start the node. If not using vnodes, this is
> simpler, but we used vnodes. Again, I'll double check my notes. Once the
> node is up, you can revert to your normal vnodes/num tokens settings.
>
> Post upgrade:
> * Drop DSE system tables.
>
> I'll revert with more detail if needed.
>
> On Tue, Dec 4, 2018, 5:46 PM Nandakishore Tokala <
> nandakishore.tok...@gmail.com wrote:
>
> HI All,
>
> we are migrating from DSE to open source Cassandra. if anyone has recently
> migrated, Can you please share their experience, steps you followed and
> challenges you guys faced.
>
> we want to migrate to the same computable version in open source, can you
> give us version number(even with the minor version) for DSE 5.1.2
>
> 5.1 DSE production-certified 3.10 + enhancements 3.4 + enhancements big m
>
> --
> Thanks & Regards,
> Nanda Kishore
>
>


Re: Amazon Time Sync Service + ntpd vs chrony

2018-03-08 Thread Dor Laor
There is no one size fits all, but take a look at this scenario:

    T0 ------------------------------ T1
    Op0: client deletes cell X        Op1: client writes Y to cell X

T0 < T1 in the real world.

When using client timestamps, T0 clearly happens before T1.
If you use server timestamps, even with Chrony there is a difference of
milliseconds between nodes: https://chrony.tuxfamily.org/

So if Op0 was handled by node S0 and Op1 was handled by node S1, where
time(S0) > time(S1), you end up with an empty cell X: the delete gets the
later timestamp and shadows the write, even though the write happened later
in real time.

Of course the same can happen when multiple clients issue multiple
operations, but at least from the perspective of a single client's logic
things are coherent.
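
As a concrete illustration, a hedged sketch of a client-generated timestamp
with the Python driver (the table and values are made up); many drivers can
also attach a client-side timestamp automatically, but the point is the
same: the data generator owns the clock.

    import time
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("ks")

    # Microsecond timestamp generated by the writer, not by whichever
    # coordinator happens to receive the request.
    now_us = int(time.time() * 1_000_000)
    session.execute(
        "INSERT INTO t (pk, v) VALUES (%s, %s) USING TIMESTAMP %s",
        (1, "y", now_us))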



On Thu, Mar 8, 2018 at 6:46 PM, Ben Bromhead <b...@instaclustr.com> wrote:

> I wouldn't 100% rely on your clients to generate timestamps (actually
> don't 100% rely on timestamps at all!) .
>
> Clients tend to be stateless, scaled up and down,  stopped, started, ntp
> takes time to skew a clock and they are more likely to be moved between
> hypervisor's in cloud environments etc. All these combine to give you more
> scenarios where clients have unreliable clocks that are not roughly in sync
> with each other.
>
> By far and large the worst time related bugs I've experienced are due to
> Cassandra clients having the wrong timestamp set for it's writetime.
>
> Of course this depends on what you are prioritising... relative accuracy
> of any given writetime on one row to any other given row or just respecting
> what the client thinks is right.
>
>
> On Thu, Mar 8, 2018, 21:24 Jeff Jirsa <jji...@gmail.com> wrote:
>
>> Clients can race (and go backward), so the more computer answer tends to
>> be to use LWT/CAS to guarantee state if you have a data model where it
>> matters.
>>
>>
>> --
>> Jeff Jirsa
>>
>>
>> On Mar 8, 2018, at 6:18 PM, Dor Laor <d...@scylladb.com> wrote:
>>
>> While NTP on the servers is important, make sure that you use client
>> timestamps and
>> not server. Since the last write wins, the data generator should be the
>> one setting its timestamp.
>>
>> On Thu, Mar 8, 2018 at 2:12 PM, Ben Slater <ben.sla...@instaclustr.com>
>> wrote:
>>
>>> It is important to make sure you are using the same NTP servers across
>>> your cluster - we used to see relatively frequent NTP issues across our
>>> fleet using default/public NTP servers until (back in 2015) we implemented
>>> our own NTP pool (see https://www.instaclustr.com/apache-cassandra-
>>> synchronization/ which references some really good and detailed posts
>>> from logentries.com on the potential issues).
>>>
>>> Cheers
>>> Ben
>>>
>>> On Fri, 9 Mar 2018 at 02:07 Michael Shuler <mich...@pbandjelly.org>
>>> wrote:
>>>
>>>> As long as your nodes are syncing time using the same method, that
>>>> should be good. Don't mix daemons, however, since they may sync from
>>>> different sources. Whether you use ntpd, openntp, ntpsec, chrony isn't
>>>> really important, since they are all just background daemons to sync the
>>>> system clock. There is nothing Cassandra-specific.
>>>>
>>>> --
>>>> Kind regards,
>>>> Michael
>>>>
>>>> On 03/08/2018 04:15 AM, Kyrylo Lebediev wrote:
>>>> > Hi!
>>>> >
>>>> > Recently Amazon announced launch of Amazon Time Sync Service
>>>> > (https://aws.amazon.com/blogs/aws/keeping-time-with-amazon-
>>>> time-sync-service/)
>>>> > and now it's AWS-recommended way for time sync on EC2 instances
>>>> > (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/set-time.html).
>>>> > It's stated there that chrony is faster / more precise than ntpd.
>>>> >
>>>> > Nothing to say correct time sync configuration is very important for
>>>> any
>>>> > C* setup.
>>>> >
>>>> > Does anybody have positive experience using crony, Amazon Time Sync
>>>> > Service with Cassandra and/or combination of them?
>>>> > Any concerns regarding chrony + Amazon Time Sync Service + Cassandra?
>>>> > Are there any chrony best-practices/custom settings for C* setups?
>>>> >
>>>> > Thanks,
>>>> > Kyrill
>>>> >
>>>>
>>>>
>>>> -
>>>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache

Re: Amazon Time Sync Service + ntpd vs chrony

2018-03-08 Thread Dor Laor
Agree! When using client timestamps, NTP should be running on the clients as
well.


On Thu, Mar 8, 2018 at 6:24 PM, Jeff Jirsa <jji...@gmail.com> wrote:

> Clients can race (and go backward), so the more computer answer tends to
> be to use LWT/CAS to guarantee state if you have a data model where it
> matters.
>
> --
> Jeff Jirsa
>
>
> On Mar 8, 2018, at 6:18 PM, Dor Laor <d...@scylladb.com> wrote:
>
> While NTP on the servers is important, make sure that you use client
> timestamps and
> not server. Since the last write wins, the data generator should be the
> one setting its timestamp.
>
> On Thu, Mar 8, 2018 at 2:12 PM, Ben Slater <ben.sla...@instaclustr.com>
> wrote:
>
>> It is important to make sure you are using the same NTP servers across
>> your cluster - we used to see relatively frequent NTP issues across our
>> fleet using default/public NTP servers until (back in 2015) we implemented
>> our own NTP pool (see https://www.instaclustr.c
>> om/apache-cassandra-synchronization/ which references some really good
>> and detailed posts from logentries.com on the potential issues).
>>
>> Cheers
>> Ben
>>
>> On Fri, 9 Mar 2018 at 02:07 Michael Shuler <mich...@pbandjelly.org>
>> wrote:
>>
>>> As long as your nodes are syncing time using the same method, that
>>> should be good. Don't mix daemons, however, since they may sync from
>>> different sources. Whether you use ntpd, openntp, ntpsec, chrony isn't
>>> really important, since they are all just background daemons to sync the
>>> system clock. There is nothing Cassandra-specific.
>>>
>>> --
>>> Kind regards,
>>> Michael
>>>
>>> On 03/08/2018 04:15 AM, Kyrylo Lebediev wrote:
>>> > Hi!
>>> >
>>> > Recently Amazon announced launch of Amazon Time Sync Service
>>> > (https://aws.amazon.com/blogs/aws/keeping-time-with-amazon-t
>>> ime-sync-service/)
>>> > and now it's AWS-recommended way for time sync on EC2 instances
>>> > (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/set-time.html).
>>> > It's stated there that chrony is faster / more precise than ntpd.
>>> >
>>> > Nothing to say correct time sync configuration is very important for
>>> any
>>> > C* setup.
>>> >
>>> > Does anybody have positive experience using crony, Amazon Time Sync
>>> > Service with Cassandra and/or combination of them?
>>> > Any concerns regarding chrony + Amazon Time Sync Service + Cassandra?
>>> > Are there any chrony best-practices/custom settings for C* setups?
>>> >
>>> > Thanks,
>>> > Kyrill
>>> >
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>>
>>> --
>>
>>
>> *Ben Slater*
>>
>> *Chief Product Officer <https://www.instaclustr.com/>*
>>
>> <https://www.facebook.com/instaclustr>
>> <https://twitter.com/instaclustr>
>> <https://www.linkedin.com/company/instaclustr>
>>
>> Read our latest technical blog posts here
>> <https://www.instaclustr.com/blog/>.
>>
>> This email has been sent on behalf of Instaclustr Pty. Limited
>> (Australia) and Instaclustr Inc (USA).
>>
>> This email and any attachments may contain confidential and legally
>> privileged information.  If you are not the intended recipient, do not copy
>> or disclose its content, but please reply to this email immediately and
>> highlight the error to the sender and then immediately delete the message.
>>
>
>


Re: Amazon Time Sync Service + ntpd vs chrony

2018-03-08 Thread Dor Laor
While NTP on the servers is important, make sure that you use client
timestamps and not server timestamps. Since the last write wins, the data
generator should be the one setting the timestamp.

On Thu, Mar 8, 2018 at 2:12 PM, Ben Slater 
wrote:

> It is important to make sure you are using the same NTP servers across
> your cluster - we used to see relatively frequent NTP issues across our
> fleet using default/public NTP servers until (back in 2015) we implemented
> our own NTP pool (see https://www.instaclustr.com/apache-cassandra-
> synchronization/ which references some really good and detailed posts
> from logentries.com on the potential issues).
>
> Cheers
> Ben
>
> On Fri, 9 Mar 2018 at 02:07 Michael Shuler  wrote:
>
>> As long as your nodes are syncing time using the same method, that
>> should be good. Don't mix daemons, however, since they may sync from
>> different sources. Whether you use ntpd, openntp, ntpsec, chrony isn't
>> really important, since they are all just background daemons to sync the
>> system clock. There is nothing Cassandra-specific.
>>
>> --
>> Kind regards,
>> Michael
>>
>> On 03/08/2018 04:15 AM, Kyrylo Lebediev wrote:
>> > Hi!
>> >
>> > Recently Amazon announced launch of Amazon Time Sync Service
>> > (https://aws.amazon.com/blogs/aws/keeping-time-with-amazon-
>> time-sync-service/)
>> > and now it's AWS-recommended way for time sync on EC2 instances
>> > (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/set-time.html).
>> > It's stated there that chrony is faster / more precise than ntpd.
>> >
>> > Nothing to say correct time sync configuration is very important for any
>> > C* setup.
>> >
>> > Does anybody have positive experience using crony, Amazon Time Sync
>> > Service with Cassandra and/or combination of them?
>> > Any concerns regarding chrony + Amazon Time Sync Service + Cassandra?
>> > Are there any chrony best-practices/custom settings for C* setups?
>> >
>> > Thanks,
>> > Kyrill
>> >
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>> --
>
>
> *Ben Slater*
>
> *Chief Product Officer *
>
>    
>
>
> Read our latest technical blog posts here
> .
>
> This email has been sent on behalf of Instaclustr Pty. Limited (Australia)
> and Instaclustr Inc (USA).
>
> This email and any attachments may contain confidential and legally
> privileged information.  If you are not the intended recipient, do not copy
> or disclose its content, but please reply to this email immediately and
> highlight the error to the sender and then immediately delete the message.
>


Re: GDPR, Right to Be Forgotten, and Cassandra

2018-02-09 Thread Dor Laor
I think you're introducing a layer violation. GDPR is a business requirement
and compaction is an implementation detail.

IMHO it's enough to delete the partition using regular CQL.
It's true that it won't be deleted immediately, but it will eventually be
deleted (welcome to eventual consistency ;).

Even with user-defined compaction, compaction may not run instantly, repair
will be required, there are other nodes in the cluster, maybe partitioned
nodes holding the data, and there is data in snapshots and backups.

The business goal is to delete the data within a fast, reasonable time for
humans: make it unreachable first and delete it completely later.
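
For illustration, the deletion itself is just a plain partition-level delete
(the table and column names are made up); what takes longer is the physical
purge, which happens after gc_grace_seconds once compaction rewrites the
relevant sstables, plus whatever cleanup your snapshots and backups need:

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("ks")
    forget_user = session.prepare("DELETE FROM user_data WHERE user_id = ?")

    def handle_erasure_request(user_id):
        # The partition becomes unreadable right away (a tombstone shadows it);
        # the on-disk copies are removed later by compaction.
        session.execute(forget_user, (user_id,))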

On Fri, Feb 9, 2018 at 8:51 AM, Jonathan Haddad  wrote:

> That might be fine for a one off but is totally impractical at scale or
> when using TWCS.
> On Fri, Feb 9, 2018 at 8:39 AM DuyHai Doan  wrote:
>
>> Or use the new user-defined compaction option recently introduced,
>> provided you can determine over which SSTables a partition is spread
>>
>> On Fri, Feb 9, 2018 at 5:23 PM, Jon Haddad  wrote:
>>
>>> Give this a read through:
>>>
>>> https://github.com/protectwise/cassandra-util/tree/master/deleting-
>>> compaction-strategy
>>>
>>> Basically you write your own logic for how stuff gets forgotten, then
>>> you can recompact every sstable with upgradesstables -a.
>>>
>>> Jon
>>>
>>>
>>> On Feb 9, 2018, at 8:10 AM, Nicolas Guyomar 
>>> wrote:
>>>
>>> Hi everyone,
>>>
>>> Because of GDPR we really face the need to support “Right to Be
>>> Forgotten” requests => https://gdpr-info.eu/art-17-gdpr/  stating that *"the
>>> controller shall have the obligation to erase personal data without undue
>>> delay"*
>>>
>>> Because I usually meet customers that do not have that much clients,
>>> modeling one partition per client is almost always possible, easing
>>> deletion by partition key.
>>>
>>> Then, appart from triggering a manual compaction on impacted tables
>>> using STCS, I do not see how I can be GDPR compliant.
>>>
>>> I'm kind of surprised not to find any thread on that matter on the ML,
>>> do you guys have any modeling strategy that would make it easier to get rid
>>> of data ?
>>>
>>> Thank you for any given advice
>>>
>>> Nicolas
>>>
>>>
>>>
>>


Re: Too many open files

2018-01-22 Thread Dor Laor
It's a high number; your compaction may be running behind and thus many
small sstables exist. However, you're also counting network connections in
that number (everything in *nix is a file). If it makes you feel better, my
laptop has 40k open files just for Chrome...

On Sun, Jan 21, 2018 at 11:59 PM, Andreou, Arys (Nokia - GR/Athens) <
arys.andr...@nokia.com> wrote:

> Hi,
>
>
>
> I keep getting a “Last error: Too many open files” followed by a list of
> node IPs.
>
> The output of “lsof -n|grep java|wc -l” is about 674970 on each node.
>
>
>
> What is a normal number of open files?
>
>
>
> Thank you.
>
>
>


Re: Meltdown/Spectre Linux patch - Performance impact on Cassandra?

2018-01-09 Thread Dor Laor
Hard to tell from the first 10 Google search results which Intel CPUs have
it, so I went and asked my /proc/cpuinfo; it turns out my >1-year-old Dell
XPS laptop has it. AWS's i3 has it too.

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb
rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology
nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor
ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic
movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm
3dnowprefetch cpuid_fault epb invpcid_single pti intel_pt tpr_shadow vnmi
flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid
mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida
arat pln pts hwp hwp_notify hwp_act_window hwp_epp


On Tue, Jan 9, 2018 at 11:19 PM, daemeon reiydelle <daeme...@gmail.com>
wrote:

> Good luck with that. Pcid out since mid 2017 as I recall?
>
>
> Daemeon (Dæmœn) Reiydelle
> USA 1.415.501.0198 <(415)%20501-0198>
>
> On Jan 9, 2018 10:31 AM, "Dor Laor" <d...@scylladb.com> wrote:
>
> Make sure you pick instances with PCID cpu capability, their TLB overhead
> flush
> overhead is much smaller
>
> On Tue, Jan 9, 2018 at 2:04 AM, Steinmaurer, Thomas <
> thomas.steinmau...@dynatrace.com> wrote:
>
>> Quick follow up.
>>
>>
>>
>> Others in AWS reporting/seeing something similar, e.g.:
>> https://twitter.com/BenBromhead/status/950245250504601600
>>
>>
>>
>> So, while we have seen an relative CPU increase of ~ 50% since Jan 4,
>> 2018, we now also have applied a kernel update at OS/VM level on a single
>> node (loadtest and not production though), thus more or less double patched
>> now. Additional CPU impact by OS/VM level kernel patching is more or less 
>> negligible,
>> so looks highly Hypervisor related.
>>
>>
>>
>> Regards,
>>
>> Thomas
>>
>>
>>
>> *From:* Steinmaurer, Thomas [mailto:thomas.steinmau...@dynatrace.com]
>> *Sent:* Freitag, 05. Jänner 2018 12:09
>> *To:* user@cassandra.apache.org
>> *Subject:* Meltdown/Spectre Linux patch - Performance impact on
>> Cassandra?
>>
>>
>>
>> Hello,
>>
>>
>>
>> has anybody already some experience/results if a patched Linux kernel
>> regarding Meltdown/Spectre is affecting performance of Cassandra negatively?
>>
>>
>>
>> In production, all nodes running in AWS with m4.xlarge, we see up to a
>> 50% relative (e.g. AVG CPU from 40% => 60%) CPU increase since Jan 4, 2018,
>> most likely correlating with Amazon finished patching the underlying
>> Hypervisor infrastructure …
>>
>>
>>
>> Anybody else seeing a similar CPU increase?
>>
>>
>>
>> Thanks,
>>
>> Thomas
>>
>>
>>
>> The contents of this e-mail are intended for the named addressee only. It
>> contains information that may be confidential. Unless you are the named
>> addressee or an authorized designee, you may not copy or use it, or
>> disclose it to anyone else. If you received it in error please notify us
>> immediately and then destroy it. Dynatrace Austria GmbH (registration
>> number FN 91482h) is a company registered in Linz whose registered office
>> is at 4040 Linz, Austria, Freistädterstraße 313
>> <https://maps.google.com/?q=4040+Linz,+Austria,+Freist%C3%A4dterstra%C3%9Fe+313=gmail=g>
>> The contents of this e-mail are intended for the named addressee only. It
>> contains information that may be confidential. Unless you are the named
>> addressee or an authorized designee, you may not copy or use it, or
>> disclose it to anyone else. If you received it in error please notify us
>> immediately and then destroy it. Dynatrace Austria GmbH (registration
>> number FN 91482h) is a company registered in Linz whose registered office
>> is at 4040 Linz, Austria, Freistädterstraße 313
>> <https://maps.google.com/?q=4040+Linz,+Austria,+Freist%C3%A4dterstra%C3%9Fe+313=gmail=g>
>>
>
>
>


Re: Meltdown/Spectre Linux patch - Performance impact on Cassandra?

2018-01-09 Thread Dor Laor
Make sure you pick instances with the PCID CPU capability; their TLB flush
overhead is much smaller.

On Tue, Jan 9, 2018 at 2:04 AM, Steinmaurer, Thomas <
thomas.steinmau...@dynatrace.com> wrote:

> Quick follow up.
>
>
>
> Others in AWS reporting/seeing something similar, e.g.:
> https://twitter.com/BenBromhead/status/950245250504601600
>
>
>
> So, while we have seen an relative CPU increase of ~ 50% since Jan 4,
> 2018, we now also have applied a kernel update at OS/VM level on a single
> node (loadtest and not production though), thus more or less double patched
> now. Additional CPU impact by OS/VM level kernel patching is more or less 
> negligible,
> so looks highly Hypervisor related.
>
>
>
> Regards,
>
> Thomas
>
>
>
> *From:* Steinmaurer, Thomas [mailto:thomas.steinmau...@dynatrace.com]
> *Sent:* Freitag, 05. Jänner 2018 12:09
> *To:* user@cassandra.apache.org
> *Subject:* Meltdown/Spectre Linux patch - Performance impact on Cassandra?
>
>
>
> Hello,
>
>
>
> has anybody already some experience/results if a patched Linux kernel
> regarding Meltdown/Spectre is affecting performance of Cassandra negatively?
>
>
>
> In production, all nodes running in AWS with m4.xlarge, we see up to a 50%
> relative (e.g. AVG CPU from 40% => 60%) CPU increase since Jan 4, 2018,
> most likely correlating with Amazon finished patching the underlying
> Hypervisor infrastructure …
>
>
>
> Anybody else seeing a similar CPU increase?
>
>
>
> Thanks,
>
> Thomas
>
>
>
> The contents of this e-mail are intended for the named addressee only. It
> contains information that may be confidential. Unless you are the named
> addressee or an authorized designee, you may not copy or use it, or
> disclose it to anyone else. If you received it in error please notify us
> immediately and then destroy it. Dynatrace Austria GmbH (registration
> number FN 91482h) is a company registered in Linz whose registered office
> is at 4040 Linz, Austria, Freistädterstraße 313
> 
> The contents of this e-mail are intended for the named addressee only. It
> contains information that may be confidential. Unless you are the named
> addressee or an authorized designee, you may not copy or use it, or
> disclose it to anyone else. If you received it in error please notify us
> immediately and then destroy it. Dynatrace Austria GmbH (registration
> number FN 91482h) is a company registered in Linz whose registered office
> is at 4040 Linz, Austria, Freistädterstraße 313
> 
>


Re: Full table scan with cassandra

2017-08-17 Thread Dor Laor
On Thu, Aug 17, 2017 at 9:36 AM, Alex Kotelnikov <
alex.kotelni...@diginetica.com> wrote:

> Dor,
>
> I believe, I tried it in many ways and the result is quite disappointing.
> I've run my scans on 3 different clusters, one of which was using on VMs
> and I was able to scale it up and down (3-5-7 VMs, 8 to 24 cores) to see,
> how this affects the performance.
>
> I also generated the flow from spark cluster ranging from 4 to 40 parallel
> tasks as well as just multi-threaded client.
>
> The surprise is that trivial fetch of all records using token ranges takes
> pretty much the same time in all setups.
>
> The only beneficial thing I've learned is that it is much more efficient
> to create a MATERIALIZED VIEW than to filter (even using secondary index).
>
> Say, I have a typical dataset, around 3Gb of data, 1M records. And I have
> a trivial scan practice:
>
> String.format("SELECT token(user_id), user_id, events FROM user_events
> WHERE token(user_id) >= %d ", start) + (end != null ? String.format(" AND
> token(user_id) < %d ", end) : "")
>

Is user_id the primary key? It looks like this query will just go to the
cluster and hit a random coordinator each time.
C* doesn't store subsequent tokens on the same node; the keys are hashed.
The idea of a parallel cluster scan is to go directly to all nodes in
parallel and query each of them for the hashed keys it owns.

> I split all tokens into start-end ranges (except for last range, which
> only has start) and query ranges in multiple threads, up to 40.
>
> Whole process takes ~40s on 3 VMs cluster  2+2+4 cores, 16Gb RAM each 1
> virtual disk. And it takes ~30s on real hardware clusters
> 8servers*8cores*32Gb. Level of the concurrency does not matter pretty much
> at all. Util it is too high or too low.
> Size of tokens range matters, but here I see the rule "make it larger, but
> avoid cassandra timeouts".
> I also tried spark connector to validate that my test multithreaded app is
> not the bottleneck. It is not.
>
> I expected some kind of elasticity, I see none. Feels like I do something
> wrong...
>
>
>
> On 17 August 2017 at 00:19, Dor Laor <d...@scylladb.com> wrote:
>
>> Hi Alex,
>>
>> You probably didn't get the paralelism right. Serial scan has
>> a paralelism of one. If the paralelism isn't large enough, perf will be
>> slow.
>> If paralelism is too large, Cassandra and the disk will trash and have too
>> many context switches.
>>
>> So you need to find your cluster's sweet spot. We documented the procedure
>> to do it in this blog: http://www.scylladb.com/
>> 2017/02/13/efficient-full-table-scans-with-scylla-1-6/
>> and the results are here: http://www.scylladb.com/
>> 2017/03/28/parallel-efficient-full-table-scan-scylla/
>> The algorithm should translate to Cassandra but you'll have to use
>> different rules of the thumb.
>>
>> Best,
>> Dor
>>
>>
>> On Wed, Aug 16, 2017 at 9:50 AM, Alex Kotelnikov <
>> alex.kotelni...@diginetica.com> wrote:
>>
>>> Hey,
>>>
>>> we are trying Cassandra as an alternative for storage huge stream of
>>> data coming from our customers.
>>>
>>> Storing works quite fine, and I started to validate how retrieval does.
>>> We have two types of that: fetching specific records and bulk retrieval for
>>> general analysis.
>>> Fetching single record works like charm. But it is not so with bulk
>>> fetch.
>>>
>>> With a moderately small table of ~2 million records, ~10Gb raw data I
>>> observed very slow operation (using token(partition key) ranges). It takes
>>> minutes to perform full retrieval. We tried a couple of configurations
>>> using virtual machines, real hardware and overall looks like it is not
>>> possible to all table data in a reasonable time (by reasonable I mean that
>>> since we have 1Gbit network 10Gb can be transferred in a couple of minutes
>>> from one server to another and when we have 10+ cassandra servers and 10+
>>> spark executors total time should be even smaller).
>>>
>>> I tried datastax spark connector. Also I wrote a simple test case using
>>> datastax java driver and see how fetch of 10k records takes ~10s so I
>>> assume that "sequential" scan will take 200x more time, equals ~30 minutes.
>>>
>>> May be we are totally wrong trying to use Cassandra this way?
>>>
>>> --
>>>
>>> Best Regards,
>>>
>>>
>>> *Alexander Kotelnikov*
>>>
>>> *Team Lead*
>>>
>>> DIGINETICA
>>> Retail Technology Company
>>>
>>> m: +7.921.915.06.28 <+7%20921%20915-06-28>
>>>
>>> *www.diginetica.com <http://www.diginetica.com/>*
>>>
>>
>>
>
>
> --
>
> Best Regards,
>
>
> *Alexander Kotelnikov*
>
> *Team Lead*
>
> DIGINETICA
> Retail Technology Company
>
> m: +7.921.915.06.28 <+7%20921%20915-06-28>
>
> *www.diginetica.com <http://www.diginetica.com/>*
>


Re: Full table scan with cassandra

2017-08-16 Thread Dor Laor
Hi Alex,

You probably didn't get the paralelism right. Serial scan has
a paralelism of one. If the paralelism isn't large enough, perf will be
slow.
If paralelism is too large, Cassandra and the disk will trash and have too
many context switches.

So you need to find your cluster's sweet spot. We documented the procedure
to do it in this blog:
http://www.scylladb.com/2017/02/13/efficient-full-table-scans-with-scylla-1-6/
and the results are here:
http://www.scylladb.com/2017/03/28/parallel-efficient-full-table-scan-scylla/
The algorithm should translate to Cassandra, but you'll have to use
different rules of thumb.
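
To make the idea concrete, here is a rough, untested sketch of such a parallel
scan using the DataStax Java driver (3.x style). The keyspace, table, column
names and the pool size are placeholders you would adapt; the driver's metadata
hands you the token ranges, so each worker reads one contiguous slice of the ring:

    import com.datastax.driver.core.*;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ParallelScan {
        public static void main(String[] args) throws Exception {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();
            // "ks", "tbl" and "pk" are placeholders for your keyspace, table
            // and partition key column.
            PreparedStatement ps = session.prepare(
                "SELECT * FROM ks.tbl WHERE token(pk) > ? AND token(pk) <= ?");

            ExecutorService pool = Executors.newFixedThreadPool(32); // the knob to tune
            for (TokenRange range : cluster.getMetadata().getTokenRanges()) {
                for (TokenRange sub : range.unwrap()) {   // splits the wrap-around range
                    pool.submit(() -> {
                        ResultSet rs = session.execute(ps.bind(
                            sub.getStart().getValue(), sub.getEnd().getValue()));
                        for (Row row : rs) {
                            // process the row
                        }
                    });
                }
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            cluster.close();
        }
    }

You can also group the ranges by replica and route each range's query to a node
that owns it, which spreads the load evenly across the cluster.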

Best,
Dor


On Wed, Aug 16, 2017 at 9:50 AM, Alex Kotelnikov <
alex.kotelni...@diginetica.com> wrote:

> Hey,
>
> we are trying Cassandra as an alternative for storing a huge stream of data
> coming from our customers.
>
> Storing works quite fine, and I started to validate how retrieval does. We
> have two types of that: fetching specific records and bulk retrieval for
> general analysis.
> Fetching single record works like charm. But it is not so with bulk fetch.
>
> With a moderately small table of ~2 million records, ~10Gb raw data I
> observed very slow operation (using token(partition key) ranges). It takes
> minutes to perform full retrieval. We tried a couple of configurations
> using virtual machines and real hardware, and overall it looks like it is not
> possible to read all table data in a reasonable time (by reasonable I mean that
> since we have 1Gbit network 10Gb can be transferred in a couple of minutes
> from one server to another and when we have 10+ cassandra servers and 10+
> spark executors total time should be even smaller).
>
> I tried datastax spark connector. Also I wrote a simple test case using
> datastax java driver and see how fetch of 10k records takes ~10s so I
> assume that "sequential" scan will take 200x more time, equals ~30 minutes.
>
> May be we are totally wrong trying to use Cassandra this way?
>
> --
>
> Best Regards,
>
>
> *Alexander Kotelnikov*
>
> *Team Lead*
>
> DIGINETICA
> Retail Technology Company
>
> m: +7.921.915.06.28 <+7%20921%20915-06-28>
>
> *www.diginetica.com *
>


Re: EC2 instance recommendations

2017-05-23 Thread Dor Laor
Note that EBS durability isn't perfect; you cannot rely on it entirely:
https://aws.amazon.com/ebs/details/
"Amazon EBS volumes are designed for an annual failure rate (AFR) of
between 0.1% - 0.2%, where failure refers to a complete or partial loss of
the volume, depending on the size and performance of the volume. This makes
EBS volumes 20 times more reliable than typical commodity disk drives"
It's rare, but once in a while we get an email notification that one of our
EBS drives was corrupted.
In this case a combined write-through cache and EBS is certainly better
than EBS without a backup.



On Tue, May 23, 2017 at 10:40 AM, Jonathan Haddad  wrote:

> Another option that I like the idea of but never see used unfortunately is
> using ZFS, with EBS for storage and the SSD ephemeral drive as L2Arc.
> You'd get the performance of ephemeral storage with all the features of
> EBS.  Something to consider.
>
> On Tue, May 23, 2017 at 10:30 AM Gopal, Dhruva 
> wrote:
>
>> Thanks! So, I assume that as long we make sure we never explicitly
>> “shutdown” the instance, we are good. Are you also saying we won’t be able
>> to snapshot a directory with ephemeral storage and that is why EBS is
>> better? We’re just finding that to get a reasonable amount of IOPS (gp2)
>> out of EBS at a reasonable rate, it gets more expensive than an I3.
>>
>>
>>
>> *From: *Jonathan Haddad 
>>
>>
>> *Date: *Tuesday, May 23, 2017 at 9:42 AM
>> *To: *"Gopal, Dhruva" , Matija Gobec <
>> matija0...@gmail.com>, Bhuvan Rawal 
>>
>> *Cc: *"user@cassandra.apache.org" 
>>
>>
>> *Subject: *Re: EC2 instance recommendations
>>
>>
>>
>> > Oh, so all the data is lost if the instance is shutdown or restarted
>> (for that instance)?
>>
>>
>>
>> When you restart the OS, you're technically not shutting down the
>> instance.  As long as the instance isn't stopped / terminated, your data is
>> fine.  I ran my databases on ephemeral storage for years without issue.  In
>> general, ephemeral storage is going to give you lower latency since there's
>> no network overhead.  EBS is generally cheaper than ephemeral, is
>> persistent, and you can take snapshots easily.
>>
>>
>>
>> On Tue, May 23, 2017 at 9:35 AM Gopal, Dhruva 
>> wrote:
>>
>> Oh, so all the data is lost if the instance is shutdown or restarted (for
>> that instance)? If we take a naïve approach to backing up the directory,
>> and restoring it, if we ever have to bring down the instance and back up,
>> will that work as a strategy? Data is only kept around for 2 days and is
>> TTL’d after.
>>
>>
>>
>> *From: *Matija Gobec 
>> *Date: *Tuesday, May 23, 2017 at 8:15 AM
>> *To: *Bhuvan Rawal 
>> *Cc: *"Gopal, Dhruva" , "
>> user@cassandra.apache.org" 
>> *Subject: *Re: EC2 instance recommendations
>>
>>
>>
>> We are running on I3s since they came out. NVMe SSDs are really fast and
>> I managed to push them to 75k IOPs.
>>
>> As Bhuvan mentioned the i3 storage is ephemeral. If you can work around
>> it and plan for failure recovery you are good to go.
>>
>>
>>
>> I ran Cassandra on m4s before and had no problems with EBS volumes (gp2)
>> even in low latency use cases. With the cost of M4 instances and EBS
>> volumes that make sense in IOPs, I would recommend going with more i3s and
>> working around the ephemeral issue (if its an issue).
>>
>>
>>
>> Best,
>>
>> Matija
>>
>> On Tue, May 23, 2017 at 2:13 AM, Bhuvan Rawal 
>> wrote:
>>
>> i3 instances will undoubtedly give you more meat for buck - easily 40K+
>> iops whereas on the other hand EBS maxes out at 20K PIOPS which is highly
>> expensive (at times they can cost you significantly more than cost of
>> instance).
>>
>> But they have ephemeral local storage and data is lost once instance is
>> stopped, you need to be prudent in case of i series, it is generally used
>> for large persistent caches.
>>
>>
>>
>> Regards,
>>
>> Bhuvan
>>
>> On Tue, May 23, 2017 at 4:55 AM, Gopal, Dhruva 
>> wrote:
>>
>> Hi –
>>
>>   We’ve been running M4.2xlarge EC2 instances with 2-3 TB of storage and
>> have been comparing this to I-3.2xlarge, which seems more cost effective
>> when dealing with this amount of storage and from an IOPS perspective. Does
>> anyone have any recommendations/ on the I-3s and how it performs overall,
>> compared to the M4 equivalent? On the surface, without us having taken it
>> through its paces performance-wise, it does seem to be pretty powerful. We
>> just ran through an exercise with a RAIDed 200 TB volume (as opposed to a
>> non RAIDed 3 TB volume) and were seeing a 20-30% improvement with the
>> RAIDed setup, on a 6 node Cassandra ring. Just looking for any
>> feedback/experience folks may have had with the I-3s.
>>
>>
>>
>> Regards,
>>
>> 

Re: Bootstraping a Node With a Newer Version

2017-05-17 Thread Dor Laor
We've done such an in-place upgrade in the past, but not for a real production system.

However, you're MISSING the point. The root filesystem, along with the entire
OS, should be completely separated from your data directories. It should reside
in a different logical volume, and thus you can easily change the OS while not
changing the data volume. Not to mention that there are fancier options like
snapshotting the data volume and thus having zero risk.

Happy LVMing.
Dor

On Wed, May 17, 2017 at 12:51 AM, Shalom Sagges 
wrote:

> Our DevOPS team told me that their policy is not to perform major kernel
> upgrades but simply install a clean new version.
> I also checked online and found a lot of recommendations *not *to do so
> as there might be a lot of dependencies issues that may affect processes
> such as yum.
> e.g.
> https://www.centos.org/forums/viewtopic.php?t=53678
> "The upgrade from CentOS 6 to 7 is a process that is fraught with danger
> and very very untested. Almost no-one succeeds without extreme effort. The
> CentOS wiki page about it has a big fat warning saying "Do not do this". If
> at all possible you should do a parallel install, migrate your data, apps
> and settings to the new box and decommission the old one.
>
> The problem comes about because there are a large number of packages in
> el6 that already have a higher version number than those in el7. This means
> that the el6 packages take precedence in the update and there are quite a
> few orphans left behind and these break little things like yum. For
> example, one that I know about is openldap which is
> openldap-2.4.40-5.el6.x86_64 and openldap-2.4.39-6.el7.x86_64 so the el6
> package is seen as newer than the el7 one. Anything that's linked against
> openldap (a *lot*) now will not function until that package is replaced
> with its el7 equivalent, The easiest way to do this would be to yum
> downgrade openldap but, ooops, one of the things that needs openldap is
> yum so it doesn't work."
>
>
> I've also checked the Centos Wiki page and found the same recommendation:
> https://wiki.centos.org/FAQ/General?highlight=%28upgrade%
> 29%7C%28to%29%7C%28centos7%29#head-3ac1bdb51f0fecde1f98142cef90e8
> 87b1b12a00 :
>
> *"Upgrades in place are not supported nor recommended by CentOS or TUV. A
> backup followed by a fresh install is the only recommended upgrade path.
> See the Migration Guide for more information."*
>
>
> Since I have around twenty 2TB nodes in each DC (2 DCs in 6 different
> farms) and I don't want it to take forever, perhaps the best way would be
> to either leave it with Centos 6 and install Python 2.7 (I understand
> that's not so user friendly) or perform the backup recommendations shown on
> the Centos page (which sounds extremely agonizing as well).
>
> What do you think?
>
> Thanks!
>
>
> Shalom Sagges
> DBA
> T: +972-74-700-4035 <074-700-4035>
>  
>  We Create Meaningful Connections
>
>
>
> On Tue, May 16, 2017 at 6:48 PM, daemeon reiydelle 
> wrote:
>
>> What makes you think you cannot upgrade the kernel?
>>
>> “All men dream, but not equally. Those who dream by night in the dusty
>> recesses of their minds wake up in the day to find it was vanity, but the
>> dreamers of the day are dangerous men, for they may act their dreams with
>> open eyes, to make it possible.” — T.E. Lawrence
>>
>> sent from my mobile
>> Daemeon Reiydelle
>> skype daemeon.c.m.reiydelle
>> USA 415.501.0198 <(415)%20501-0198>
>>
>> On May 16, 2017 5:27 AM, "Shalom Sagges"  wrote:
>>
>>> Hi All,
>>>
>>> Hypothetically speaking, let's say I want to upgrade my Cassandra
>>> cluster, but I also want to perform a major upgrade to the kernel of all
>>> nodes.
>>> In order to upgrade the kernel, I need to reinstall the server, hence
>>> lose all data on the node.
>>>
>>> My question is this, after reinstalling the server with the new kernel,
>>> can I first install the upgraded Cassandra version and then bootstrap it to
>>> the cluster?
>>>
>>> Since there's already no data on the node, I wish to skip the agonizing
>>> sstable upgrade process.
>>>
>>> Does anyone know if this is doable?
>>>
>>> Thanks!
>>>
>>>
>>>
>>> Shalom Sagges
>>> DBA
>>> T: +972-74-700-4035 <+972%2074-700-4035>
>>>  
>>>  We Create Meaningful Connections
>>>
>>>
>>>
>>> This message may contain confidential and/or privileged information.
>>> If you are not the addressee or authorized to receive this on behalf of
>>> the addressee you must not use, copy, disclose or take action based on this
>>> message or any information herein.
>>> If you have received this message in error, please advise the sender
>>> immediately by reply email and delete this message. Thank you.
>>>
>>
>

Re: scylladb

2017-03-14 Thread Dor Laor
On Tue, Mar 14, 2017 at 7:43 AM, Eric Evans 
wrote:

> On Sun, Mar 12, 2017 at 4:01 PM, James Carman
>  wrote:
> > Does all of this Scylla talk really even belong on the Cassandra user
> > mailing list in the first place?
>
> I personally found it interesting, informative, and on-topic when it
> was about justification of the 10x performance claim, numa,
> scheduling, concurrency, etc.  At some point it got a little sales-y
> on the one side, and caremad on the other.
>

I tried to just provide accurate answers. There is a meme on twitter that
matches it:

Pi Day is just a fake holiday created by math companies to sell more math.

It's that good so I couldn't resist.


> ¯\_(ツ)_/¯
>
>
> --
> Eric Evans
> john.eric.ev...@gmail.com
>


Re: scylladb

2017-03-13 Thread Dor Laor
On Mon, Mar 13, 2017 at 12:17 AM, benjamin roth <brs...@gmail.com> wrote:

> @Dor,Jeff:
>
> I think Jeff pointed out an important fact: You cannot stop CS, swap
> binaries and start Scylla. To be honest that was AFAIR the only "Oooh :(" I
> had when reading the Scylla "marketing material".
>

If you're on 2.1.x you can. You will need to stop your running cluster and
many users cannot do that.
If you do have the ability to sustain downtime, take a snapshot so that your
data is safe.
There is a way to do it online, explained below.


>
> If that worked it would be very valuable from both Scylla's and a users'
> point of view. As a user I would love to give scylla a try as soon as it
> provides all the features my application requires. But the hurdle is quite
> high. I have to create a separate scylla cluster and I have to migrate a
> lot of data and I have to manage somehow that my application can use (r+w)
> both CS + Scylla at the same time to not run any risk of data loss or dead
> end road if something goes wrong. And still: I would not be able to
>

So there are two options to do double writes: one is to do it on the client
side and send the writes to the two separate clusters.
Another option is a small CQL-proxy Go application we have. You install it
on all of your C* nodes and it duplicates the CQL
traffic to the remote Scylla cluster.

When you do the above, you start Scylla from a C* snapshot/backup or load the
previous data's sstables (works for 2.x and 3.x); first start the double
writes. If you have a double cluster, you have zero risk if something goes
wrong.
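
To illustrate the client-side option, here is a hand-written sketch (this is
not our cql-proxy; contact points, keyspace and table names are placeholders)
of a writer that mirrors every insert to both clusters with the DataStax Java
driver:

    import com.datastax.driver.core.*;

    public class DoubleWriter {
        private final Session oldSession;   // existing Cassandra cluster
        private final Session newSession;   // new Scylla cluster
        private final PreparedStatement oldInsert;
        private final PreparedStatement newInsert;

        public DoubleWriter() {
            // contact points, keyspace and table are placeholders
            oldSession = Cluster.builder().addContactPoint("cassandra-host").build().connect("ks");
            newSession = Cluster.builder().addContactPoint("scylla-host").build().connect("ks");
            oldInsert  = oldSession.prepare("INSERT INTO tbl (pk, val) VALUES (?, ?)");
            newInsert  = newSession.prepare("INSERT INTO tbl (pk, val) VALUES (?, ?)");
        }

        public void write(long pk, String val) {
            // fire the write at both clusters and wait for both acks
            ResultSetFuture oldF = oldSession.executeAsync(oldInsert.bind(pk, val));
            ResultSetFuture newF = newSession.executeAsync(newInsert.bind(pk, val));
            oldF.getUninterruptibly();
            newF.getUninterruptibly();
        }
    }

Reads stay on the old cluster until the historical data has been loaded and you
are confident the new cluster holds the full data set.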


> compare CS + Scylla for my workload totally fair as the conditions
> changed. New hardware, maybe partial dataset, probably only "test traffic".
>

50% of users run on AWS so it's easy to use the same conditions.
If you're on physical machines, just give us half the hardware and see we
cope with it - meaning
err on our side (as long as you don't use crappy hardware which will cap
us).

The advantage is that you can run in this mode weeks/months until you're
absolute sure things work fine.
Most of our users who came from C* did it.


> However, if I was able to just replace a single node in an existing
> cluster I'd have:
>

Everybody wanted it, but...


> 1. Superlow hurdle to give it a try: No risk, no effort
>

Yes, there is risk. We considered that and even toyed with such an
implementation.
The risk is that this node will be part of the cluster and, although it is
supposed to behave and you'll
have replicas, something may go wrong and your cluster's RPC will suffer.

The Cassandra RPC isn't documented and has changed over the releases; I bet you
cannot mix a 2.1 node with a 3.4 node
and can only mix the latest x.minor with (x+1).0.
In addition, in case there is an issue, whose fault would it be?
Lastly, we have our own internal RPC with various optimizations.


> 2. Fair comparison by comparing new node against some equally equipeed old
> node in the same cluster with the same workload
>

Even if it worked, the new node won't run faster unless CL=1, which is
rare.


> 3. Easy to make a decision if to continue or not
>
> That would be totally awesome!
>
>
> 2017-03-12 23:16 GMT+01:00 Kant Kodali <k...@peernova.com>:
>
>> I don't think ScyallDB guys started this conversation in the first place
>> to suggest or promote "drop-in replacement". It was something that is
>> brought up by one of the Cassandra users and ScyallDB guys just clarified
>> it. They are gracious enough to share the internals in detail.
>>
>> honestly, I find it weird when I see questions like whether a question
>> belongs  to a mailing list or not especially in this case. If one doesn't
>> like it they can simply not follow the thread. I am not sure what is the
>> harm here.
>>
>>
>>
>> On Sun, Mar 12, 2017 at 2:29 PM, James Carman <ja...@carmanconsulting.com
>> > wrote:
>>
>>> Well, looking back, it appears this thread is from 2015, so apparently
>>> everyone is okay with it.
>>>
>>> Promoting a value-add product that makes using Cassandra easier/more
>>> efficient/etc would be cool, but coming to the Cassandra mailing list to
>>> promote a "drop-in replacement" (use us, not Cassandra) isn't cool, IMHO.
>>>
>>>
>>> On Sun, Mar 12, 2017 at 5:04 PM Kant Kodali <k...@peernova.com> wrote:
>>>
>>> yes.
>>>
>>> On Sun, Mar 12, 2017 at 2:01 PM, James Carman <
>>> ja...@carmanconsulting.com> wrote:
>>>
>>> Does all of this Scylla talk really even belong on the Cassandra user
>>> mailing list in the first place?
>>>
>>>
>>>
>>>
>>> On Sun, Mar 12, 2017 at 4:07 

Re: scylladb

2017-03-13 Thread Dor Laor
We came to the thread to provide technical answers about whether the
difference in performance arises from
C++ only or from more than that. When the discussion included NUMA, we even dove deep
into the weeds. I think we provided enough answers and I respect all of the
opinions here, so if someone has further questions they are welcome to
ask on our mailing list or privately.

Cheers,
Dor

On Mon, Mar 13, 2017 at 12:43 AM, Dor Laor <d...@scylladb.com> wrote:

> On Mon, Mar 13, 2017 at 12:17 AM, benjamin roth <brs...@gmail.com> wrote:
>
>> @Dor,Jeff:
>>
>> I think Jeff pointed out an important fact: You cannot stop CS, swap
>> binaries and start Scylla. To be honest that was AFAIR the only "Oooh :(" I
>> had when reading the Scylla "marketing material".
>>
>
> If you're on 2.1.x you can. You will need to stop your running cluster and
> many users cannot do that.
> If you do have ability to sustain downtime, take a snapshot and thus your
> data is safe.
> There is a way to do it online, explained below
>
>
>>
>> If that worked it would be very valuable from both Scylla's and a users'
>> point of view. As a user I would love to give scylla a try as soon as it
>> provides all the features my application requires. But the hurdle is quite
>> high. I have to create a separate scylla cluster and I have to migrate a
>> lot of data and I have to manage somehow that my application can use (r+w)
>> both CS + Scylla at the same time to not run any risk of data loss or dead
>> end road if something goes wrong. And still: I would not be able to
>>
>
> So there are two options to do double writes, one is do it on the client
> side and send to the two separate clusters.
> Another option is we have a CQL-proxy small go application. You install it
> on all of your C* nodes and it duplicated the CQL
> traffic to remote Scylla cluster.
>
> When you do the above you start Scylla from a C* snapshot/backup or
> sstable load the previous data (works for 2.x and 3.y)(first start the
> double writes). If you have a double cluster, you have zero risk if
> something goes wrong.
>
>
>> compare CS + Scylla for my workload totally fair as the conditions
>> changed. New hardware, maybe partial dataset, probably only "test traffic".
>>
>
> 50% of users run on AWS so it's easy to use the same conditions.
> If you're on physical machines, just give us half the hardware and see we
> cope with it - meaning
> err on our side (as long as you don't use crappy hardware which will cap
> us).
>
> The advantage is that you can run in this mode weeks/months until you're
> absolute sure things work fine.
> Most of our users who came from C* did it.
>
>
>> However, if I was able to just replace a single node in an existing
>> cluster I'd have:
>>
>
> Everybody wanted it but
>
>
>> 1. Superlow hurdle to give it a try: No risk, no effort
>>
>
> Yes there is risk. We considered that and even toyed with such an
> implementation.
> The risk is that this node will be part of the cluster and although it
> supposed to behave and you'll
> have replicas, something may go wrong and your cluster's RPC will suffer.
>
> The Cassandra RPC wasn't documented as changed over the releases, I bet
> you cannot mix a 2.1 with 3.4
> and can only do latest x.minor -> (x+1).0
> In addition, in case there is an issue, whose fault would it be?
> Lastly, we have our own internal RPC with various optimizations.
>
>
>> 2. Fair comparison by comparing new node against some equally equipeed
>> old node in the same cluster with the same workload
>>
>
> Even if it would work, the new node won't run faster unless CL=1 which is
> rare.
>
>
>> 3. Easy to make a decision if to continue or not
>>
>> That would be totally awesome!
>>
>>
>> 2017-03-12 23:16 GMT+01:00 Kant Kodali <k...@peernova.com>:
>>
>>> I don't think ScyallDB guys started this conversation in the first place
>>> to suggest or promote "drop-in replacement". It was something that is
>>> brought up by one of the Cassandra users and ScyallDB guys just clarified
>>> it. They are gracious enough to share the internals in detail.
>>>
>>> honestly, I find it weird when I see questions like whether a question
>>> belongs  to a mailing list or not especially in this case. If one doesn't
>>> like it they can simply not follow the thread. I am not sure what is the
>>> harm here.
>>>
>>>
>>>
>>> On Sun, Mar 12, 2017 at 2:29 PM, James Carman <
>>> ja...@carmanconsulting.com> wrote:

Re: scylladb

2017-03-12 Thread Dor Laor
On Sun, Mar 12, 2017 at 12:11 PM, Edward Capriolo <edlinuxg...@gmail.com>
wrote:

> The simple claim that "Scylla IS a drop in replacement for C*" shows that
> they clearly don't know as much as they think they do.
>
> Even if it did supposedly "support everything" it would not actually work
> like that. For example, some things in Cassandra work "the way they work" .
> They are not specifically defined in a unit test or a document that
> describes how they are supposed to work. During a massive code port you can
> not reason if the code still works the same way in all situations.
>
> Example, without using SEDA and using something else it definitely wont
> work the same way when the thread pools fill up and it starts blocking,
> dropping, whatever. There is so much implicitly undefined behavior.
>

According to your definition there is no such thing as a drop-in
replacement, is there?

One of our users asked us to add a protocol verb that identifies Scylla as
Scylla so they'll know which
is which while they run 2 clusters.

Look, if we claim we have all the features and someone checks and sees we
don't have LWT, it makes us look like a bad service. Usually when we get
someone (specific) interested, we map their C* usage and say which features
aren't there yet. So far it's just the lack of those not-yet-implemented
features that holds users back. We do try to mimic the exact behaviour of C*.

Clearly, I can't defend a 100% drop-in replacement. Once we implement
someone's selected
feature set, then we're a drop-in replacement for them and we're not a good
match for others.
We're not after quick wins, quite the opposite.


> Also just for argument sake. YCSB proves nothing. Nothing. It generates
> key-value data, and well frankly that is not the primary use case of
> Cassandra. So again. Know what you don't know.
>
>
a. We do not pretend we know it all.
We do have 3 years of mileage with Cassandra and 2.5 with Scylla and we
gained some knowledge... Before we decided to go down the C* path, we
considered reimplementing Mongo, HDFS, Kafka and a few more, and the fact we
chose C* shows our appreciation for this project, not the opposite.

b. YCSB is an industry standard, and that's why everybody uses it.
We don't like it at all since it doesn't have prepared statements (it's
time someone merged this support).
It's not a plain K/V workload since it's a table of 10 columns of 100b each.
We do support wide rows and learned (the hard way) their challenges,
especially with compaction, repair and streaming. The current Scylla code
doesn't cache wide rows beyond 10MB, which isn't ideal. In 1.8 (next month)
we have partial row caching which is supposed to be very good. During the
past 20 months since our beta we tried to focus on a good out-of-the-box
experience for all real workloads, and we knowingly deferred features like
LWT since we wanted a good solid base before we reach feature parity. If we
do a good job on a benchmark but a bad one on a real workload, we just shoot
ourselves in the foot. This was the case around our beta, but it was just a
beta. Today we think we're in a very solid position. We still have lots to
complete around repair (which is ok but not great). There is work in
progress to switch from Merkle trees to a new algorithm with reduced latency
(almost there). We have mixed feelings about anti-compaction for incremental
repair but we're likely to go down this route too.


>
>
>
> On Sun, Mar 12, 2017 at 2:15 PM, Jonathan Haddad <j...@jonhaddad.com>
> wrote:
>
>> I don't think Jeff comes across as angry.  He's simply pointing out that
>> ScyllaDB isn't a drop in replacement for Cassandra.  Saying that it is is
>> very misleading.  The marketing material should really say something like
>> "drop in replacement for some workloads" or "aims to be a drop in
>> replacement".  As is, it doesn't support everything, so it's not a drop in.
>>
>>
>> On Sat, Mar 11, 2017 at 10:34 PM Dor Laor <d...@scylladb.com> wrote:
>>
>>> On Sat, Mar 11, 2017 at 10:02 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>>>
>>>
>>>
>>> On 2017-03-10 09:57 (-0800), Rakesh Kumar wrote:
>>> > Cassanda vs Scylla is a valid comparison because they both are
>>> compatible. Scylla is a drop-in replacement for Cassandra.
>>>
>>> No, they aren't, and no, it isn't
>>>
>>>
>>> Jeff is angry with us for some reason. I don't know why, it's natural
>>> that when
>>> a new opponent there are objections and the proof lies on us.
>>> We go through great deal of doing it and we don't just throw comments
>>> without back

Re: scylladb

2017-03-12 Thread Dor Laor
On Sun, Mar 12, 2017 at 11:15 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:

> I don't think Jeff comes across as angry.  He's simply pointing out that
> ScyllaDB isn't a drop in
>

Agreed, I take it back, it wasn't due to this.


> replacement for Cassandra.  Saying that it is is very misleading.  The
> marketing material should really say something like "drop in replacement
> for some workloads" or "aims to be a drop in replacement".  As is, it
> doesn't support everything, so it's not a drop in.
>
>
When we need to describe what Scylla is in 140 characters or a one-liner, we
use "drop-in replacement". When we talk about the details, we provide the
full details, as I did above.
The code is open and we take the upstream-first approach and there is the
status page
to summarize it. If someone depends on LWT or UDF we don't have an
immediate answer.
We do have answers for the rest. The vast majority of users don't get to
use these features
and thus they can (and some did) seamlessly migrate.

For a reference sanity check, see all the databases/tools that claim SQL
ability; most of them don't comply with the ANSI standard. As you said, our
don't comply to the ANSI standard. As you said, our desire is to be 100%
compatible.

Btw, going back to the technology discussion, while there are lots of reasons
to use C++, the only
challenge is in features like UDF/triggers, which rely on JVM-based code
execution. We are likely to use Lua for it initially, and later we'll
integrate it with a JVM-based solution.



>
> On Sat, Mar 11, 2017 at 10:34 PM Dor Laor <d...@scylladb.com> wrote:
>
>> On Sat, Mar 11, 2017 at 10:02 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>>
>>
>>
>> On 2017-03-10 09:57 (-0800), Rakesh Kumar wrote:
>> > Cassanda vs Scylla is a valid comparison because they both are
>> compatible. Scylla is a drop-in replacement for Cassandra.
>>
>> No, they aren't, and no, it isn't
>>
>>
>> Jeff is angry with us for some reason. I don't know why, it's natural
>> that when
>> a new opponent there are objections and the proof lies on us.
>> We go through great deal of doing it and we don't just throw comments
>> without backing.
>>
>> Scylla IS a drop in replacement for C*. We support the same CQL (from
>> version 1.7 it's cql 3.3.1, protocol v4), the same SStable format (based on
>> 2.1.8). In 1.7 release we support cql uploader
>> from 3.x. We will support the SStable format of 3.x natively in 3 month
>> time. Soon all of the feature set will be implemented. We always have been
>> using this page (not 100% up to date, we'll update it this week):
>> http://www.scylladb.com/technology/status/
>>
>> We add a jmx-proxy daemon in java in order to make the transition as
>> smooth as possible. Almost all the nodetool commands just work, for sure
>> all the important ones.
>> Btw: we have a RESTapi and Prometheus formats, much better than the hairy
>> jmx one.
>>
>> Spark, Kairosdb, Presto and probably Titan (we add Thrift just for legacy
>> users and we don't intend
>> to decommission an api).
>>
>> Regarding benchmarks, if someone finds a flaw in them, we'll do the best
>> to fix it.
>> Let's ignore them and just here what our users have to say:
>> http://www.scylladb.com/users/
>>
>>
>>


Re: scylladb

2017-03-12 Thread Dor Laor
On Sun, Mar 12, 2017 at 6:40 AM, Stefan Podkowinski  wrote:

> If someone would create a benchmark showing that Cassandra is 10x faster
> than Aerospike, would that mean Cassandra is 100x faster than ScyllaDB?
>
> Joking aside, I personally don't pay a lot of attention to any published
> benchmarks and look at them as pure marketing material. What I'm interested
> in instead is to learn why exactly one solution is faster than the other
> and I have to say that Avi is doing a really good job explaining the design
> motivations behind ScyllaDB in his presentations.
>
> But the Aerospike comparison also has a good point by showing that you
> probably always will be able to find a solution that is faster for a
> certain work load. Therefor the most important step when looking for the
> fastest datastore, is to first really understand your work load
> characteristic. Unfortunately this is something people tend to skip and
> instead get lost in controversial benchmark discussions, which are more fun
> than thinking about your data model and talking to people about projected
> long term load. Because if you do, you might realize that those benchmark
> test scenarios (e.g. insert 1TB as fast as possible and measure compaction
> times) aren't actually that relevant for your application.
>
Agreed; however, it allows you to realize what a real workload will suffer
from, and that's why we
measured a 'read while heavily writing' result too. In addition we measured
small, medium and large datasets for read-only workloads. Still, benchmarks
are not a real workload and we always advise using our detailed Prometheus
metrics to see whether the hardware is utilized and to understand what the
bottleneck is. Scylla implemented CQL tracing and can run slow query
tracing all of the time with a low performance impact.



>
> On 03/10/2017 05:58 PM, Bhuvan Rawal wrote:
>
> Agreed C++ gives an added advantage to talk to underlying hardware with
> better efficiency, it sound good but can a pice of code written in C++ give
> 1000% throughput than a Java app? Is TPC design 10X more performant than
> SEDA arch?
>
> And if C/C++ is indeed that fast how can Aerospike (which is itself
> written in C) claim to be 10X faster than Scylla here
> http://www.aerospike.com/benchmarks/scylladb-initial/ ? (Combining your's
> and aerospike's benchmarks it appears that Aerospike is 100X performant
> than C* - I highly doubt that!! )
>
> For a moment lets forget about evaluating 2 different databases, one can
> observe 10X performance difference between a mistuned cassandra cluster and
> one thats tuned as per data model - there are so many Tunables in yaml as
> well as table configs.
>
> Idea is - in order to strengthen your claim, you need to provide complete
> system metrics (Disk, CPU, Network), the OPS increase starts to decay along
> with the configs used. Having plain ops per second and 99p latency is
> blackbox.
>
> Regards,
> Bhuvan
>
> On Fri, Mar 10, 2017 at 12:47 PM, Avi Kivity  wrote:
>
>> ScyllaDB engineer here.
>>
>> C++ is really an enabling technology here. It is directly responsible for
>> a small fraction of the gain by executing faster than Java.  But it is
>> indirectly responsible for the gain by allowing us direct control over
>> memory and threading.  Just as an example, Scylla starts by taking over
>> almost all of the machine's memory, and dynamically assigning it to
>> memtables, cache, and working memory needed to handle requests in flight.
>> Memory is statically partitioned across cores, allowing us to exploit NUMA
>> fully.  You can't do these things in Java.
>>
>> I would say the major contributors to Scylla performance are:
>>  - thread-per-core design
>>  - replacement of the page cache with a row cache
>>  - careful attention to many small details, each contributing a little,
>> but with a large overall impact
>>
>> While I'm here I can say that performance is not the only goal here, it
>> is stable and predictable performance over varying loads and during
>> maintenance operations like repair, without any special tuning.  We measure
>> the amount of CPU and I/O spent on foreground (user) and background
>> (maintenance) tasks and divide them fairly.  This work is not complete but
>> already makes operating Scylla a lot simpler.
>>
>>
>> On 03/10/2017 01:42 AM, Kant Kodali wrote:
>>
>> I dont think ScyllaDB performance is because of C++. The design decisions
>> in scylladb are indeed different from Cassandra such as getting rid of SEDA
>> and moving to TPC and so on.
>>
>> If someone thinks it is because of C++ then just show the benchmarks that
>> proves it is indeed the C++ which gave 10X performance boost as ScyllaDB
>> claims instead of stating it.
>>
>>
>> On Thu, Mar 9, 2017 at 3:22 PM, Richard L. Burton III > > wrote:
>>
>>> They spend an enormous amount of time focusing on performance. You can
>>> expect them to continue on with their optimization and keep 

Re: scylladb

2017-03-11 Thread Dor Laor
On Sat, Mar 11, 2017 at 2:19 PM, Kant Kodali  wrote:

> My response is inline.
>
> On Sat, Mar 11, 2017 at 1:43 PM, Avi Kivity  wrote:
>
>> There are several issues at play here.
>>
>> First, a database runs a large number of concurrent operations, each of
>> which only consumes a small amount of CPU. The high concurrency is need to
>> hide latency: disk latency, or the latency of contacting a remote node.
>>
>
> *Ok so you are talking about hiding I/O latency.  If all these I/O are
> non-blocking system calls then a thread per core and callback mechanism
> should suffice isn't it?*
>

In general, yes, but in practice it's more complicated.
Each such thread runs different tasks, and you need a mechanism to switch
between these tasks; this is the Seastar continuation engine in our case.
However, things get more complicated. We found that we need a CPU scheduler
which takes into account the priority
of different tasks, such as repair, compaction, streaming, read operations
and write operations.
We always prioritize foreground operations over background ones, and thus
even when we
repair TBs of data, latency is still very low (this feature is coming in
Scylla 1.8).
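
If it helps to picture it, here is a toy sketch in Java (it has nothing to do
with the actual Seastar code, and the real scheduler shares CPU between task
classes rather than strictly starving background work): one single-threaded
loop per core with two queues, where pending foreground work always runs
before background work:

    import java.util.ArrayDeque;
    import java.util.Queue;

    public class CoreShard implements Runnable {
        private final Queue<Runnable> foreground = new ArrayDeque<>(); // user reads/writes
        private final Queue<Runnable> background = new ArrayDeque<>(); // compaction, repair, ...

        synchronized void submitForeground(Runnable task) { foreground.add(task); notify(); }
        synchronized void submitBackground(Runnable task) { background.add(task); notify(); }

        public void run() {
            while (true) {
                Runnable next;
                synchronized (this) {
                    while (foreground.isEmpty() && background.isEmpty()) {
                        try { wait(); } catch (InterruptedException e) { return; }
                    }
                    // foreground wins; background only runs when no user work is pending
                    next = foreground.isEmpty() ? background.poll() : foreground.poll();
                }
                next.run(); // tasks must be short and non-blocking; long work is chopped up
            }
        }

        public static void main(String[] args) {
            int cores = Runtime.getRuntime().availableProcessors();
            for (int i = 0; i < cores; i++) {
                // one loop per core; a real implementation would also pin the thread
                new Thread(new CoreShard(), "shard-" + i).start();
            }
        }
    }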



>
>
>> This means that the scheduler will need to switch contexts very often. A
>> kernel thread scheduler knows very little about the application, so it has
>> to switch a lot of context.  A user level scheduler is tightly bound to the
>> application, so it can perform the switching faster.
>>
>
> *sure but this applies in other direction as well. A user level scheduler
> has no idea about kernel level scheduler either.  There is literally no
> coordination between kernel level scheduler and user level scheduler in
> linux or any major OS. It may be possible with OS's *
>

Correct. That's why we let the OS scheduler run just one thread per core
and bind the thread to the CPU. Inside, we do our own thing with the
Seastar scheduler and the OS doesn't know and doesn't care.

More below


> *that support scheduler activation(LWP's) and upcall mechanism. Even then
> it is hard to say if it is all worth it (The research shows performance may
> not outweigh the complexity). Golang problem is exactly this if one creates
> 1000 go routines/green threads where each of them is making a blocking
> system call then it would create 1000 kernel threads underneath because it
> has no way to know that the kernel thread is blocked (no upcall). And in
> non-blocking case I still don't even see a significant performance when
> compared to few kernel threads with callback mechanism.  If you are saying
> user level scheduling is the Future (perhaps I would just let the
> researchers argue about it) As of today that is not case else languages
> would have had it natively instead of using third party frameworks or
> libraries. *
>

That's why we do not run blocking system calls at all. We had to limit
ourselves to the XFS filesystem
only, since the others don't have proper AIO support. Recently we bypassed some
of the issues which
made EXT4 block, and it may be ok with our AIO pattern.

We even wrote a DNS implementation that doesn't block and doesn't lock (for
us, even a library that uses spin locks under the hood is bad).

Bear in mind that the whole thing is simple to run and the user doesn't
need to know anything about this complexity.
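
For readers wondering what the callback style looks like in practice, here is a
small JVM-side illustration. Note that Java's AsynchronousFileChannel usually
simulates asynchrony with a hidden thread pool rather than true kernel AIO, so
this only shows the programming model, not what Scylla/Seastar actually does:

    import java.nio.ByteBuffer;
    import java.nio.channels.AsynchronousFileChannel;
    import java.nio.channels.CompletionHandler;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class CallbackRead {
        public static void main(String[] args) throws Exception {
            AsynchronousFileChannel ch = AsynchronousFileChannel.open(
                    Paths.get(args[0]), StandardOpenOption.READ); // any readable file
            ByteBuffer buf = ByteBuffer.allocate(4096);
            ch.read(buf, 0, buf, new CompletionHandler<Integer, ByteBuffer>() {
                public void completed(Integer bytesRead, ByteBuffer b) {
                    // the continuation: runs when the read finishes;
                    // the submitting thread never blocked waiting for it
                    System.out.println("read " + bytesRead + " bytes");
                }
                public void failed(Throwable exc, ByteBuffer b) {
                    exc.printStackTrace();
                }
            });
            // the calling thread is free to pick up the next task right away
            Thread.sleep(1000); // toy only: keep the JVM alive for the callback
        }
    }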




>
>
>> There are also implications on the concurrency primitives in use (locks
>> etc.) -- they will be much faster for the user-level scheduler, because
>> they cooperate with the scheduler.  For example, no atomic
>> read-modify-write instructions need to be executed.
>>
>
>
>  Second, how many (kernel) threads should you run?* This question one
> will always have. If there are 10K user level threads that maps to only one
> kernel thread then they cannot exploit parallelism. so there is no right
> answer but a thread per core is a reasonable/good choice. *
>

+1


>
>
>> If you run too few threads, then you will not be able to saturate the CPU
>> resources.  This is a common problem with Cassandra -- it's very hard to
>> get it to consume all of the CPU power on even a moderately large machine.
>> On the other hand, if you have too many threads, you will see latency rise
>> very quickly, because kernel scheduling granularity is on the order of
>> milliseconds.  User-level scheduling, because it leaves control in the hand
>> of the application, allows you to both saturate the CPU and maintain low
>> latency.
>>
>
> F*or my workload and probably others I had seen Cassandra was always
> been CPU bound.*
>

Could be. However, try to make it CPU bound on 10 cores, 20 cores and more.
The more cores you use, the fewer nodes you need, and the overall overhead
decreases.


>
>> There are other factors, like NUMA-friendliness, but in the end it all
>> boils down to efficiency and control.
>>
>> None of this is new btw, it's pretty common in the storage world.
>>
>> Avi
>>
>>
>> On 03/11/2017 11:18 PM, 

Re: scylladb

2017-03-11 Thread Dor Laor
On Sat, Mar 11, 2017 at 10:02 PM, Jeff Jirsa  wrote:

>
>
> On 2017-03-10 09:57 (-0800), Rakesh Kumar wrote:
> > Cassanda vs Scylla is a valid comparison because they both are
> compatible. Scylla is a drop-in replacement for Cassandra.
>
> No, they aren't, and no, it isn't
>

Jeff is angry with us for some reason. I don't know why; it's natural that
when
a new opponent appears there are objections, and the burden of proof lies on us.
We go to great lengths to provide it and we don't just throw out comments
without backing.

Scylla IS a drop-in replacement for C*. We support the same CQL (from
version 1.7 it's CQL 3.3.1, protocol v4) and the same SSTable format (based on
2.1.8). In the 1.7 release we support the CQL uploader
from 3.x. We will support the SSTable format of 3.x natively in 3 months'
time. Soon all of the feature set will be implemented. We have always been
using this page (not 100% up to date, we'll update it this week):
http://www.scylladb.com/technology/status/

We added a jmx-proxy daemon in Java in order to make the transition as smooth
as possible. Almost all the nodetool commands just work, certainly all the
important ones.
Btw: we have a REST API and Prometheus formats, much better than the hairy
JMX one.

Spark, KairosDB, Presto and probably Titan work too (we added Thrift just for
legacy users and we don't intend
to decommission an API).

Regarding benchmarks, if someone finds a flaw in them, we'll do our best to
fix it.
Let's ignore them and just hear what our users have to say:
http://www.scylladb.com/users/


Re: scylladb

2017-03-10 Thread Dor Laor
On Fri, Mar 10, 2017 at 4:45 PM, Kant Kodali <k...@peernova.com> wrote:

> http://performanceterracotta.blogspot.com/2012/09/numa-java.html
> http://docs.oracle.com/javase/7/docs/technotes/guides/vm/
> performance-enhancements-7.html
> http://openjdk.java.net/jeps/163
>
>
Java can exploit NUMA but it's not as efficient as what can be done in C++.
Andrea Arcangeli is the engineer behind Linux transparent huge pages (THP);
he reported to me, and the idea belongs to Avi. We did it for KVM's sake but
it was designed for any long-running process like Cassandra.
However, the entire software stack should be aware. If you get a huge page
(2MB) but keep only 1KB in it, you waste lots of memory. On top of this,
threads need to touch their data structures and they need to be well
aligned, otherwise the memory page will bounce between the different cores.
With Cassandra it gets more complicated since there is a heap and off-heap
data.

Do programmers really track their data alignment? I doubt it.
Do users run C* with the JVM NUMA options and the right Linux THP options?
Again, I doubt it.

Scylla, on the other hand, is designed for NUMA. We have 2-level sharding.
The inner shards are transparent
to the user and are per-core (hyperthread). Such a shard accesses RAM only
within its NUMA node. Memory
is bound to each thread/NUMA node. We have our own malloc allocator built
for this scheme.



> If scyllaDB has efficient Secondary indexes, LWT and MV's then that is
> something. I would be glad to see how they perform.
>
>
MV will be in 1.8; we haven't measured performance yet. We did measure our
counter implementation
and it looks promising (4X better throughput and 4X better latency on an
8-core machine).
The not-yet-written LWT will kick-a** since our fully async engine is ideal
for the larger number
of round trips that LWT needs.

This is with the Linux TCP stack; once we use our DPDK one, performance
will improve further ;)



>
> On Fri, Mar 10, 2017 at 10:45 AM, Dor Laor <d...@scylladb.com> wrote:
>
>> Scylla isn't just about performance too.
>>
>> First, a disclaimer, I am a Scylla co-founder. I respect open source a
>> lot,
>> so you guys are welcome to shush me out of this thread. I only participate
>> to provide value if I can (this is a thread about Scylla and our users are
>> on our mailing list).
>>
>> Scylla is all about what Cassandra is plus:
>>  - Efficient hardware utilization (scale-up, performance)
>>  - Low tail latency
>>  - Auto/dynamic tuning (no JVM tuning, we tune the OS ourselves, we have
>> cpu scheduler,
>>I/O userspace scheduler and more to come).
>>  - SLA between compaction, repair, streaming and your r/w operations
>>
>> We started with a great foundation (C*) and wish to improve almost any
>> aspect of it.
>> Admittedly, we're way behind C* in terms of adoption. One need to start
>> somewhere.
>> However, users such as AppNexus run Scylla in production with 47 physical
>> nodes
>> across 5 datacenters and their VP estimate that C* would have at least
>> doubled the
>> size. So this is equal for a 100-node C* cluster. Since we have the same
>> gossip, murmur3 hash,
>> CQL, nothing stops us to scale to 1,000 nodes. Another user (Mogujie) run
>> 10s of TBs per node(!)
>> in production.
>>
>> Also, since we try to compare Scylla and C* in a fair way, we invested a
>> great deal of time
>> to run C*. I can say it's not simple at all.
>> Lastly, in a couple of months we'll reach parity in functionality with C*
>> (counters are in 1.7 as experimental, in 1.8 counters will be stable and
>> we'll have MV as experimental, LWT will be
>> in the summer). We hope to collaborate with the C* community with the
>> development of future
>> features.
>>
>> Dor
>>
>>
>> On Fri, Mar 10, 2017 at 10:19 AM, Jacques-Henri Berthemet <
>> jacques-henri.berthe...@genesys.com> wrote:
>>
>>> Cassandra is not about pure performance, there are many other DBs that
>>> are much faster than Cassandra. Cassandra strength is all about
>>> scalability, performance increases in a linear way as you add more nodes.
>>> During Cassandra summit 2014 Apple said they have a 10k node cluster. The
>>> usual limiting factor is your disk write speed and latency, I don’t see how
>>> C++ changes anything in this regard unless you can cache all your data in
>>> memory.
>>>
>>>
>>>
>>> I’d be curious to know how ScyllaDB performs with a 100+ nodes cluster
>>> with PBs of data compared to Cassandra.
>>>
>>> *--*
>>>
>>> *Jacques-Henri Berthemet*
>>>
>>>

Re: scylladb

2017-03-10 Thread Dor Laor
Scylla isn't just about performance, either.

First, a disclaimer, I am a Scylla co-founder. I respect open source a lot,
so you guys are welcome to shush me out of this thread. I only participate
to provide value if I can (this is a thread about Scylla and our users are
on our mailing list).

Scylla is all about what Cassandra is plus:
 - Efficient hardware utilization (scale-up, performance)
 - Low tail latency
 - Auto/dynamic tuning (no JVM tuning, we tune the OS ourselves, we have a
   CPU scheduler, a userspace I/O scheduler and more to come).
 - SLA between compaction, repair, streaming and your r/w operations

We started with a great foundation (C*) and wish to improve almost every
aspect of it.
Admittedly, we're way behind C* in terms of adoption. One needs to start
somewhere.
However, users such as AppNexus run Scylla in production with 47 physical
nodes
across 5 datacenters, and their VP estimates that C* would have at least
doubled the
size. So this is equivalent to a 100-node C* cluster. Since we have the same
gossip, murmur3 hash and
CQL, nothing stops us from scaling to 1,000 nodes. Another user (Mogujie)
runs 10s of TBs per node(!)
in production.

Also, since we try to compare Scylla and C* in a fair way, we invested a
great deal of time
in running C*. I can say it's not simple at all.
Lastly, in a couple of months we'll reach parity in functionality with C*
(counters are in 1.7 as experimental, in 1.8 counters will be stable and
we'll have MV as experimental, LWT will be
in the summer). We hope to collaborate with the C* community on the
development of future
features.

Dor


On Fri, Mar 10, 2017 at 10:19 AM, Jacques-Henri Berthemet <
jacques-henri.berthe...@genesys.com> wrote:

> Cassandra is not about pure performance, there are many other DBs that are
> much faster than Cassandra. Cassandra strength is all about scalability,
> performance increases in a linear way as you add more nodes. During
> Cassandra summit 2014 Apple said they have a 10k node cluster. The usual
> limiting factor is your disk write speed and latency, I don’t see how C++
> changes anything in this regard unless you can cache all your data in
> memory.
>
>
>
> I’d be curious to know how ScyllaDB performs with a 100+ nodes cluster
> with PBs of data compared to Cassandra.
>
> *--*
>
> *Jacques-Henri Berthemet*
>
>
>
> *From:* Rakesh Kumar [mailto:rakeshkumar...@outlook.com]
> *Sent:* vendredi 10 mars 2017 09:58
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: scylladb
>
>
>
> Cassanda vs Scylla is a valid comparison because they both are
> compatible.  Scylla is a drop-in replacement for Cassandra.
> Is Aerospike a drop-in replacement for Cassandra? If yes, and only if yes,
> then the comparison is valid with Scylla.
>
>
> --
>
> *From:* Bhuvan Rawal 
> *To:* user@cassandra.apache.org
> *Sent:* Friday, March 10, 2017 11:59 AM
> *Subject:* Re: scylladb
>
>
>
> Agreed C++ gives an added advantage to talk to underlying hardware with
> better efficiency, it sound good but can a pice of code written in C++ give
> 1000% throughput than a Java app? Is TPC design 10X more performant than
> SEDA arch?
>
>
>
> And if C/C++ is indeed that fast how can Aerospike (which is itself
> written in C) claim to be 10X faster than Scylla here
> http://www.aerospike.com/benchmarks/scylladb-initial/ ? (Combining your's
> and aerospike's benchmarks it appears that Aerospike is 100X performant
> than C* - I highly doubt that!! )
>
>
>
> For a moment lets forget about evaluating 2 different databases, one can
> observe 10X performance difference between a mistuned cassandra cluster and
> one thats tuned as per data model - there are so many Tunables in yaml as
> well as table configs.
>
>
>
> Idea is - in order to strengthen your claim, you need to provide complete
> system metrics (Disk, CPU, Network), the OPS increase starts to decay along
> with the configs used. Having plain ops per second and 99p latency is
> blackbox.
>
>
>
> Regards,
>
> Bhuvan
>
>
>
> On Fri, Mar 10, 2017 at 12:47 PM, Avi Kivity  wrote:
>
> ScyllaDB engineer here.
>
> C++ is really an enabling technology here. It is directly responsible for
> a small fraction of the gain by executing faster than Java.  But it is
> indirectly responsible for the gain by allowing us direct control over
> memory and threading.  Just as an example, Scylla starts by taking over
> almost all of the machine's memory, and dynamically assigning it to
> memtables, cache, and working memory needed to handle requests in flight.
> Memory is statically partitioned across cores, allowing us to exploit NUMA
> fully.  You can't do these things in Java.
>
> I would say the major contributors to Scylla performance are:
>  - thread-per-core design
>  - replacement of the page cache with a row cache
>  - careful attention to many small details, each contributing a little,
> but with a large overall impact
>
> While I'm here I can say that performance is not the 

Re: How many vnodes should I use for each node in my cluster?

2016-09-16 Thread Dor Laor
On Fri, Sep 16, 2016 at 11:29 AM, Li, Guangxing 
wrote:

> Hi,
>
> I have a 3 nodes cluster, each with less than 200 GB data. Currently all
> nodes have the default 256 value for num_tokens. My colleague told me that
> with the data size I have (less than 200 GB on each node), I should change
> num_tokens to something like 32 to get better performance, especially speed
> up the repair time. Do any of you guys have experience on
>

It's not enough to know the volume size; it's important to know the number
of keys, which affects the Merkle tree. I wouldn't change it. I doubt you'll
see a significant difference in repair speed, and if you grow the cluster
you would want to have enough vnodes.


> this? I am running Cassandra Community version 2.0.9. The cluster resides
> in AWS. All keyspaces have RC 3.
>
> Thanks.
>
> George.
>


Re: IO scheduler for SSDs on EC2?

2015-03-15 Thread Dor Laor
On Sun, Mar 15, 2015 at 2:03 PM, Ali Akhtar ali.rac...@gmail.com wrote:

 I was watching a talk recently on Elasticsearch performance in EC2, and
 they recommended setting the IO scheduler to noop for SSDs. Is that the
 case for Cassandra as well, or is it recommended to keep the default
 'deadline' scheduler for Cassandra?


In general the noop IO scheduler is the most suitable one for virtual
machines (regardless
of C* or SSDs). Usually the hypervisor below or the target SAN/NAS runs
its own
IO scheduler, so there is no need to do that twice.

From my own experience (managing the Xen and KVM dev teams @redhat), both
deadline and
noop gave similar results and were better than CFQ.



 Thanks.