Re: Change primary key from int to bigint

2017-01-11 Thread Tom van der Woerdt
Actually, come to think of it, there's a subtle serialization difference
between varint and int that will break token generation (see bottom of
mail). I think it's a bug that Cassandra will allow this, so don't do this
in production.

You can think of varint encoding as regular bigints with all the leading
zero bytes stripped off. This means the varint decoder will happily decode
the tinyint, smallint, int, and bigint types, but the encoder won't
necessarily re-encode to the same thing. Specifically, any int below
8388608 will have a different encoding in a varint.

There's a small performance impact with the varint encoding and decoding
scheme, but likely insignificant for any reasonable use case.
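The byte-level difference is easy to demonstrate outside Cassandra. The sketch below mimics the layout (plain Python, not Cassandra's actual serializer classes): an int is always four big-endian bytes, while a varint is the shortest two's-complement form that holds the value.

```python
def int32_encode(v):
    # fixed-width int: always four big-endian two's-complement bytes
    return v.to_bytes(4, "big", signed=True)

def varint_encode(v):
    # varint: shortest two's-complement representation that holds v
    n = 1
    while True:
        try:
            return v.to_bytes(n, "big", signed=True)
        except OverflowError:
            n += 1

# Values below 2**23 re-encode shorter, so their serialized form (and
# therefore their partition token) changes:
assert varint_encode(1) != int32_encode(1)          # 0x01 vs 0x00000001
assert varint_encode(2**23) == int32_encode(2**23)  # both 0x00800000
```

This matches the cqlsh session below: only the rows with keys of 16777215 (just under 2^24, still four bytes when minimally encoded) and above remain findable after the ALTER.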

Tom






cqlsh> select * from foo where id in (1, 128, 256, 65535, 65536, 16777215,
16777216, 2147483647);

 id         | value
------------+-------
          1 |  test
        128 |  test
        256 |  test
      65535 |  test
      65536 |  test
   16777215 |  test
   16777216 |  test
 2147483647 |  test

(8 rows)
cqlsh> alter table foo alter id TYPE varint;
cqlsh> select * from foo where id in (1, 128, 256, 65535, 65536, 16777215,
16777216, 2147483647);

 id         | value
------------+-------
   16777215 |  test
   16777216 |  test
 2147483647 |  test

(3 rows)
cqlsh> select * from foo;

 id         | value
------------+-------
        128 |  test
   16777216 |  test
          1 |  test
 2147483647 |  test
   16777215 |  test
        256 |  test
      65535 |  test
      65536 |  test



On Wed, Jan 11, 2017 at 9:54 AM, Benjamin Roth <benjamin.r...@jaumo.com>
wrote:

> Phew! You saved my life, thanks!
>
> For my understanding:
> When creating a new table, is bigint or varint a better choice for storing
> (up to) 64bit ints? Is there a difference in performance?
>
> 2017-01-11 9:39 GMT+01:00 Tom van der Woerdt <tom.vanderwoe...@booking.com
> >:
>
>> Hi Benjamin,
>>
>> bigint and int have incompatible serialization types, so that won't work.
>> However, changing to 'varint' will work fine.
>>
>> Hope that helps.
>>
>> Tom
>>
>>
>>
>> On Wed, Jan 11, 2017 at 9:21 AM, Benjamin Roth <benjamin.r...@jaumo.com>
>> wrote:
>>
>>> Hi there,
>>>
> Does anyone know if there is a hack to change an "int" to a "bigint" in a
>>> primary key?
>>> I recognized very late, I took the wrong type and our production DB
>>> already contains billions of records :(
>>> Is there maybe a hack for it, because int and bigint are similar types
>>> or does the SSTable serialization and maybe the token generation require
>>> the tables to be completely reread+rewritten?
>>>
>>> --
>>> Benjamin Roth
>>> Prokurist
>>>
>>> Jaumo GmbH · www.jaumo.com
>>> Wehrstraße 46 · 73035 Göppingen · Germany
>>> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
>>> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>>>
>>
>>
>
>
> --
> Benjamin Roth
> Prokurist
>
> Jaumo GmbH · www.jaumo.com
> Wehrstraße 46 · 73035 Göppingen · Germany
> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>


Re: Change primary key from int to bigint

2017-01-11 Thread Tom van der Woerdt
Hi Benjamin,

bigint and int have incompatible serialization types, so that won't work.
However, changing to 'varint' will work fine.

Hope that helps.

Tom


On Wed, Jan 11, 2017 at 9:21 AM, Benjamin Roth 
wrote:

> Hi there,
>
> Does anyone know if there is a hack to change an "int" to a "bigint" in a
> primary key?
> I recognized very late, I took the wrong type and our production DB
> already contains billions of records :(
> Is there maybe a hack for it, because int and bigint are similar types or
> does the SSTable serialization and maybe the token generation require the
> tables to be completely reread+rewritten?
>
> --
> Benjamin Roth
> Prokurist
>
> Jaumo GmbH · www.jaumo.com
> Wehrstraße 46 · 73035 Göppingen · Germany
> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>


Re: Change primary key from int to bigint

2017-01-11 Thread Tom van der Woerdt
My understanding is that it's safe... but considering "alter type" is going
to be removed completely (
https://issues.apache.org/jira/browse/CASSANDRA-12443), maybe not.

As for faster ways to do this: no idea :-(

Tom


On Wed, Jan 11, 2017 at 12:12 PM, Benjamin Roth <benjamin.r...@jaumo.com>
wrote:

> But it is safe to change non-primary-key columns from int to varint, right?
>
> 2017-01-11 10:09 GMT+01:00 Tom van der Woerdt <
> tom.vanderwoe...@booking.com>:
>
>> Actually, come to think of it, there's a subtle serialization difference
>> between varint and int that will break token generation (see bottom of
>> mail). I think it's a bug that Cassandra will allow this, so don't do this
>> in production.
>>
>> You can think of varint encoding as regular bigints with all the leading
>> zero bytes stripped off. This means the varint decoder will happily decode
>> the tinyint, smallint, int, and bigint types, but the encoder won't
>> necessarily re-encode to the same thing. Specifically, any int below
>> 8388608 will have a different encoding in a varint.
>>
>> There's a small performance impact with the varint encoding and decoding
>> scheme, but likely insignificant for any reasonable use case.
>>
>> Tom
>>
>>
>>
>>
>>
>>
>> cqlsh> select * from foo where id in (1, 128, 256, 65535, 65536,
>> 16777215, 16777216, 2147483647);
>>
>>  id         | value
>> ------------+-------
>>           1 |  test
>>         128 |  test
>>         256 |  test
>>       65535 |  test
>>       65536 |  test
>>    16777215 |  test
>>    16777216 |  test
>>  2147483647 |  test
>>
>> (8 rows)
>> cqlsh> alter table foo alter id TYPE varint;
>> cqlsh> select * from foo where id in (1, 128, 256, 65535, 65536,
>> 16777215, 16777216, 2147483647);
>>
>>  id         | value
>> ------------+-------
>>    16777215 |  test
>>    16777216 |  test
>>  2147483647 |  test
>>
>> (3 rows)
>> cqlsh> select * from foo;
>>
>>  id         | value
>> ------------+-------
>>         128 |  test
>>    16777216 |  test
>>           1 |  test
>>  2147483647 |  test
>>    16777215 |  test
>>         256 |  test
>>       65535 |  test
>>       65536 |  test
>>
>>
>>
>>
>> On Wed, Jan 11, 2017 at 9:54 AM, Benjamin Roth <benjamin.r...@jaumo.com>
>> wrote:
>>
>>> Phew! You saved my life, thanks!
>>>
>>> For my understanding:
>>> When creating a new table, is bigint or varint a better choice for
>>> storing (up to) 64bit ints? Is there a difference in performance?
>>>
>>> 2017-01-11 9:39 GMT+01:00 Tom van der Woerdt <
>>> tom.vanderwoe...@booking.com>:
>>>
>>>> Hi Benjamin,
>>>>
>>>> bigint and int have incompatible serialization types, so that won't
>>>> work. However, changing to 'varint' will work fine.
>>>>
>>>> Hope that helps.
>>>>
>>>> Tom
>>>>
>>>>
>>>>
>>>> On Wed, Jan 11, 2017 at 9:21 AM, Benjamin Roth <benjamin.r...@jaumo.com
>>>> > wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> Does anyone know if there is a hack to change an "int" to a "bigint" in
>>>>> a primary key?
>>>>> I recognized very late, I took the wrong type and our production DB
>>>>> already contains billions of records :(
>>>>> Is there maybe a hack for it, because int and bigint are similar types
>>>>> or does the SSTable serialization and maybe the token generation require
>>>>> the tables to be completely reread+rewritten?
>>>>>
>>>>> --
>>>>> Benjamin Roth
>>>>> Prokurist
>>>>>
>>>>> Jaumo GmbH · www.jaumo.com
>>>>> Wehrstraße 46 · 73035 Göppingen · Germany
>>>>> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
>>>>> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Benjamin Roth
>>> Prokurist
>>>
>>> Jaumo GmbH · www.jaumo.com
>>> Wehrstraße 46 · 73035 Göppingen · Germany
>>> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
>>> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>>>
>>
>>
>
>
> --
> Benjamin Roth
> Prokurist
>
> Jaumo GmbH · www.jaumo.com
> Wehrstraße 46 · 73035 Göppingen · Germany
> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>


Re: Netty SSL memory leak

2017-05-31 Thread Tom van der Woerdt
Hi John,

That's the bug I filed the ticket for, yup. I recommend updating to a newer
Cassandra version (3.0.11 or newer), which fixes this issue (and many
others).

Tom


On Wed, May 31, 2017 at 12:39 AM, John Sanda  wrote:

> I have Cassandra 3.0.9 cluster that is hitting OutOfMemoryErrors with byte
> buffer allocation. The stack trace looks like:
>
> java.lang.OutOfMemoryError: Direct buffer memory
> at java.nio.Bits.reserveMemory(Bits.java:694) ~[na:1.8.0_131]
> at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
> ~[na:1.8.0_131]
> at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
> ~[na:1.8.0_131]
> at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:434)
> ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
> at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:179)
> ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:168)
> ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:98)
> ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
> at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(
> PooledByteBufAllocator.java:250) ~[netty-all-4.0.23.Final.jar:
> 4.0.23.Final]
> at io.netty.buffer.AbstractByteBufAllocator.directBuffer(
> AbstractByteBufAllocator.java:155) ~[netty-all-4.0.23.Final.jar:
> 4.0.23.Final]
> at io.netty.buffer.AbstractByteBufAllocator.directBuffer(
> AbstractByteBufAllocator.java:146) ~[netty-all-4.0.23.Final.jar:
> 4.0.23.Final]
> at io.netty.buffer.AbstractByteBufAllocator.buffer(
> AbstractByteBufAllocator.java:83) ~[netty-all-4.0.23.Final.jar:
> 4.0.23.Final]
> at io.netty.handler.ssl.SslHandler.allocate(SslHandler.java:1265)
> ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
> at io.netty.handler.ssl.SslHandler.allocateOutNetBuf(SslHandler.java:1275)
> ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
> at io.netty.handler.ssl.SslHandler.wrap(SslHandler.java:453)
> ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
> at io.netty.handler.ssl.SslHandler.flush(SslHandler.java:432)
> ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
> at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(
> AbstractChannelHandlerContext.java:688) ~[netty-all-4.0.23.Final.jar:
> 4.0.23.Final]
>
> I do not yet have a heap dump. The two relevant tickets are CASSANDRA-13114
> and CASSANDRA-13126. The upstream Netty ticket is 3057. Cassandra 3.0.11
> upgraded Netty to the version with the fix. Is there anything I can check to
> confirm that this is in fact the issue I am hitting?
>
> Secondly, is there a way to monitor for this? The OOME does not cause the
> JVM to exit. Instead, the logs are getting filled up with OutOfMemoryErrors.
> nodetool status reports UN, and nodetool statusbinary reports running.
>
> --
>
> - John
>
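Pending the upgrade, one low-tech way to watch for the log-flooding symptom described above is to count direct-buffer OOME lines in the Cassandra log. A sketch; the log path and how you wire this into alerting are assumptions about your deployment:

```python
def count_direct_buffer_oomes(log_lines):
    # Count occurrences of the Netty direct-buffer OOME in a log stream;
    # a nonzero, growing count on a node that still reports UN via
    # nodetool status is the signature described in this thread.
    needle = "java.lang.OutOfMemoryError: Direct buffer memory"
    return sum(1 for line in log_lines if needle in line)

# Typical use (path is an assumption):
#   with open("/var/log/cassandra/system.log") as f:
#       n = count_direct_buffer_oomes(f)
```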


Unexpected rows in MV after upgrading to 3.0.15

2017-11-03 Thread Tom van der Woerdt
Hello,

While testing 3.0.15, we noticed that some materialized views started
showing rows that shouldn't exist, as multiple rows in the view map to a
single row in the base table.

I've pasted the table structure below, but essentially there's a base table
"((pk1,pk2,pk3),ck1),col1" and MV "((pk1,pk2,pk3),col1,ck1)". This means
that if col1 changes, we expect a delete and insert on the MV. And yet this
happens:

> select col1, ck1, dateof(col1) FROM view_1 where pk1='abc' and pk2='123'
and pk3='def';

 col1                                 | ck1                                  | system.dateof(col1)
--------------------------------------+--------------------------------------+---------------------------------
 7bd437d9-bccc-11e7-9748-40749b41c1e0 | 295eae9b-d544-4064-8dbc-0c56772759f3 | 2017-10-29 17:13:29.494000+0000
 df39e364-bed3-11e7-8a3d-953c29bf01ff | 295eae9b-d544-4064-8dbc-0c56772759f3 | 2017-11-01 07:11:25.057000+0000
 928980ae-bed5-11e7-8b41-6e709b16923d | 295eae9b-d544-4064-8dbc-0c56772759f3 | 2017-11-01 07:23:35.388000+0000
 # Only relevant rows are shown

> select col1,writetime(col1),dateof(col1) from table_1 where pk1='abc' and
pk2='123' and pk3='def' and ck1='295eae9b-d544-4064-8dbc-0c56772759f3';

 col1                                 | writetime(col1)  | system.dateof(col1)
--------------------------------------+------------------+---------------------------------
 928980ae-bed5-11e7-8b41-6e709b16923d | 1509728864328000 | 2017-11-01 07:23:35.388000+0000

It's not supposed to be possible, and yet there are three rows that all map
onto the same primary key in the base table.

The cluster was upgraded on 2017-10-31, so the first row could *maybe* be
explained by CASSANDRA-11500, but the second row can't. The third row is
the one we expect to be there.

Is this a new regression in 3.0.15? Is anyone else experiencing this, or
should I file a ticket?

Thanks,
Tom


--- Full structure: -

CREATE TABLE the_keyspace.table_1 (
pk1 ascii,
pk2 ascii,
pk3 ascii,
ck1 ascii,
col1 timeuuid,
PRIMARY KEY ((pk1, pk2, pk3), ck1)
) WITH CLUSTERING ORDER BY (ck1 ASC)
AND bloom_filter_fp_chance = 0.1
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class':
'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
AND compression = {'chunk_length_in_kb': '64', 'class':
'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';

CREATE MATERIALIZED VIEW the_keyspace.view_1 AS
SELECT *
FROM the_keyspace.table_1
WHERE pk1 IS NOT NULL AND pk2 IS NOT NULL AND pk3 IS NOT NULL AND col1
IS NOT NULL AND ck1 IS NOT NULL
PRIMARY KEY ((pk1, pk2, pk3), col1, ck1)
WITH CLUSTERING ORDER BY (col1 ASC, ck1 ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class':
'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
AND compression = {'chunk_length_in_kb': '64', 'class':
'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';


Tom van der Woerdt
Site Reliability Engineer

Booking.com B.V.
Vijzelstraat 66-80 Amsterdam 1017HL Netherlands
[image: Booking.com] <http://www.booking.com/>
The world's #1 accommodation site
43 languages, 204+ offices worldwide, 118,000+ global destinations,
1,500,000+ room nights booked every day
No booking fees, best price always guaranteed
Subsidiary of the Priceline Group (NASDAQ: PCLN)


Re: [External] Maximum SSTable size

2018-06-27 Thread Tom van der Woerdt
I’ve had SSTables as big as 11TB. It works, read performance is fine. But,
compaction is hell, because you’ll need twice that in disk space and it
will take many hours 

Avoid large SSTables unless you really know what you’re doing. LCS is a
great default for almost every workload, especially if your cluster has a
single large table. STCS is the actual Cassandra default but it often
causes more trouble than it solves, because of large SSTables 

Hope that helps!

Tom


On Wed, 27 Jun 2018 at 08:02, Lucas Benevides 
wrote:

> Hello Community,
>
> Is there a maximum SSTable Size?
> If there is not, does it go up to the maximum Operational System values?
>
> Thanks in advance,
> Lucas Benevides
>
-- 
Tom van der Woerdt
Site Reliability Engineer

Booking.com B.V.
Vijzelstraat 66-80 Amsterdam 1017HL Netherlands
Direct +31207153426
[image: Booking.com] <https://www.booking.com/>
The world's #1 accommodation site
43 languages, 198+ offices worldwide, 120,000+ global destinations,
1,550,000+ room nights booked every day
No booking fees, best price always guaranteed
Subsidiary of Booking Holdings Inc. (NASDAQ: BKNG)


Re: Meltdown/Spectre Linux patch - Performance impact on Cassandra?

2018-01-05 Thread Tom van der Woerdt
Hi Thomas,

No clue about AWS, and it is of course highly dependent on hardware, but on
CentOS 7 on bare metal, the patched kernel
(kernel-3.10.0-693.11.6.el7.x86_64) seems to have a roughly 50% CPU
increase compared to an unpatched kernel
(kernel-3.10.0-693.11.1.el7.x86_64). On a happier note, the latest mainline
kernel from elrepo (kernel-ml-4.14.11-1.el7.elrepo.x86_64) seems to recover
the entire performance loss, likely due to recent PCID patches (or the
other 3+ years of kernel development).

That's on lab servers though, the numbers here will likely vary a lot based
on test setup, and may not be reproducible for production workloads.

If you have the infrastructure to test a variety of kernels, I'd be very
interested to see your numbers.

Thanks,

Tom van der Woerdt
Site Reliability Engineer

Booking.com B.V.
Vijzelstraat 66-80 Amsterdam 1017HL Netherlands
Direct +31207153426
[image: Booking.com] <http://www.booking.com/>
The world's #1 accommodation site
43 languages, 198+ offices worldwide, 120,000+ global destinations,
1,550,000+ room nights booked every day
No booking fees, best price always guaranteed
Subsidiary of the Priceline Group (NASDAQ: PCLN)

On Fri, Jan 5, 2018 at 12:09 PM, Steinmaurer, Thomas <
thomas.steinmau...@dynatrace.com> wrote:

> Hello,
>
>
>
> has anybody already some experience/results if a patched Linux kernel
> regarding Meltdown/Spectre is affecting performance of Cassandra negatively?
>
>
>
> In production, all nodes running in AWS with m4.xlarge, we see up to a 50%
> relative (e.g. AVG CPU from 40% => 60%) CPU increase since Jan 4, 2018,
> most likely correlating with Amazon finished patching the underlying
> Hypervisor infrastructure …
>
>
>
> Anybody else seeing a similar CPU increase?
>
>
>
> Thanks,
>
> Thomas
>
>
> The contents of this e-mail are intended for the named addressee only. It
> contains information that may be confidential. Unless you are the named
> addressee or an authorized designee, you may not copy or use it, or
> disclose it to anyone else. If you received it in error please notify us
> immediately and then destroy it. Dynatrace Austria GmbH (registration
> number FN 91482h) is a company registered in Linz whose registered office
> is at 4040 Linz, Austria, Freistädterstraße 313.
>


Re: Decommissioned nodes and FailureDetector

2018-01-19 Thread Tom van der Woerdt
Hi Oleksandr,

Here's the code I use, hope it helps:

ownership = jolokia_read("org.apache.cassandra.db:type=StorageService",
                         "Ownership")
unreachable = jolokia_read("org.apache.cassandra.db:type=StorageService",
                           "UnreachableNodes")

# Ownership keys look like "hostname/ip"; index by the IP part
ownership_by_ip = {}
for nodeinfo, ownership_ratio in ownership.items():
    ownership_by_ip[nodeinfo.split('/')[1]] = ownership_ratio

# Ignore unreachable nodes that own no data (e.g. decommissioned ones)
unreachable_and_has_data = []
for node in set(unreachable):
    if node not in ownership_by_ip or ownership_by_ip[node] == 0:
        continue
    unreachable_and_has_data.append(node)

# Group the remaining nodes by (rack, datacenter) to count affected racks
unreachable_racks = {}
for node in unreachable_and_has_data:
    its_rack = jolokia_exec("org.apache.cassandra.db:type=EndpointSnitchInfo",
                            "getRack/%s" % node)
    its_dc = jolokia_exec("org.apache.cassandra.db:type=EndpointSnitchInfo",
                          "getDatacenter/%s" % node)
    rack_name = "%s %s" % (its_rack, its_dc)
    unreachable_racks[rack_name] = 1

racks_unreachable = len(unreachable_racks.keys())
nodes_unreachable = len(unreachable_and_has_data)

This also looks at the number of unreachable racks, so if you only care
about nodes you should be able to get rid of most code here.
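For completeness, `jolokia_read` and `jolokia_exec` in the snippet above are thin wrappers over the Jolokia agent's REST interface. A minimal sketch of how they might look; the agent address, port, and path are assumptions about your Jolokia deployment, not something from the original mail:

```python
import json
import urllib.request

JOLOKIA_BASE = "http://localhost:8778/jolokia"  # assumed agent address

def jolokia_url(op, mbean, target, base=JOLOKIA_BASE):
    # Jolokia's REST style: GET <base>/<read|exec>/<mbean>/<attr-or-op>
    return "%s/%s/%s/%s" % (base, op, mbean, target)

def jolokia_read(mbean, attribute, base=JOLOKIA_BASE):
    # Read a single MBean attribute and unwrap Jolokia's response envelope
    with urllib.request.urlopen(jolokia_url("read", mbean, attribute, base)) as resp:
        return json.load(resp)["value"]

def jolokia_exec(mbean, operation, base=JOLOKIA_BASE):
    # Execute an MBean operation; args are path segments, e.g. "getRack/1.2.3.4"
    with urllib.request.urlopen(jolokia_url("exec", mbean, operation, base)) as resp:
        return json.load(resp)["value"]
```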

Tom van der Woerdt
Site Reliability Engineer

Booking.com B.V.
Vijzelstraat 66-80 Amsterdam 1017HL Netherlands
[image: Booking.com] <http://www.booking.com/>
The world's #1 accommodation site
43 languages, 198+ offices worldwide, 120,000+ global destinations,
1,550,000+ room nights booked every day
No booking fees, best price always guaranteed
Subsidiary of the Priceline Group (NASDAQ: PCLN)

On Fri, Jan 19, 2018 at 12:28 PM, Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> On Fri, Jan 19, 2018 at 11:17 AM, Nicolas Guyomar <
> nicolas.guyo...@gmail.com> wrote:
>
>> Hi,
>>
>> Not sure if StorageService should be accessed, but you can check node
>> movement here :
>> 'org.apache.cassandra.db:type=StorageService/LeavingNodes',
>> 'org.apache.cassandra.db:type=StorageService/LiveNodes',
>> 'org.apache.cassandra.db:type=StorageService/UnreachableNodes',
>>
>
> Checking the list of  Unreachable Nodes doesn't help unfortunately, since
> it contains a mix of decommissioned and just DOWN nodes.  So the total
> number of addresses in this list is equal to the DownEndpointCount, from
> the perspective of a node where you query it.
>
> --
> Alex
>
>


Re: [External] Is there any limit in the number of partitions that a table can have

2018-03-07 Thread Tom van der Woerdt
Hi Javier,

When our users ask this question, I tend to answer "keep it above a
billion". More partitions is better.

I'm not aware of any actual limits on partition count. Practically it's
almost always limited by the disk space in a server.

Tom van der Woerdt
Site Reliability Engineer

Booking.com B.V.
Vijzelstraat 66-80 Amsterdam 1017HL Netherlands
[image: Booking.com] <http://www.booking.com/>
The world's #1 accommodation site
43 languages, 198+ offices worldwide, 120,000+ global destinations,
1,550,000+ room nights booked every day
No booking fees, best price always guaranteed
Subsidiary of Booking Holdings Inc. (NASDAQ: BKNG)

On Wed, Mar 7, 2018 at 12:06 PM, Javier Pareja <pareja.jav...@gmail.com>
wrote:

> Hello all,
>
> I have been trying to find an answer to the following but I have had no
> luck so far:
> Is there any limit to the number of partitions that a table can have?
> Let's say a table has a partition key an no clustering key, is there a
> recommended limit on the number of values that this partition key can have?
> Is it recommended to have a clustering key to reduce this number by storing
> several rows in each partition instead of one row per partition.
>
> Regards,
>
> F Javier Pareja
>


Re: [External] Re: Whch version is the best version to run now?

2018-03-05 Thread Tom van der Woerdt
We run on the order of a thousand Cassandra nodes in production. Most of
that is 3.0.16, but new clusters are defaulting to 3.11.2 and some older
clusters have been upgraded to it as well.

All of the bugs I encountered in 3.11.x were also seen in 3.0.x, but 3.11.x
seems to get more love from the community wrt patches. This is why I'd
recommend 3.11.x for new projects.

Stay away from any of the 2.x series, they're going EOL soonish and the
newer versions are very stable.

Tom van der Woerdt
Site Reliability Engineer

Booking.com B.V.
Vijzelstraat 66-80 Amsterdam 1017HL Netherlands
[image: Booking.com] <http://www.booking.com/>
The world's #1 accommodation site
43 languages, 198+ offices worldwide, 120,000+ global destinations,
1,550,000+ room nights booked every day
No booking fees, best price always guaranteed
Subsidiary of Booking Holdings Inc. (NASDAQ: BKNG)

On Sat, Mar 3, 2018 at 12:25 AM, Jeff Jirsa <jji...@gmail.com> wrote:

> I’d personally be willing to run 3.0.16
>
> 3.11.2 or 3 whatever should also be similar, but I haven’t personally
> tested it at any meaningful scale
>
>
> --
> Jeff Jirsa
>
>
> On Mar 2, 2018, at 2:37 PM, Kenneth Brotman <kenbrot...@yahoo.com.INVALID>
> wrote:
>
> Seems like a lot of people are running old versions of Cassandra.  What is
> the best version, most reliable stable version to use now?
>
>
>
> Kenneth Brotman
>
>


Re: Five Questions for Cassandra Users

2019-03-28 Thread Tom van der Woerdt
1.   Do the same people where you work operate the cluster and write
the code to develop the application?

No, we have a small infrastructure team, and many people developing
applications using Cassandra

2.   Do you have a metrics stack that allows you to see graphs of
various metrics with all the nodes displayed together?

Yes, we use a re-implementation of Graphite, which we open-sourced and now
lives at https://github.com/go-graphite

3.   Do you have a log stack that allows you to see the logs for all
the nodes together?

Yes, although in practice we don't use it much for Cassandra

4.   Do you regularly repair your clusters - such as by using Reaper?

Yes, we have built our own tools for this

5.   Do you use artificial intelligence to help manage your clusters?

It's not "artificial intelligence" the way most people would describe it,
but we certainly don't run our clusters manually



Tom van der Woerdt
Site Reliability Engineer

Booking.com B.V.
Vijzelstraat 66-80 Amsterdam 1017HL Netherlands
[image: Booking.com] <https://www.booking.com/>
Empowering people to experience the world since 1996
43 languages, 214+ offices worldwide, 141,000+ global destinations, 29
million reported listings
Subsidiary of Booking Holdings Inc. (NASDAQ: BKNG)


On Thu, Mar 28, 2019 at 10:03 AM Kenneth Brotman
 wrote:

> I’m looking to get a better feel for how people use Cassandra in
> practice.  I thought others would benefit as well so may I ask you the
> following five questions:
>
>
>
> 1.   Do the same people where you work operate the cluster and write
> the code to develop the application?
>
>
>
> 2.   Do you have a metrics stack that allows you to see graphs of
> various metrics with all the nodes displayed together?
>
>
>
> 3.   Do you have a log stack that allows you to see the logs for all
> the nodes together?
>
>
>
> 4.   Do you regularly repair your clusters - such as by using Reaper?
>
>
>
> 5.   Do you use artificial intelligence to help manage your clusters?
>
>
>
>
>
> Thank you for taking your time to share this information!
>
>
>
> Kenneth Brotman
>


Re: Running and Managing Large Cassandra Clusters

2020-10-28 Thread Tom van der Woerdt
Does 360 count? :-)

num_tokens is 16, works fine (had 256 on a 300 node cluster as well, not
too many problems either). Roughly 2.5TB per node, running on-prem on
reasonably stable hardware so replacements end up happening once a week at
most, and there's no particular change needed in the automation. Scaling up
or down takes a while, but it doesn't appear to be slower than any other
cluster. Configuration wise it's no different than a 5-node cluster either.
Pretty uneventful tbh.

Tom van der Woerdt
Senior Site Reliability Engineer

Booking.com BV
Vijzelstraat Amsterdam Netherlands 1017HL
[image: Booking.com] <https://www.booking.com/>
Making it easier for everyone to experience the world since 1996
43 languages, 214+ offices worldwide, 141,000+ global destinations, 29
million reported listings
Subsidiary of Booking Holdings Inc. (NASDAQ: BKNG)


On Wed, Oct 28, 2020 at 8:58 AM Gediminas Blazys
 wrote:

> Hello,
>
>
>
> I wanted to seek out your opinion and experience.
>
>
>
> Has anyone of you had a chance to run a Cassandra cluster of more than 350
> nodes?
>
> What are the major configuration considerations that you had to focus on?
> What number of vnodes did you use?
>
> Once the cluster was up and running what would you have done differently?
>
> Perhaps it would be more manageable to run multiple smaller clusters? Did
> you try this approach? What were the major challenges?
>
>
>
> I don’t know if questions like that are allowed here but I’m really
> interested in what other folks ran into while running massive operations.
>
>
>
> Gediminas
>
>
>


Re: Running and Managing Large Cassandra Clusters

2020-10-28 Thread Tom van der Woerdt
Heya,

We're running version 3.11.7, can't use 3.11.8 as it won't even start
(CASSANDRA-16091). Our policy is to use LCS for everything unless there's a
good argument for a different compaction strategy (I don't think we have
*any* STCS at all other than system keyspaces). Since our nodes are mostly
on-prem they are generally oversized on cpu count, but when idle the
cluster with 360 nodes ends up using less than two cores *peak* for
background tasks like (full, weekly) repairs and tombstone compactions.
That said they do get 32 logical threads because that's what the hardware
ships with (-:

Haven't had major problems with Gossip over the years. I think we've had to
run nodetool assassinate exactly once, a few years ago. Probably the only
gossip related annoyance is that when you decommission all seed nodes
Cassandra will happily run a single core at 100% trying to connect until
you update the list of seeds, but that's really minor.

There's also one cluster that has 50TB nodes, 60 of them, storing
reasonably large cells (using LCS, previously TWCS, both fine). Replacing a
node takes a few days, but other than that it's not particularly
problematic.

In my experience it's the small clusters that wake you up ;-)

Tom van der Woerdt
Senior Site Reliability Engineer

Booking.com BV
Vijzelstraat Amsterdam Netherlands 1017HL
[image: Booking.com] <https://www.booking.com/>
Making it easier for everyone to experience the world since 1996
43 languages, 214+ offices worldwide, 141,000+ global destinations, 29
million reported listings
Subsidiary of Booking Holdings Inc. (NASDAQ: BKNG)


On Wed, Oct 28, 2020 at 12:32 PM Joshua McKenzie 
wrote:

> A few questions for you Tom if you have 30 seconds and care to disclose:
>
>1. What version of C*?
>2. What compaction strategy?
>3. What's core count allocated per C* node?
>4. Gossip give you any headaches / you have to be delicate there or
>does it behave itself?
>
> Context: pmc/committer and I manage the OSS C* team at DataStax. We're
> doing a lot of thinking about how to generally improve the operator
> experience across the board for folks in the post 4.0 time frame, so data
> like the above (where things are going well at scale and why) is super
> useful to help feed into that effort.
>
> Thanks!
>
>
>
> On Wed, Oct 28, 2020 at 7:14 AM, Tom van der Woerdt <
> tom.vanderwoe...@booking.com.invalid> wrote:
>
>> Does 360 count? :-)
>>
>> num_tokens is 16, works fine (had 256 on a 300 node cluster as well, not
>> too many problems either). Roughly 2.5TB per node, running on-prem on
>> reasonably stable hardware so replacements end up happening once a week at
>> most, and there's no particular change needed in the automation. Scaling up
>> or down takes a while, but it doesn't appear to be slower than any other
>> cluster. Configuration wise it's no different than a 5-node cluster either.
>> Pretty uneventful tbh.
>>
>> Tom van der Woerdt
>> Senior Site Reliability Engineer
>>
>> Booking.com <http://booking.com/> BV
>> Vijzelstraat Amsterdam Netherlands 1017HL
>> [image: Booking.com] <https://www.booking.com/>
>> Making it easier for everyone to experience the world since 1996
>> 43 languages, 214+ offices worldwide, 141,000+ global destinations, 29
>> million reported listings
>> Subsidiary of Booking Holdings Inc. (NASDAQ: BKNG)
>>
>>
>> On Wed, Oct 28, 2020 at 8:58 AM Gediminas Blazys <
>> gediminas.bla...@microsoft.com.invalid> wrote:
>>
>>> Hello,
>>>
>>>
>>>
>>> I wanted to seek out your opinion and experience.
>>>
>>>
>>>
>>> Has anyone of you had a chance to run a Cassandra cluster of more than
>>> 350 nodes?
>>>
>>> What are the major configuration considerations that you had to focus
>>> on? What number of vnodes did you use?
>>>
>>> Once the cluster was up and running what would you have done differently?
>>>
>>> Perhaps it would be more manageable to run multiple smaller clusters?
>>> Did you try this approach? What were the major challenges?
>>>
>>>
>>>
>>> I don’t know if questions like that are allowed here but I’m really
>>> interested in what other folks ran into while running massive operations.
>>>
>>>
>>>
>>> Gediminas
>>>
>>
>


Re: Running and Managing Large Cassandra Clusters

2020-10-28 Thread Tom van der Woerdt
That particular cluster exists for archival purposes, and as such gets a
very low amount of traffic (maybe 5 queries per minute). So not
particularly helpful to answer your question :-) With that said, we've seen
in other clusters that scalability issues are much more likely to come from
hot partitions, hardware change rate (basically any change to the token
ring, which we never make concurrently), repairs (though largely mitigated
now that we've switched to num_tokens=16), and connection count (it's
sometimes advisable to configure drivers *not* to establish a connection to
every node, but to bound the connection count and let the Cassandra
coordinators route requests instead).

The scalability in terms of client requests/reads/writes tends to be pretty
linear with the node count (and size of course), and on clusters that are
slightly smaller we can see this as well, easily doing hundreds of
thousands to a million queries per second.
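The num_tokens=16 choice mentioned earlier trades repair/streaming overhead
against ownership balance. A minimal simulation with purely random tokens
(illustrative only; Cassandra's token-allocation option balances much better
than random assignment) shows why more vnodes tighten the spread:

```python
import random

def ownership_spread(num_nodes, vnodes, seed=42):
    """Assign `vnodes` random Murmur3-style tokens per node and return
    the max/min ownership ratio across nodes (1.0 = perfectly balanced)."""
    rng = random.Random(seed)
    ring = sorted(
        (rng.randint(-2**63, 2**63 - 1), node)
        for node in range(num_nodes)
        for _ in range(vnodes)
    )
    owned = [0] * num_nodes
    for i, (token, node) in enumerate(ring):
        prev = ring[i - 1][0]  # i == 0 wraps around to the last token
        owned[node] += (token - prev) % 2**64
    return max(owned) / min(owned)

# More vnodes -> tighter balance, but more ranges to repair and stream.
for v in (1, 16, 256):
    print(f"num_tokens={v:3d}  max/min ownership = {ownership_spread(360, v):.2f}")
```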

As for repairs, we have our own tools for this, but it's fairly similar to
what Reaper does: we take all the ranges in the cluster and then schedule
them to be repaired over the course of a week. No manual `nodetool repair`
invocations, but specific single-range repairs.
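The repair tooling described above is internal, but a toy sketch of the
scheduling idea — spreading single-range repairs evenly over a week of
hourly slots and emitting `nodetool repair -st/-et` commands (those flags
exist in stock nodetool; the keyspace name and token values below are made
up) — might look like:

```python
def weekly_repair_schedule(ranges, keyspace="my_keyspace", slots=7 * 24):
    """Round-robin token ranges into hourly slots across one week, so the
    whole ring is repaired once per week without bursts of parallel repairs."""
    schedule = {slot: [] for slot in range(slots)}
    for i, (start, end) in enumerate(ranges):
        schedule[i % slots].append(
            f"nodetool repair -st {start} -et {end} {keyspace}"
        )
    return schedule

# 360 nodes x 16 vnodes -> 5760 ranges; stand-in token values for the demo.
ranges = [(i * 1000, i * 1000 + 999) for i in range(5760)]
sched = weekly_repair_schedule(ranges)
print(len(sched), "slots,", len(sched[0]), "repairs in the first slot")
```

Driving repairs by explicit range keeps each invocation small and
restartable, which matters far more at 360 nodes than on a small cluster.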

Tom van der Woerdt
Senior Site Reliability Engineer



On Wed, Oct 28, 2020 at 2:20 PM Gediminas Blazys
 wrote:

> Hey,
>
>
>
> Thanks for chipping in, Tom. Could you describe what sort of workload the
> big cluster is receiving, in terms of local C* reads, writes, and client
> requests?
>
>
>
> You mention repairs, how do you run them?
>
>
>
> Gediminas
>
>
>
> *From:* Tom van der Woerdt 
> *Sent:* Wednesday, October 28, 2020 14:35
> *To:* user 
> *Subject:* [EXTERNAL] Re: Running and Managing Large Cassandra Clusters
>
>
>
> Heya,
>
>
>
> We're running version 3.11.7, can't use 3.11.8 as it won't even start
> (CASSANDRA-16091). Our policy is to use LCS for everything unless there's a
> good argument for a different compaction strategy (I don't think we have
> *any* STCS at all other than system keyspaces). Since our nodes are mostly
> on-prem they are generally oversized on cpu count, but when idle the
> cluster with 360 nodes ends up using less than two cores *peak* for
> background tasks like (full, weekly) repairs and tombstone compactions.
> That said they do get 32 logical threads because that's what the hardware
> ships with (-:
>
>
>
> Haven't had major problems with Gossip over the years. I think we've had
> to run nodetool assassinate exactly once, a few years ago. Probably the
> only gossip related annoyance is that when you decommission all seed nodes
> Cassandra will happily run a single core at 100% trying to connect until
> you update the list of seeds, but that's really minor.
>
>
>
> There's also one cluster that has 50TB nodes, 60 of them, storing
> reasonably large cells (using LCS, previously TWCS, both fine). Replacing a
> node takes a few days, but other than that it's not particularly
> problematic.
>
>
>
> In my experience it's the small clusters that wake you up ;-)
>
>
> *Tom van der Woerdt*
>
> Senior Site Reliability Engineer
>
>
>
>
>
>
> On Wed, Oct 28, 2020 at 12:32 PM Joshua McKenzie 
> wrote:
>
> A few questions for you Tom if you have 30 seconds and care to disclose:
>
>1. What version of C*?
>2. What compaction strategy?
>3. What's core count allocated per C* node?
>4. Gossip give you any headaches / you have to be delicate there or
>does it behave itself?
>
> Context: pmc/committer and I manage the OSS C* team at DataStax. We're
> doing a lot