Re: Bottleneck for small inserts?

2017-06-13 Thread Eric Pederson
Shoot - I didn't see that one.  I subscribe to the digest but was focusing
on the direct replies and accidentally missed Patrick and Jeff Jirsa's
messages.  Sorry about that...

I've been using a combination of cassandra-stress, cqlsh COPY FROM and a
custom C++ application for my ingestion testing.   My default setting for
my custom client application is 96 threads, and then by default I run one
client application process on each of 3 machines.  I tried
doubling/quadrupling the number of client threads (and doubling/tripling
the number of client processes but keeping the threads per process the
same) but didn't see any change.  If I recall correctly, I started getting
timeouts once I went much beyond concurrent_writes, which is 384 (for a
48-CPU box) - that is, at around 500 threads per client machine. I'll try
again to be sure.

For the purposes of this conversation I will try to always use
cassandra-stress to keep the number of unknowns limited.  I'll run
more cassandra-stress clients tomorrow in line with Patrick's 3-5 per
server recommendation.
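
For reference, the kind of invocation I have in mind for tomorrow looks
roughly like this (the row count, thread count and host names are just
placeholders, not our real numbers):

    # one stress process per client box, 3-5 client boxes per server
    cassandra-stress write n=10000000 cl=ONE \
        -rate threads=300 \
        -node cass01,cass02,cass03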

Thanks!


-- Eric

On Wed, Jun 14, 2017 at 12:40 AM, Jonathan Haddad  wrote:

> Did you try adding more client stress nodes as Patrick recommended?
>
> On Tue, Jun 13, 2017 at 9:31 PM Eric Pederson  wrote:
>
>> Scratch that theory - the flamegraphs show that GC is only 3-4% of the
>> two newer machines' overall processing, compared to 18% on the slow machine.
>>
>> I took that machine out of the cluster completely and recreated the
>> keyspaces.  The ingest tests now run slightly faster (!).   I would have
>> expected a linear slowdown since the load is fairly balanced across
>> partitions.  GC appears to be the bottleneck in the 3-server
>> configuration.  But even in the two-server configuration the
>> CPU/disk/network is still not fully utilized (the closest is CPU at
>> ~45% on one ingest test).  nodetool tpstats shows only blips of
>> queueing.
>>
>>
>>
>>
>> -- Eric
>>
>> On Mon, Jun 12, 2017 at 9:50 PM, Eric Pederson  wrote:
>>
>>> Hi all - I wanted to follow up on this.  I'm happy with the throughput
>>> we're getting but I'm still curious about the bottleneck.
>>>
>>> The big thing that sticks out is that one of the nodes is logging frequent
>>> GCInspector messages: 350-500ms every 3-6 seconds.  All three nodes in
>>> the cluster have identical Cassandra configuration, but the node that is
>>> logging frequent GCs is an older machine with slower CPU and SSD.  This
>>> node logs frequent GCInspectors both under load and when compacting but
>>> otherwise unloaded.
>>>
>>> My theory is that the other two nodes have similar GC frequency (because
>>> they are seeing the same basic load), but because they are faster machines,
>>> they don't spend as much time per GC and don't cross the GCInspector
>>> threshold.  Does that sound plausible?   nodetool tpstats doesn't show
>>> any queueing in the system.
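>>>
>>> One rough way to check that (assuming the version we're on has it) would
>>> be to compare the nodes with:
>>>
>>>     nodetool gcstats
>>>
>>> and see whether total/max GC elapsed time is similar everywhere, rather
>>> than relying on GCInspector log lines alone.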
>>>
>>> Here's flamegraphs from the system when running a cqlsh COPY FROM:
>>>
>>>- http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>/flamegraph_ultva01_cars_batch2.svg
>>>
>>> 
>>>- http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>/flamegraph_ultva02_cars_batch2.svg
>>>
>>> 
>>>- http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>/flamegraph_ultva03_cars_batch2.svg
>>>
>>> 
>>>
>>> The slow node (ultva03) spends disproportional time in GC.
>>>
>>> Thanks,
>>>
>>>
>>> -- Eric
>>>
>>> On Thu, May 25, 2017 at 8:09 PM, Eric Pederson 
>>> wrote:
>>>
 Due to a cut and paste error those flamegraphs were a recording of the
 whole system, not just Cassandra.  Throughput is approximately 30k
 rows/sec.

 Here's the graphs with just the Cassandra PID:

- http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
/flamegraph_ultva01_sars2.svg
- http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
/flamegraph_ultva02_sars2.svg
- http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
/flamegraph_ultva03_sars2.svg


 And here's graphs during a cqlsh COPY FROM to the same table, using
 real data, MAXBATCHSIZE=2.  Throughput is good at approximately 110k
 rows/sec.

- http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
/flamegraph_ultva01_cars_batch2.svg

 
- http://sourcedelica.com/wordpress/wp-content/uploads/2017/05

Re: Bottleneck for small inserts?

2017-06-13 Thread Jonathan Haddad
Did you try adding more client stress nodes as Patrick recommended?

On Tue, Jun 13, 2017 at 9:31 PM Eric Pederson  wrote:

> Scratch that theory - the flamegraphs show that GC is only 3-4% of the
> two newer machines' overall processing, compared to 18% on the slow machine.
>
> I took that machine out of the cluster completely and recreated the
> keyspaces.  The ingest tests now run slightly faster (!).   I would have
> expected a linear slowdown since the load is fairly balanced across
> partitions.  GC appears to be the bottleneck in the 3-server
> configuration.  But even in the two-server configuration the
> CPU/disk/network is still not fully utilized (the closest is CPU at
> ~45% on one ingest test).  nodetool tpstats shows only blips of queueing.
>
>
>
>
>
> -- Eric
>
> On Mon, Jun 12, 2017 at 9:50 PM, Eric Pederson  wrote:
>
>> Hi all - I wanted to follow up on this.  I'm happy with the throughput
>> we're getting but I'm still curious about the bottleneck.
>>
>> The big thing that sticks out is that one of the nodes is logging frequent
>> GCInspector messages: 350-500ms every 3-6 seconds.  All three nodes in
>> the cluster have identical Cassandra configuration, but the node that is
>> logging frequent GCs is an older machine with slower CPU and SSD.  This
>> node logs frequent GCInspectors both under load and when compacting but
>> otherwise unloaded.
>>
>> My theory is that the other two nodes have similar GC frequency (because
>> they are seeing the same basic load), but because they are faster machines,
>> they don't spend as much time per GC and don't cross the GCInspector
>> threshold.  Does that sound plausible?   nodetool tpstats doesn't show
>> any queueing in the system.
>>
>> Here's flamegraphs from the system when running a cqlsh COPY FROM:
>>
>>-
>>
>> http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg
>>-
>>
>> http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg
>>-
>>
>> http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg
>>
>> The slow node (ultva03) spends disproportional time in GC.
>>
>> Thanks,
>>
>>
>> -- Eric
>>
>> On Thu, May 25, 2017 at 8:09 PM, Eric Pederson  wrote:
>>
>>> Due to a cut and paste error those flamegraphs were a recording of the
>>> whole system, not just Cassandra.  Throughput is approximately 30k
>>> rows/sec.
>>>
>>> Here's the graphs with just the Cassandra PID:
>>>
>>>-
>>>
>>> http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars2.svg
>>>-
>>>
>>> http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars2.svg
>>>-
>>>
>>> http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars2.svg
>>>
>>>
>>> And here's graphs during a cqlsh COPY FROM to the same table, using
>>> real data, MAXBATCHSIZE=2.  Throughput is good at approximately 110k
>>> rows/sec.
>>>
>>>-
>>>
>>> http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg
>>>-
>>>
>>> http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg
>>>-
>>>
>>> http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg
>>>
>>>
>>>
>>>
>>> -- Eric
>>>
>>> On Thu, May 25, 2017 at 6:44 PM, Eric Pederson 
>>> wrote:
>>>
 Totally understood :)

 I forgot to mention - I set the /proc/irq/*/smp_affinity mask to
 include all of the CPUs.  Actually most of them were set that way already
 (for example, ,) - it might be because irqbalanced is
 running.  But for some reason the interrupts are all being handled on CPU 0
 anyway.
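
 For reference, the sort of thing I mean (the interface name, IRQ number and
 mask below are just examples, not our exact values):

     # which CPU is servicing the NIC interrupts right now
     grep eth0 /proc/interrupts

     # allow the IRQ to be handled by any of the 48 CPUs (hex bitmask);
     # note irqbalance may rewrite this if it is running
     echo ffff,ffffffff > /proc/irq/123/smp_affinity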

 I see this in /var/log/dmesg on the machines:

>
> Your BIOS has requested that x2apic be disabled.
> This will leave your machine vulnerable to irq-injection attacks.
> Use 'intremap=no_x2apic_optout' to override BIOS request.
> Enabled IRQ remapping in xapic mode
> x2apic not enabled, IRQ remapping is in xapic mode


 In a reply to one of the comments, he says:


 When IO-APIC configured to spread interrupts among all cores, it can
> handle up to eight cores. If you have more than eight cores, kernel will
> not configure IO-APIC to spread interrupts. Thus the trick I described in
> the article will not work.
> Otherwise it may be caused by buggy BIOS or even buggy hardware.


 I'm not sure if either of them is relevant to my situation.


 Thanks!





 -- Eric

 On Thu, May 25, 2017 at 4:16 PM, Jonathan Haddad 
 wrote:

> You shouldn't need a kernel recompile.  Check out the section "Simple
> solution for the 

Re: Bottleneck for small inserts?

2017-06-13 Thread Eric Pederson
Scratch that theory - the flamegraphs show that GC is only 3-4% of the two
newer machines' overall processing, compared to 18% on the slow machine.

I took that machine out of the cluster completely and recreated the
keyspaces.  The ingest tests now run slightly faster (!).   I would have
expected a linear slowdown since the load is fairly balanced across
partitions.  GC appears to be the bottleneck in the 3-server
configuration.  But even in the two-server configuration the
CPU/disk/network is still not fully utilized (the closest is CPU at
~45% on one ingest test).  nodetool tpstats shows only blips of queueing.
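
For reference, a rough way to pull a GC percentage out of the collapsed
stacks that flamegraph.pl reads (the file name below is a placeholder, the
last field of each line is the sample count, and the frame regex is only a
crude heuristic):

    awk '{ total += $NF; if ($0 ~ /GC|G1/) gc += $NF }
         END { printf "GC frames: %.1f%% of samples\n", 100*gc/total }' \
        ultva03.collapsed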




-- Eric

On Mon, Jun 12, 2017 at 9:50 PM, Eric Pederson  wrote:

> Hi all - I wanted to follow up on this.  I'm happy with the throughput
> we're getting but I'm still curious about the bottleneck.
>
> The big thing that sticks out is that one of the nodes is logging frequent
> GCInspector messages: 350-500ms every 3-6 seconds.  All three nodes in
> the cluster have identical Cassandra configuration, but the node that is
> logging frequent GCs is an older machine with slower CPU and SSD.  This
> node logs frequent GCInspectors both under load and when compacting but
> otherwise unloaded.
>
> My theory is that the other two nodes have similar GC frequency (because
> they are seeing the same basic load), but because they are faster machines,
> they don't spend as much time per GC and don't cross the GCInspector
> threshold.  Does that sound plausible?   nodetool tpstats doesn't show
> any queueing in the system.
>
> Here's flamegraphs from the system when running a cqlsh COPY FROM:
>
>- http://sourcedelica.com/wordpress/wp-content/uploads/2017/
>05/flamegraph_ultva01_cars_batch2.svg
>
> 
>- http://sourcedelica.com/wordpress/wp-content/uploads/2017/
>05/flamegraph_ultva02_cars_batch2.svg
>
> 
>- http://sourcedelica.com/wordpress/wp-content/uploads/2017/
>05/flamegraph_ultva03_cars_batch2.svg
>
> 
>
> The slow node (ultva03) spends disproportional time in GC.
>
> Thanks,
>
>
> -- Eric
>
> On Thu, May 25, 2017 at 8:09 PM, Eric Pederson  wrote:
>
>> Due to a cut and paste error those flamegraphs were a recording of the
>> whole system, not just Cassandra.  Throughput is approximately 30k
>> rows/sec.
>>
>> Here's the graphs with just the Cassandra PID:
>>
>>- http://sourcedelica.com/wordpress/wp-content/uploads/2017/
>>05/flamegraph_ultva01_sars2.svg
>>
>> 
>>- http://sourcedelica.com/wordpress/wp-content/uploads/2017/
>>05/flamegraph_ultva02_sars2.svg
>>
>> 
>>- http://sourcedelica.com/wordpress/wp-content/uploads/2017/
>>05/flamegraph_ultva03_sars2.svg
>>
>> 
>>
>>
>> And here's graphs during a cqlsh COPY FROM to the same table, using real
>> data, MAXBATCHSIZE=2.  Throughput is good at approximately 110k
>> rows/sec.
>>
>>- http://sourcedelica.com/wordpress/wp-content/uploads/2017/
>>05/flamegraph_ultva01_cars_batch2.svg
>>
>> 
>>- http://sourcedelica.com/wordpress/wp-content/uploads/2017/
>>05/flamegraph_ultva02_cars_batch2.svg
>>
>> 
>>- http://sourcedelica.com/wordpress/wp-content/uploads/2017/
>>05/flamegraph_ultva03_cars_batch2.svg
>>
>> 
>>
>>
>>
>>
>> -- Eric
>>
>> On Thu, May 25, 2017 at 6:44 PM, Eric Pederson  wrote:
>>
>>> Totally understood :)
>>>
>>> I forgot to mention - I set the /proc/irq/*/smp_affinity mask to
>>> include all of the CPUs.  Actually most of them were set that way already
>>> (for example, ,) - it might be because irqbalanced is
>>> running.  But for some reason the interrupts are all being handled on CPU 0
>>> anyway.
>>>
>>> I see this in /var/log/dmesg on the machines:
>>>

 Your BIOS has requested that x2apic be disabled.
 This will leave your machine vulnerable to irq-injection attacks.
 Use 'intremap=no_x2apic_optout' to override BIOS request.
 Enabled IRQ remapping in xapic mode
 x2apic not enabled, IRQ remapping is in xapic mode
>>>
>>>
>>> In a reply to one of the comments, he says:
>>>
>>>
>>> When IO-APIC 

Re: Data in multi disks is not evenly distributed

2017-06-13 Thread Akhil Mehra
Hi,

I came across the following method (
https://github.com/apache/cassandra/blob/afd68abe60742c6deb6357ba4605268dfb3d06ea/src/java/org/apache/cassandra/service/StorageService.java#L5006-L5021).
It seems data is evenly split across disks according to local token ranges.

It might be that the stored data is not evenly spread across your partition
keys, hence the imbalance in disk usage.
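
If you want to check that, something like the following might help (I'm
assuming the 2.2 tool name here, and the keyspace/table are just the ones
from your du output):

    nodetool cfhistograms album_media_feature media_region_features_v5

A handful of very large partitions would show up as a big gap between the
partition-size percentiles and the max.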

Regards,
Akhil



On Mon, Jun 12, 2017 at 4:59 PM, Erick Ramirez  wrote:

> That's the cause of the imbalance -- an excessively large sstable which
> suggests to me that at some point you performed a manual major compaction
> with nodetool compact.
>
> If the table is using STCS, there won't be other compaction partners in
> the near future, so you can split the sstable manually with sstablesplit
> (an offline tool, so it requires C* on the node to be shut down temporarily).
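>
> For reference, the invocation looks roughly like this (the path and split
> size are placeholders - check sstablesplit's help on your version first):
>
>     sstablesplit -s 50 /data01/cassandra/<keyspace>/<table>/<big>-Data.db
>
> where -s is the target size in MB; by default the tool snapshots the
> sstable before splitting, so the original can be restored if needed.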
>
>
> On Mon, Jun 12, 2017 at 1:25 PM, Xihui He  wrote:
>
>> Hi Vladimir,
>>
>> The disks are all the same size, 1.8T as shown in df, and only used by
>> Cassandra.
>> It seems to me that maybe the largest compacted file is on data01, which
>> uses 1.1T.
>>
>> Thanks,
>> Xihui
>>
>> On 11 June 2017 at 17:26, Vladimir Yudovin  wrote:
>>
>>> Hi,
>>>
>>> Do your disks have the same size? AFAIK Cassandra distributes data in
>>> proportion to disk size, i.e. keeps the same percentage of space in use.
>>>
>>> Best regards, Vladimir Yudovin,
>>> Winguzone - Cloud Cassandra Hosting
>>>
>>>
>>>  On Wed, 07 Jun 2017 06:15:48 -0400 Xihui He wrote:
>>>
>>> Dear All,
>>>
>>> We are using multiple disks per node and find the data is not evenly
>>> distributed (data01 uses 1.1T, but data02 uses 353G). Is this expected? If
>>> data01 becomes full, would the node still be writable? We are using 2.2.6.
>>>
>>> Thanks,
>>> Xihui
>>>
>>> data_file_directories:
>>> - /data00/cassandra
>>> - /data01/cassandra
>>> - /data02/cassandra
>>> - /data03/cassandra
>>> - /data04/cassandra
>>>
>>> df
>>> /dev/sde1   1.8T  544G  1.2T  32% /data03
>>> /dev/sdc1   1.8T  1.1T  683G  61% /data01
>>> /dev/sdf1   1.8T  491G  1.3T  29% /data04
>>> /dev/sdd1   1.8T  353G  1.4T  21% /data02
>>> /dev/sdb1   1.8T  285G  1.5T  17% /data00
>>>
>>> root@n9-016-015:~# du -sh /data01/cassandra/album_media_feature/*
>>> 143M /data01/cassandra/album_media_feature/media_feature_blur-066
>>> e5700c41511e5beacf197ae340934
>>> 4.4G /data01/cassandra/album_media_feature/media_feature_c1-dbadf
>>> 930c41411e5974743d3a691d887
>>> 56K /data01/cassandra/album_media_feature/media_feature_duplicat
>>> e-09d4b380c41511e58501e9aa37be91a5
>>> 16K /data01/cassandra/album_media_feature/media_feature_emotion-
>>> b8570470054d11e69fb88f073bab8267
>>> 240M /data01/cassandra/album_media_feature/media_feature_exposure
>>> -f55449c0c41411e58f5c9b66773b60c3
>>> 649M /data01/cassandra/album_media_feature/media_feature_group-f8
>>> de0cc0c41411e5827b995f709095c8
>>> 22G /data01/cassandra/album_media_feature/media_feature_multi_cl
>>> ass-cf3bb72006c511e69fb88f073bab8267
>>> 44K /data01/cassandra/album_media_feature/media_feature_pool5-11
>>> 85b200c41511e5b7d8757e25e34d67
>>> 15G /data01/cassandra/album_media_feature/media_feature_poster-f
>>> cf45850c41411e597bb1507d1856305
>>> 8.0K /data01/cassandra/album_media_feature/media_feature_quality-
>>> 155d9500c41511e5974743d3a691d887
>>> 17G /data01/cassandra/album_media_feature/media_feature_quality_
>>> rc-51babf50dba811e59fb88f073bab8267
>>> 8.7G /data01/cassandra/album_media_feature/media_feature_scene-00
>>> 8a5050c41511e59ebcc3582d286c8d
>>> 8.0K /data01/cassandra/album_media_feature/media_region_features_
>>> v4-29a0cd10150611e6bd3e3f41faa2612a
>>> 971G /data01/cassandra/album_media_feature/media_region_features_
>>> v5-1b805470a3d711e68121757e9ac51b7b
>>>
>>> root@n9-016-015:~# du -sh /data02/cassandra/album_media_feature/*
>>> 1.6G /data02/cassandra/album_media_feature/media_feature_blur-066
>>> e5700c41511e5beacf197ae340934
>>> 44G /data02/cassandra/album_media_feature/media_feature_c1-dbadf
>>> 930c41411e5974743d3a691d887
>>> 64K /data02/cassandra/album_media_feature/media_feature_duplicat
>>> e-09d4b380c41511e58501e9aa37be91a5
>>> 75G /data02/cassandra/album_media_feature/media_feature_emotion-
>>> b8570470054d11e69fb88f073bab8267
>>> 2.0G /data02/cassandra/album_media_feature/media_feature_exposure
>>> -f55449c0c41411e58f5c9b66773b60c3
>>> 21G /data02/cassandra/album_media_feature/media_feature_group-f8
>>> de0cc0c41411e5827b995f709095c8
>>> 336M /data02/cassandra/album_media_feature/media_feature_multi_cl
>>> ass-cf3bb72006c511e69fb88f073bab8267
>>> 44K /data02/cassandra/album_media_feature/media_feature_pool5-11
>>> 85b200c41511e5b7d8757e25e34d67
>>> 2.0G /data02/cassandra/album_media_feature/media_feature_poster-f
>>> cf45850c41411e597bb1507d1856305
>>> 8.0K 

Re: Convert single node C* to cluster (rebalancing problem)

2017-06-13 Thread John Hughes
OP, I was just looking at your original numbers and I have some questions:

270GB on one node and 414KB on the other, but something close to 50/50 on
"Owns(effective)".
What replication factor are your keyspaces set up with? 1x or 2x or ??

I would say you are seeing 50/50 because the tokens are allocated
50/50 (others on the list, please correct what are for me really just
assumptions), but I would hazard a guess that your replication factor
is still 1x, so it isn't moving anything around. Or your keyspace
replication is incorrect and isn't being distributed (I have had issues with
the AWSMultiRegionSnitch and not getting the region correct, e.g. us-east vs
us-east-1). It doesn't throw an error, but it doesn't work very well either
=)

Can you do a 'describe keyspace XXX' and show the first line (the CREATE
KEYSPACE line)?
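
Something like this - the keyspace name and settings below are only an
example of the shape of that first line, not what yours should be:

    cqlsh -e "DESCRIBE KEYSPACE my_keyspace"

    CREATE KEYSPACE my_keyspace WITH replication =
        {'class': 'NetworkTopologyStrategy', 'your-dc-name': '2'}
    AND durable_writes = true;

The DC name in the replication map has to match what nodetool status
reports for your nodes, and passing a keyspace to nodetool status
(nodetool status <keyspace>) is what gives per-keyspace effective
ownership numbers.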

Mind you, these are all just shots in the dark from here.

Cheers,


On Tue, Jun 13, 2017 at 3:13 AM Junaid Nasir  wrote:

> Is the OP expecting a perfect 50%/50% split?
>
>
> best result I got was 240gb/30gb split, which I think is not properly
> balanced.
>
>
>> Also, what are your outputs when you call out specific keyspaces? Do the
>> numbers get more even?
>
>
> I don't know what you mean by *call out specific keyspaces* - can you
> please explain that a bit?
>
>
> If your schema is not modelled correctly you can easily end up unevenly
>> distributed data.
>
>
> I think that is the problem. The initial 270gb of data might not be modeled
> correctly. I have run a lot of tests on the 270gb data, including downsizing
> it to 5gb; they all resulted in the same uneven distribution. I also tested a
> dummy dataset of 2gb which was balanced evenly. Coming from a relational
> database background, I didn't give much thought to data modeling. Can anyone
> please point me to some resources regarding this problem?
>
> On Tue, Jun 13, 2017 at 3:24 AM, Akhil Mehra  wrote:
>
>> Great point John.
>>
>> The OP should also note that data distribution also depends on your
>> schema and incoming data profile.
>>
>> If your schema is not modelled correctly you can easily end up unevenly
>> distributed data.
>>
>> Cheers,
>> Akhil
>>
>> On Tue, Jun 13, 2017 at 3:36 AM, John Hughes 
>> wrote:
>>
>>> Is the OP expecting a perfect 50%/50% split? That, in my experience, is
>>> not going to happen; it is almost always off by anywhere from a fraction
>>> of a percent to a couple of percent.
>>>
>>> Datacenter: eu-west
>>> ===
>>> Status=Up/Down
>>> |/ State=Normal/Leaving/Joining/Moving
>>> --  AddressLoad   Tokens   Owns (effective)  Host ID
>>>   Rack
>>> UN  XX.XX.XX.XX22.71 GiB  256  47.6%
>>> 57dafdde-2f62-467c-a8ff-c91e712f89c9  1c
>>> UN  XX.XX.XX.XX  17.17 GiB  256  51.3%
>>> d2a65c51-087d-48de-ae1f-a41142eb148d  1b
>>> UN  XX.XX.XX.XX  26.15 GiB  256  52.4%
>>> acf5dd34-5b81-4e5b-b7be-85a7fccd8e1c  1c
>>> UN  XX.XX.XX.XX   16.64 GiB  256  50.2%
>>> 6c8842dd-a966-467c-a7bc-bd6269ce3e7e  1a
>>> UN  XX.XX.XX.XX  24.39 GiB  256  49.8%
>>> fd92525d-edf2-4974-8bc5-a350a8831dfa  1a
>>> UN  XX.XX.XX.XX   23.8 GiB   256  48.7%
>>> bdc597c0-718c-4ef6-b3ef-7785110a9923  1b
>>>
>>> Though maybe part of what you are experiencing can be cleared up by
>>> repair/compaction/cleanup. Also, what are your outputs when you call out
>>> specific keyspaces? Do the numbers get more even?
>>>
>>> Cheers,
>>>
>>> On Mon, Jun 12, 2017 at 5:22 AM Akhil Mehra 
>>> wrote:
>>>
 auto_bootstrap is true by default. Ensure it's set to true. On startup
 look at your logs for your auto_bootstrap value.  Look at the node
 configuration line in your log file.

 Akhil

 On Mon, Jun 12, 2017 at 6:18 PM, Junaid Nasir  wrote:

> No, I didn't set it (left it at default value)
>
> On Fri, Jun 9, 2017 at 3:18 AM, ZAIDI, ASAD A  wrote:
>
>> Did you make sure auto_bootstrap property is indeed set to [true]
>> when you added the node?
>>
>>
>>
>> *From:* Junaid Nasir [mailto:jna...@an10.io]
>> *Sent:* Monday, June 05, 2017 6:29 AM
>> *To:* Akhil Mehra 
>> *Cc:* Vladimir Yudovin ;
>> user@cassandra.apache.org
>> *Subject:* Re: Convert single node C* to cluster (rebalancing
>> problem)
>>
>>
>>
>> not evenly, i have setup a new cluster with subset of data (around
>> 5gb). using the configuration above I am getting these results
>>
>>
>>
>> Datacenter: datacenter1
>>
>> ===
>>
>> Status=Up/Down
>>
>> |/ State=Normal/Leaving/Joining/Moving
>>
>> --  Address  Load   Tokens   Owns (effective)  Host ID 
>> Rack
>>
>> UN  10.128.2.1   4.86 GiB   256  44.9% 
>> e4427611-c247-42ee-9404-371e177f5f17  rack1
>>
>> UN  

Re: Node replacement strategy with AWS EBS

2017-06-13 Thread John Hughes
In AWS, I just grow the cluster 2x, then shrink away the old nodes via
decommission. Mind you I am not dealing with TBs of data, just hundreds of
gigs. Also, I have deployment automated with Cloud Formation and Priam.
YMMV.

On Tue, Jun 13, 2017 at 2:22 PM Cogumelos Maravilha <
cogumelosmaravi...@sapo.pt> wrote:

> Simplest way of all: if you are using RF>=2, simply terminate the old
> instance and create a new one.
> Cheers.
>
>
> On 13-06-2017 18:01, Rutvij Bhatt wrote:
>
> Nevermind, I misunderstood the first link. In this case, the replacement
> would just be leaving the listen_address as is (to
> InetAddress.getLocalHost()) and just start the new instance up as you
> pointed out in your original answer Hannu.
>
> Thanks.
>
> On Tue, Jun 13, 2017 at 12:35 PM Rutvij Bhatt  wrote:
>
>> Hannu/Nitan,
>>
>> Thanks for your help so far! From what you said in your first response, I
>> can get away with just attaching the EBS volume to Cassandra and starting
>> it with the old node's private IP as my listen_address because it will take
>> over the token assignment from the old node using the data files? With
>> regards to "Cassandra automatically realizes that have just effectively
>> changed IP address.", it says in the first link to change this manually to
>> the desired address - does this not apply in my case if I'm replacing the
>> old node?
>>
>> As for the plan I outlined earlier, is this more for DR scenarios where I
>> have lost a node due to hardware failure and I need to recover the data in
>> a safe manner by requesting a stream from the other replicas?  Am I
>> understanding this right?
>>
>>
>> On Tue, Jun 13, 2017 at 11:59 AM Hannu Kröger  wrote:
>>
>>> Hello,
>>>
>>> So the local information about tokens is stored in the system keyspace.
>>> Also the host id and all that.
>>>
>>> Also documented here:
>>>
>>> https://support.datastax.com/hc/en-us/articles/204289959-Changing-IP-addresses-in-DSE
>>>
>>> If for any reason that causes issues, you can also check this:
>>> https://issues.apache.org/jira/browse/CASSANDRA-8382
>>>
>>> If you copy all cassandra data, you are on the safe side. Good point in
>>> the links is that if you have IP addresses in topology or other files, then
>>> update those as well.
>>>
>>> Hannu
>>>
>>> On 13 June 2017 at 11:53:13, Nitan Kainth (ni...@bamlabs.com) wrote:
>>>
>>> Hannu,
>>>
>>> "Cassandra automatically realizes that have just effectively changed IP
>>> address” —> are you sure C* will take care of IP change as is? How will it
>>> know which token range to be assigned to this new IP address?
>>>
>>> On Jun 13, 2017, at 10:51 AM, Hannu Kröger  wrote:
>>>
>>> Cassandra automatically realizes that have just effectively changed IP
>>> address
>>>
>>>
>>>
>


Re: Node replacement strategy with AWS EBS

2017-06-13 Thread Cogumelos Maravilha
Simplest way of all: if you are using RF>=2, simply terminate the old
instance and create a new one.

Cheers.

On 13-06-2017 18:01, Rutvij Bhatt wrote:
> Nevermind, I misunderstood the first link. In this case, the
> replacement would just be leaving the listen_address as is (to
> InetAddress.getLocalHost()) and just start the new instance up as you
> pointed out in your original answer Hannu.
>
> Thanks.
>
> On Tue, Jun 13, 2017 at 12:35 PM Rutvij Bhatt  > wrote:
>
> Hannu/Nitan,
>
> Thanks for your help so far! From what you said in your first
> response, I can get away with just attaching the EBS volume to
> Cassandra and starting it with the old node's private IP as my
> listen_address because it will take over the token assignment from
> the old node using the data files? With regards to "Cassandra
> automatically realizes that have just effectively changed IP
> address.", it says in the first link to change this manually to
> the desired address - does this not apply in my case if I'm
> replacing the old node?
>
> As for the plan I outlined earlier, is this more for DR scenarios
> where I have lost a node due to hardware failure and I need to
> recover the data in a safe manner by requesting a stream from the
> other replicas?  Am I understanding this right?
>
>
> On Tue, Jun 13, 2017 at 11:59 AM Hannu Kröger  > wrote:
>
> Hello,
>
> So the local information about tokens is stored in the system
> keyspace. Also the host id and all that.
>
> Also documented here:
> 
> https://support.datastax.com/hc/en-us/articles/204289959-Changing-IP-addresses-in-DSE
>
> If for any reason that causes issues, you can also check this:
> https://issues.apache.org/jira/browse/CASSANDRA-8382
>
> If you copy all cassandra data, you are on the safe side. Good
> point in the links is that if you have IP addresses in topology
> or other files, then update those as well.
>
> Hannu
>
> On 13 June 2017 at 11:53:13, Nitan Kainth (ni...@bamlabs.com
> ) wrote:
>
>> Hannu, 
>>
>> "Cassandra automatically realizes that have just effectively
>> changed IP address” —> are you sure C* will take care of IP
>> change as is? How will it know which token range to be
>> assigned to this new IP address?
>>
>>> On Jun 13, 2017, at 10:51 AM, Hannu Kröger
>>> > wrote:
>>>
>>> Cassandra automatically realizes that have just effectively
>>> changed IP address
>>



Re: Node replacement strategy with AWS EBS

2017-06-13 Thread Rutvij Bhatt
Nevermind, I misunderstood the first link. In this case, the replacement
would just be leaving the listen_address as is (to
InetAddress.getLocalHost()) and just start the new instance up as you
pointed out in your original answer Hannu.

Thanks.

On Tue, Jun 13, 2017 at 12:35 PM Rutvij Bhatt  wrote:

> Hannu/Nitan,
>
> Thanks for your help so far! From what you said in your first response, I
> can get away with just attaching the EBS volume to Cassandra and starting
> it with the old node's private IP as my listen_address because it will take
> over the token assignment from the old node using the data files? With
> regards to "Cassandra automatically realizes that have just effectively
> changed IP address.", it says in the first link to change this manually to
> the desired address - does this not apply in my case if I'm replacing the
> old node?
>
> As for the plan I outlined earlier, is this more for DR scenarios where I
> have lost a node due to hardware failure and I need to recover the data in
> a safe manner by requesting a stream from the other replicas?  Am I
> understanding this right?
>
>
> On Tue, Jun 13, 2017 at 11:59 AM Hannu Kröger  wrote:
>
>> Hello,
>>
>> So the local information about tokens is stored in the system keyspace.
>> Also the host id and all that.
>>
>> Also documented here:
>>
>> https://support.datastax.com/hc/en-us/articles/204289959-Changing-IP-addresses-in-DSE
>>
>> If for any reason that causes issues, you can also check this:
>> https://issues.apache.org/jira/browse/CASSANDRA-8382
>>
>> If you copy all cassandra data, you are on the safe side. Good point in
>> the links is that if you have IP addresses in topology or other files, then
>> update those as well.
>>
>> Hannu
>>
>> On 13 June 2017 at 11:53:13, Nitan Kainth (ni...@bamlabs.com) wrote:
>>
>> Hannu,
>>
>> "Cassandra automatically realizes that have just effectively changed IP
>> address” —> are you sure C* will take care of IP change as is? How will it
>> know which token range to be assigned to this new IP address?
>>
>> On Jun 13, 2017, at 10:51 AM, Hannu Kröger  wrote:
>>
>> Cassandra automatically realizes that have just effectively changed IP
>> address
>>
>>
>>


Re: Node replacement strategy with AWS EBS

2017-06-13 Thread Rutvij Bhatt
Hannu/Nitan,

Thanks for your help so far! From what you said in your first response, I
can get away with just attaching the EBS volume to Cassandra and starting
it with the old node's private IP as my listen_address because it will take
over the token assignment from the old node using the data files? With
regards to "Cassandra automatically realizes that have just effectively
changed IP address.", it says in the first link to change this manually to
the desired address - does this not apply in my case if I'm replacing the
old node?

As for the plan I outlined earlier, is this more for DR scenarios where I
have lost a node due to hardware failure and I need to recover the data in
a safe manner by requesting a stream from the other replicas?  Am I
understanding this right?


On Tue, Jun 13, 2017 at 11:59 AM Hannu Kröger  wrote:

> Hello,
>
> So the local information about tokens is stored in the system keyspace.
> Also the host id and all that.
>
> Also documented here:
>
> https://support.datastax.com/hc/en-us/articles/204289959-Changing-IP-addresses-in-DSE
>
> If for any reason that causes issues, you can also check this:
> https://issues.apache.org/jira/browse/CASSANDRA-8382
>
> If you copy all cassandra data, you are on the safe side. Good point in
> the links is that if you have IP addresses in topology or other files, then
> update those as well.
>
> Hannu
>
> On 13 June 2017 at 11:53:13, Nitan Kainth (ni...@bamlabs.com) wrote:
>
> Hannu,
>
> "Cassandra automatically realizes that have just effectively changed IP
> address” —> are you sure C* will take care of IP change as is? How will it
> know which token range to be assigned to this new IP address?
>
> On Jun 13, 2017, at 10:51 AM, Hannu Kröger  wrote:
>
> Cassandra automatically realizes that have just effectively changed IP
> address
>
>
>


Re: Node replacement strategy with AWS EBS

2017-06-13 Thread Nitan Kainth
Thank you Hannu.


> On Jun 13, 2017, at 10:59 AM, Hannu Kröger  wrote:
> 
> Hello,
> 
> So the local information about tokens is stored in the system keyspace. Also 
> the host id and all that.
> 
> Also documented here:
> https://support.datastax.com/hc/en-us/articles/204289959-Changing-IP-addresses-in-DSE
>  
> 
> 
> If for any reason that causes issues, you can also check this:
> https://issues.apache.org/jira/browse/CASSANDRA-8382 
> 
> 
> If you copy all cassandra data, you are on the safe side. Good point in the 
> links is that if you have IP addresses in topology or other files, then
> those as well.
> 
> Hannu
> 
> On 13 June 2017 at 11:53:13, Nitan Kainth (ni...@bamlabs.com 
> ) wrote:
> 
>> Hannu, 
>> 
>> "Cassandra automatically realizes that have just effectively changed IP 
>> address” —> are you sure C* will take care of IP change as is? How will it 
>> know which token range to be assigned to this new IP address?
>> 
>>> On Jun 13, 2017, at 10:51 AM, Hannu Kröger >> > wrote:
>>> 
>>> Cassandra automatically realizes that have just effectively changed IP 
>>> address



Re: Node replacement strategy with AWS EBS

2017-06-13 Thread Hannu Kröger
Hello,

So the local information about tokens is stored in the system keyspace.
Also the host id and all that.
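
For example, you can see what a node has recorded about itself with
something along these lines (run against the node in question):

    cqlsh -e "SELECT host_id, tokens FROM system.local;"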

Also documented here:
https://support.datastax.com/hc/en-us/articles/204289959-Changing-IP-addresses-in-DSE

If for any reason that causes issues, you can also check this:
https://issues.apache.org/jira/browse/CASSANDRA-8382

If you copy all cassandra data, you are on the safe side. Good point in the
links is that if you have IP addresses in topology or other files, then
update those as well.

Hannu

On 13 June 2017 at 11:53:13, Nitan Kainth (ni...@bamlabs.com) wrote:

Hannu,

"Cassandra automatically realizes that have just effectively changed IP
address” —> are you sure C* will take care of IP change as is? How will it
know which token range to be assigned to this new IP address?

On Jun 13, 2017, at 10:51 AM, Hannu Kröger  wrote:

Cassandra automatically realizes that have just effectively changed IP
address


Re: Node replacement strategy with AWS EBS

2017-06-13 Thread Nitan Kainth
Hannu, 

"Cassandra automatically realizes that have just effectively changed IP 
address” —> are you sure C* will take care of IP change as is? How will it know 
which token range to be assigned to this new IP address?

> On Jun 13, 2017, at 10:51 AM, Hannu Kröger  wrote:
> 
> Cassandra automatically realizes that have just effectively changed IP address



Re: Node replacement strategy with AWS EBS

2017-06-13 Thread Hannu Kröger
Hello,

I think that’s not the optimal way to handle it.

If you are just attaching the same EBS volume to a new node you can do like
this:
1) nodetool drain on old
2) stop cassandra on old
3) Attach EBS to new node
4) Start Cassandra on new node

Cassandra automatically realizes that you have just effectively changed the
IP address.

replace_address will also stream all the data, so that’s an inefficient way to
do it if you already have all the data.
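
Roughly, assuming the data directory lives on its own EBS volume and gets
mounted at the same path on the new node (volume/instance IDs and the
device name below are placeholders):

    # on the old node
    nodetool drain
    sudo service cassandra stop        # then unmount the data filesystem

    # move the data volume to the new instance
    aws ec2 detach-volume --volume-id vol-0123abcd
    aws ec2 attach-volume --volume-id vol-0123abcd \
        --instance-id i-0456efgh --device /dev/xvdf

    # on the new node: mount it at the same path, then
    sudo service cassandra start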

Hannu

On 13 June 2017 at 11:23:56, Rutvij Bhatt (rut...@sense.com) wrote:

Hi!

We're running a Cassandra cluster on AWS. I want to replace an old node
with EBS storage with a new one. The steps I'm following are as follows and
I want to get a second opinion on whether this is the right thing to do:

1. Remove old node from gossip.
2. Run nodetool drain
3. Stop cassandra
> 4. Create a new node and update JVM_OPTS in cassandra-env.sh with
cassandra.replace_address= as instructed
here -
http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsReplaceNode.html
5. Attach the EBS volume from the old node at the same mount point.
6. Start cassandra on the new node.
7. Run nodetool repair to catch the replacing node up on whatever it has
missed.

Thanks!


Re: Node replacement strategy with AWS EBS

2017-06-13 Thread Rutvij Bhatt
Nitan,

Yes, that is what I've done. I snapshotted the volume after step 3 and will
create a new volume from that snapshot and attach it to the new instance.
Curious if I am indeed replacing a node completely, is there any logical
difference between snapshot->create->attach vs detach from old->attach to
new besides a margin of safety?

Thanks for your reply!

On Tue, Jun 13, 2017 at 11:37 AM Nitan Kainth  wrote:

> Steps are good Rutvij. Step 1 is not mandatory.
>
> We snapshot the EBS volume and then restore it on the new node. How are you
> re-attaching the EBS volume without a snapshot?
>
>
> I
>
> On Jun 13, 2017, at 10:21 AM, Rutvij Bhatt  wrote:
>
> Hi!
>
> We're running a Cassandra cluster on AWS. I want to replace an old node
> with EBS storage with a new one. The steps I'm following are as follows and
> I want to get a second opinion on whether this is the right thing to do:
>
> 1. Remove old node from gossip.
> 2. Run nodetool drain
> 3. Stop cassandra
> 4. Create a new node and update JVM_OPTS in cassandra-env.sh with
> cassandra.replace_address= as instructed
> here -
> http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsReplaceNode.html
> 5. Attach the EBS volume from the old node at the same mount point.
> 6. Start cassandra on the new node.
> 7. Run nodetool repair to catch the replacing node up on whatever it has
> missed.
>
> Thanks!
>
>
>


Re: Node replacement strategy with AWS EBS

2017-06-13 Thread Nitan Kainth
Steps are good Rutvij. Step 1 is not mandatory. 

We snapshot the EBS volume and then restore it on the new node. How are you
re-attaching the EBS volume without a snapshot?


I
> On Jun 13, 2017, at 10:21 AM, Rutvij Bhatt  wrote:
> 
> Hi!
> 
> We're running a Cassandra cluster on AWS. I want to replace an old node with 
> EBS storage with a new one. The steps I'm following are as follows and I want 
> to get a second opinion on whether this is the right thing to do:
> 
> 1. Remove old node from gossip.
> 2. Run nodetool drain
> 3. Stop cassandra
> 4. Create a new node and update JVM_OPTS in cassandra-env.sh with
> cassandra.replace_address= as instructed here 
> - 
> http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsReplaceNode.html
>  
> 
> 5. Attach the EBS volume from the old node at the same mount point.
> 6. Start cassandra on the new node.
> 7. Run nodetool repair to catch the replacing node up on whatever it has 
> missed.
> 
> Thanks!



Node replacement strategy with AWS EBS

2017-06-13 Thread Rutvij Bhatt
Hi!

We're running a Cassandra cluster on AWS. I want to replace an old node
with EBS storage with a new one. The steps I'm following are as follows and
I want to get a second opinion on whether this is the right thing to do:

1. Remove old node from gossip.
2. Run nodetool drain
3. Stop cassandra
4. Create a new node and update JVM_OPTS in cassandra-env.sh with
cassandra.replace_address= (see the example line below this list) as
instructed here -
http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsReplaceNode.html
5. Attach the EBS volume from the old node at the same mount point.
6. Start cassandra on the new node.
7. Run nodetool repair to catch the replacing node up on whatever it has
missed.
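
For step 4, the line added to cassandra-env.sh would look roughly like this
(the IP is a placeholder for the old node's address):

    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=10.0.0.12"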

Thanks!


Re: Convert single node C* to cluster (rebalancing problem)

2017-06-13 Thread Junaid Nasir
>
> Is the OP expecting a perfect 50%/50% split?


best result I got was 240gb/30gb split, which I think is not properly
balanced.


> Also, what are your outputs when you call out specific keyspaces? Do the
> numbers get more even?


I don't know what you mean by *call out specific keyspaces* - can you
please explain that a bit?


If your schema is not modelled correctly you can easily end up unevenly
> distributed data.


I think that is the problem. The initial 270gb of data might not be modeled
correctly. I have run a lot of tests on the 270gb data, including downsizing
it to 5gb; they all resulted in the same uneven distribution. I also tested a
dummy dataset of 2gb which was balanced evenly. Coming from a relational
database background, I didn't give much thought to data modeling. Can anyone
please point me to some resources regarding this problem?

On Tue, Jun 13, 2017 at 3:24 AM, Akhil Mehra  wrote:

> Great point John.
>
> The OP should also note that data distribution also depends on your schema
> and incoming data profile.
>
> If your schema is not modelled correctly you can easily end up unevenly
> distributed data.
>
> Cheers,
> Akhil
>
> On Tue, Jun 13, 2017 at 3:36 AM, John Hughes 
> wrote:
>
>> Is the OP expecting a perfect 50%/50% split? That, in my experience, is
>> not going to happen; it is almost always off by anywhere from a fraction
>> of a percent to a couple of percent.
>>
>> Datacenter: eu-west
>> ===
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  AddressLoad   Tokens   Owns (effective)  Host ID
>>   Rack
>> UN  XX.XX.XX.XX22.71 GiB  256  47.6%
>> 57dafdde-2f62-467c-a8ff-c91e712f89c9  1c
>> UN  XX.XX.XX.XX  17.17 GiB  256  51.3%
>> d2a65c51-087d-48de-ae1f-a41142eb148d  1b
>> UN  XX.XX.XX.XX  26.15 GiB  256  52.4%
>> acf5dd34-5b81-4e5b-b7be-85a7fccd8e1c  1c
>> UN  XX.XX.XX.XX   16.64 GiB  256  50.2%
>> 6c8842dd-a966-467c-a7bc-bd6269ce3e7e  1a
>> UN  XX.XX.XX.XX  24.39 GiB  256  49.8%
>> fd92525d-edf2-4974-8bc5-a350a8831dfa  1a
>> UN  XX.XX.XX.XX   23.8 GiB   256  48.7%
>> bdc597c0-718c-4ef6-b3ef-7785110a9923  1b
>>
>> Though maybe part of what you are experiencing can be cleared up by
>> repair/compaction/cleanup. Also, what are your outputs when you call out
>> specific keyspaces? Do the numbers get more even?
>>
>> Cheers,
>>
>> On Mon, Jun 12, 2017 at 5:22 AM Akhil Mehra  wrote:
>>
>>> auto_bootstrap is true by default. Ensure it's set to true. On startup
>>> look at your logs for your auto_bootstrap value.  Look at the node
>>> configuration line in your log file.
>>>
>>> Akhil
>>>
>>> On Mon, Jun 12, 2017 at 6:18 PM, Junaid Nasir  wrote:
>>>
 No, I didn't set it (left it at default value)

 On Fri, Jun 9, 2017 at 3:18 AM, ZAIDI, ASAD A  wrote:

> Did you make sure auto_bootstrap property is indeed set to [true]
> when you added the node?
>
>
>
> *From:* Junaid Nasir [mailto:jna...@an10.io]
> *Sent:* Monday, June 05, 2017 6:29 AM
> *To:* Akhil Mehra 
> *Cc:* Vladimir Yudovin ;
> user@cassandra.apache.org
> *Subject:* Re: Convert single node C* to cluster (rebalancing problem)
>
>
>
> not evenly, i have setup a new cluster with subset of data (around
> 5gb). using the configuration above I am getting these results
>
>
>
> Datacenter: datacenter1
>
> ===
>
> Status=Up/Down
>
> |/ State=Normal/Leaving/Joining/Moving
>
> --  Address  Load   Tokens   Owns (effective)  Host ID 
> Rack
>
> UN  10.128.2.1   4.86 GiB   256  44.9% 
> e4427611-c247-42ee-9404-371e177f5f17  rack1
>
> UN  10.128.2.10  725.03 MiB  256 55.1% 
> 690d5620-99d3-4ae3-aebe-8f33af54a08b  rack1
>
> is there anything else I can tweak/check to make the distribution even?
>
>
>
> On Sat, Jun 3, 2017 at 3:30 AM, Akhil Mehra 
> wrote:
>
> So now the data is evenly balanced in both nodes?
>
>
>
> Refer to the following documentation to get a better understanding of
> the rpc_address and the broadcast_rpc_address:
> https://www.instaclustr.com/demystifying-cassandras-broadcast_address/
> I am surprised that your node started up with rpc_broadcast_address
> set as this is an unsupported property. I am assuming you are using
> Cassandra version 3.10.
>
>
>
>
>
> Regards,
>
> Akhil
>
>