Re: High system CPU during high write workload

2016-11-15 Thread Bhuvan Rawal
Hi Ben,

Thanks for your reply. We tested the same workload on kernel
version 4.6.4-1.el7.elrepo.x86_64 and found that the issue is not present
there.

This had resulted in really high system CPU on write workloads, an area in
which Cassandra excels, degrading performance by at least 5x! I suggest
mentioning this in the Cassandra community wiki, as it could impact a
large audience.

Thanks & Regards,
Bhuvan

On Tue, Nov 15, 2016 at 12:33 PM, Ben Bromhead  wrote:

> Hi Abhishek
>
> The article with the futex bug description lists the solution, which is to
> upgrade to a version of RHEL or CentOS that has the specified patch.
>
> What help do you specifically need? If you need help upgrading the OS I
> would look at the documentation for RHEL or CentOS.
>
> Ben
>
> On Mon, 14 Nov 2016 at 22:48 Abhishek Gupta 
> wrote:
>
> Hi,
>
> We are seeing an issue where the system CPU is shooting up to a figure of
> > 90% when the cluster is subjected to a relatively high write workload,
> i.e. 4k wreq/sec.
>
> 2016-11-14T13:27:47.900+0530 Process summary
>   process cpu=695.61%
>   application cpu=676.11% (user=200.63% sys=475.49%)  <== Very High System CPU
>   other: cpu=19.49%
>   heap allocation rate 403mb/s
> [000533] user= 1.43% sys= 6.91% alloc= 2216kb/s - SharedPool-Worker-129
> [000274] user= 0.38% sys= 7.78% alloc= 2415kb/s - SharedPool-Worker-34
> [000292] user= 1.24% sys= 6.77% alloc= 2196kb/s - SharedPool-Worker-56
> [000487] user= 1.24% sys= 6.69% alloc= 2260kb/s - SharedPool-Worker-79
> [000488] user= 1.24% sys= 6.56% alloc= 2064kb/s - SharedPool-Worker-78
> [000258] user= 1.05% sys= 6.66% alloc= 2250kb/s - SharedPool-Worker-41
>
> On doing strace it was found that the following system call is consuming
> all the system CPU
>  timeout 10s strace -f -p 5954 -c -q
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  88.33 1712.798399       16674    102723     22191 futex
>   3.98   77.098730        4356     17700           read
>   3.27   63.474795      394253       161        29 restart_syscall
>   3.23   62.601530       29768      2103           epoll_wait
>
> On searching, we found that the following bug in the RHEL 6.6 / CentOS 6.6
> kernel seems to be a probable cause of the issue:
>
> https://docs.datastax.com/en/landing_page/doc/landing_page/troubleshooting/cassandra/fetuxWaitBug.html
>
> The patch mentioned in the doc is also not present in our kernel:
>
> sudo rpm -q --changelog kernel-`uname -r` | grep futex | grep ref
> - [kernel] futex_lock_pi() key refcnt fix (Danny Feng) [566347]
> {CVE-2010-0623}
>
> Can someone who has faced and resolved this issue help us here?
>
> Thanks,
> Abhishek
>
>
> --
> Ben Bromhead
> CTO | Instaclustr 
> +1 650 284 9692
> Managed Cassandra / Spark on AWS, Azure and Softlayer
>


Re: Schema Changes

2016-11-15 Thread Matija Gobec
We used a Cassandra migration tool for schema versioning and schema
agreement. Check it out here.

In short: when executing schema-altering statements, use these to wait for
schema propagation:
resultSet.getExecutionInfo().isSchemaInAgreement()
and
session.getCluster().getMetadata().checkSchemaAgreement()

For detailed info check the driver documentation. This solution is based on
this fix.
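
For illustration, a minimal sketch of such a wait with the DataStax Java
Driver 3.x (the helper name, contact point and 30-second timeout below are
just placeholders):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class SchemaAgreementExample {

    // Hypothetical helper: run a DDL statement and block until schema agreement is reported.
    static void executeDdlAndWait(Session session, String ddl) throws InterruptedException {
        ResultSet rs = session.execute(ddl);

        // Fast path: the driver already observed agreement for this statement.
        if (rs.getExecutionInfo().isSchemaInAgreement()) {
            return;
        }

        // Otherwise poll the cluster metadata until all live nodes agree (give up after ~30s).
        long deadline = System.currentTimeMillis() + 30_000;
        while (System.currentTimeMillis() < deadline) {
            if (session.getCluster().getMetadata().checkSchemaAgreement()) {
                return;
            }
            Thread.sleep(500);
        }
        throw new IllegalStateException("Schema agreement not reached after: " + ddl);
    }

    public static void main(String[] args) throws InterruptedException {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();
        executeDdlAndWait(session,
                "CREATE TABLE IF NOT EXISTS ks1.t1 (id uuid PRIMARY KEY, val text)");
        cluster.close();
    }
}

If I recall correctly, the driver itself already waits for schema agreement
after a DDL statement (up to maxSchemaAgreementWaitSeconds, 10 seconds by
default), which is why the fast path above usually succeeds.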

Matija

On Tue, Nov 15, 2016 at 7:32 PM, Edward Capriolo 
wrote:

> You can start here:
>
> https://issues.apache.org/jira/browse/CASSANDRA-10699
>
> And here:
>
> http://stackoverflow.com/questions/20293897/cassandra-resolution-of-concurrent-schema-changes
>
> In a nutshell, schema changes work best when issued serially, when all
> nodes are up, and when they are reachable. When these 3 conditions are not
> met, a variety of behaviors can be observed.
>
> On Tue, Nov 15, 2016 at 1:04 PM, Josh Smith 
> wrote:
>
>> Would someone please explain how schema changes happen?
>>
>> Here are some of the ring details
>>
>> We have 5 nodes in 1 DC and 5 nodes in another DC across the country.
>>
>> Here is our problem: we have a tool which automates our schema creation.
>> Our schema consists of 7 keyspaces with 21 tables in each keyspace, so a
>> total of 147 tables are created at the initial provisioning. During this
>> schema creation we end up with system_schema keyspace corruption, which we
>> have found is due to schema version disagreement. To combat this we set up
>> a wait until there is only one version in both the system.local and
>> system.peers tables.
>>
>> The way I understand it, schema changes are made on the local node only;
>> the changes are then propagated through either Thrift or Gossip (I could
>> not find a definitive answer online on whether Thrift or Gossip is the
>> carrier). So if I make all of the schema changes on one node, it should
>> propagate the changes to the other nodes one at a time. This is how I used
>> to think schema changes are propagated, but we still get schema
>> disagreement when changing the schema only on one node. Is the only option
>> to introduce a wait after every table creation? Should we be looking at
>> another table besides system.local and system.peers? Any help would be
>> appreciated.
>>
>>
>>
>> Josh Smith
>>
>
>


Re: Schema Changes

2016-11-15 Thread Edward Capriolo
You can start here:

https://issues.apache.org/jira/browse/CASSANDRA-10699

And here:

http://stackoverflow.com/questions/20293897/cassandra-resolution-of-concurrent-schema-changes

In a nutshell, schema changes work best when issued serially, when all
nodes are up, and when they are reachable. When these 3 conditions are not
met, a variety of behaviors can be observed.

On Tue, Nov 15, 2016 at 1:04 PM, Josh Smith 
wrote:

> Would someone please explain how schema changes happen?
>
> Here are some of the ring details
>
> We have 5 nodes in 1 DC and 5 nodes in another DC across the country.
>
> Here is our problem: we have a tool which automates our schema creation.
> Our schema consists of 7 keyspaces with 21 tables in each keyspace, so a
> total of 147 tables are created at the initial provisioning. During this
> schema creation we end up with system_schema keyspace corruption, which we
> have found is due to schema version disagreement. To combat this we set up
> a wait until there is only one version in both the system.local and
> system.peers tables.
>
> The way I understand it, schema changes are made on the local node only;
> the changes are then propagated through either Thrift or Gossip (I could
> not find a definitive answer online on whether Thrift or Gossip is the
> carrier). So if I make all of the schema changes on one node, it should
> propagate the changes to the other nodes one at a time. This is how I used
> to think schema changes are propagated, but we still get schema
> disagreement when changing the schema only on one node. Is the only option
> to introduce a wait after every table creation? Should we be looking at
> another table besides system.local and system.peers? Any help would be
> appreciated.
>
>
>
> Josh Smith
>


Schema Changes

2016-11-15 Thread Josh Smith
Would someone please explain how schema changes happen?
Here are some of the ring details
We have 5 nodes in 1 DC and 5 nodes in another DC across the country.
Here is our problem: we have a tool which automates our schema creation. Our
schema consists of 7 keyspaces with 21 tables in each keyspace, so a total of
147 tables are created at the initial provisioning. During this schema
creation we end up with system_schema keyspace corruption, which we have found
is due to schema version disagreement. To combat this we set up a wait until
there is only one version in both the system.local and system.peers tables.
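
(For illustration only, such a wait could look roughly like the sketch below,
using the DataStax Java Driver to read the schema_version columns directly;
the class name and 60-second timeout are made up.)

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

public class SchemaVersionWait {

    // True once the local node and all known peers report a single schema_version.
    static boolean singleSchemaVersion(Session session) {
        Set<UUID> versions = new HashSet<UUID>();
        for (Row row : session.execute("SELECT schema_version FROM system.local")) {
            versions.add(row.getUUID("schema_version"));
        }
        for (Row row : session.execute("SELECT schema_version FROM system.peers")) {
            // A peer that is down may report null; it cannot be counted as agreeing.
            versions.add(row.getUUID("schema_version"));
        }
        return versions.size() == 1;
    }

    public static void main(String[] args) throws InterruptedException {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();
        long deadline = System.currentTimeMillis() + 60_000;
        while (!singleSchemaVersion(session) && System.currentTimeMillis() < deadline) {
            Thread.sleep(1000);
        }
        cluster.close();
    }
}
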
The way I understand it, schema changes are made on the local node only; the
changes are then propagated through either Thrift or Gossip (I could not find
a definitive answer online on whether Thrift or Gossip is the carrier). So if
I make all of the schema changes on one node, it should propagate the changes
to the other nodes one at a time. This is how I used to think schema changes
are propagated, but we still get schema disagreement when changing the schema
only on one node. Is the only option to introduce a wait after every table
creation? Should we be looking at another table besides system.local and
system.peers? Any help would be appreciated.

Josh Smith


Re: Some questions to updating and tombstone

2016-11-15 Thread Fabrice Facorat
If you don't want tombstones, don't generate them ;)

More seriously, tombstones are generated when:
- doing a DELETE
- a TTL expires
- setting a column to NULL
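
As a quick illustration of those three cases (a minimal sketch using the
DataStax Java Driver; the table ks.t1 and its columns are invented for the
example):

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class TombstoneExamples {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // 1. DELETE: writes a tombstone (here a partition-level tombstone for id = 42).
        session.execute("DELETE FROM ks.t1 WHERE id = 42");

        // 2. TTL: the inserted cells become tombstones once the TTL expires.
        session.execute("INSERT INTO ks.t1 (id, val) VALUES (43, 'temp') USING TTL 3600");

        // 3. NULL: explicitly writing null creates a cell tombstone for 'val'.
        session.execute("UPDATE ks.t1 SET val = null WHERE id = 44");

        // Note: with protocol v4+ (Cassandra 2.2+), leaving a bind variable unset does not
        // write anything, so prefer unset over null when a column should not change.
        PreparedStatement ps = session.prepare("INSERT INTO ks.t1 (id, val) VALUES (?, ?)");
        BoundStatement bs = ps.bind().setLong("id", 45L); // 'val' left unset: no tombstone
        session.execute(bs);

        cluster.close();
    }
}
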

However, tombstones are an issue only if you have many tombstones for the
same value (i.e. you keep overwriting the same values with data and
tombstones). Having 1 tombstone for 1 value is not an issue; having 1000
tombstones for 1 value is a problem. Does your use case really overwrite data
with DELETE or NULL?

So what you may want to know is how many tombstones you read on average per
value. This is available in:
- nodetool cfstats ks.cf: Average tombstones per slice / Maximum tombstones
per slice
- JMX:
org.apache.cassandra.metrics:keyspace=<keyspace>,name=TombstoneScannedHistogram,scope=<table>,type=ColumnFamily
(Max/Count/99thPercentile/Mean attributes)
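
For example, reading that histogram over JMX could look roughly like this
(a sketch only; it assumes JMX on localhost:7199 with no authentication, and
a keyspace named ks with a table named cf):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class TombstoneMetrics {
    public static void main(String[] args) throws Exception {
        // Cassandra's default JMX port is 7199; adjust host/port and credentials as needed.
        JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Same MBean name pattern as above, with the keyspace/table filled in.
            ObjectName name = new ObjectName(
                    "org.apache.cassandra.metrics:type=ColumnFamily,keyspace=ks,"
                    + "scope=cf,name=TombstoneScannedHistogram");
            System.out.println("Count: " + mbs.getAttribute(name, "Count"));
            System.out.println("Mean:  " + mbs.getAttribute(name, "Mean"));
            System.out.println("Max:   " + mbs.getAttribute(name, "Max"));
            System.out.println("99th:  " + mbs.getAttribute(name, "99thPercentile"));
        } finally {
            connector.close();
        }
    }
}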


2016-11-15 10:05 GMT+01:00 Lu, Boying :

> Thanks a lot for your help.
>
>
>
> We are using STCS strategy and not using TTL
>
>
>
> Is there any API that we can use to query the current number of tombstones
> in a CF?
>
>
>
>
>
>
>
> From: Anuj Wadehra [mailto:anujw_2...@yahoo.co.in]
> Sent: 14 November 2016 22:20
> To: user@cassandra.apache.org
> Subject: Re: Some questions to updating and tombstone
>
>
>
> Hi Boying,
>
>
>
> I agree with Vladimir. If compaction does not compact the two sstables with
> the updates soon, disk space will be wasted. For example, if the updates are
> not close in time, the first update might be in a big sstable by the time
> the second update is being written to a new small sstable. STCS won't
> compact them together soon.
>
>
>
> Just adding column values with a new timestamp shouldn't create any
> tombstones. But if data is not merged for long, disk space issues may
> arise. If you are on STCS, just to get an idea of the extent of the problem
> you can run a major compaction and see the amount of disk space freed by
> that (don't do this in production, as major compaction has its own side
> effects).
>
>
>
> Which compaction strategy are you using?
>
> Are these updates done with TTL?
>
>
>
> Thanks
> Anuj
>
>
>
> On Mon, 14 Nov, 2016 at 1:54 PM, Vladimir Yudovin
>
>  wrote:
>
> Hi Boying,
>
>
>
> UPDATE writes a new value with a new timestamp. The old value is not a
> tombstone, but remains until compaction. gc_grace_period is not related to
> this.
>
>
>
> Best regards, Vladimir Yudovin,
>
>
> Winguzone - Hosted Cloud Cassandra
> Launch your cluster in minutes.
>
>
>
>
>
>  On Mon, 14 Nov 2016 03:02:21 -0500, Lu, Boying wrote:
>
>
>
> Hi, All,
>
>
>
> Will Cassandra generate a new tombstone when updating a column by
> using a CQL UPDATE statement?
>
>
>
> And is there any way to get the number of tombstones of a column family
> since we want to avoid generating
>
> too many tombstones within gc_grace_period?
>
>
>
> Thanks
>
>
>
> Boying
>
>
>
>


-- 
Close the World, Open the Net
http://www.linux-wizard.net


RE: Some questions to updating and tombstone

2016-11-15 Thread Lu, Boying
Thanks a lot for your help.

We are using STCS strategy and not using TTL

Is there any API that we can use to query the current number of tombstones in a 
CF?



From: Anuj Wadehra [mailto:anujw_2...@yahoo.co.in]
Sent: 14 November 2016 22:20
To: user@cassandra.apache.org
Subject: Re: Some questions to updating and tombstone

Hi Boying,

I agree with Vladimir. If compaction does not compact the two sstables with the
updates soon, disk space will be wasted. For example, if the updates are not
close in time, the first update might be in a big sstable by the time the
second update is being written to a new small sstable. STCS won't compact them
together soon.

Just adding column values with a new timestamp shouldn't create any tombstones.
But if data is not merged for long, disk space issues may arise. If you are on
STCS, just to get an idea of the extent of the problem you can run a major
compaction and see the amount of disk space freed by that (don't do this in
production, as major compaction has its own side effects).

Which compaction strategy are you using?
Are these updates done with TTL?

Thanks
Anuj

On Mon, 14 Nov, 2016 at 1:54 PM, Vladimir Yudovin
> wrote:
Hi Boying,

UPDATE writes a new value with a new timestamp. The old value is not a
tombstone, but remains until compaction. gc_grace_period is not related to
this.

Best regards, Vladimir Yudovin,
Winguzone - Hosted Cloud Cassandra
Launch your cluster in minutes.


 On Mon, 14 Nov 2016 03:02:21 -0500, Lu, Boying wrote:

Hi, All,

Will Cassandra generate a new tombstone when updating a column by using a
CQL UPDATE statement?

And is there any way to get the number of tombstones of a column family, since
we want to avoid generating too many tombstones within gc_grace_period?

Thanks

Boying