Re: High read latency

2015-09-26 Thread Eric Stevens
Since you have most of your reads hitting 5-8 SSTables, it's probably
related to that increasing your latency.  That makes this look like your
write workload is either overwrite-heavy or append-heavy.  Data for a
single partition key is being written to repeatedly over long time periods,
and this will definitely impact read performance.

You can enable tracing in cqlsh and run your select to see where the time
is going.
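
For reference, a minimal sketch of that in cqlsh (the keyspace name and key
values below are placeholders, not taken from your schema):

    TRACING ON;
    SELECT * FROM my_keyspace.test
      WHERE a = 50554d6e-29bb-11e5-b345-feff819cdc9f AND b = 42;
    TRACING OFF;

The trace lists each step (memtable/sstable reads, merges, tombstones read)
with its elapsed time, which usually shows where the 43ms goes.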

On Fri, Sep 25, 2015 at 3:07 PM, Jaydeep Chovatia <
chovatia.jayd...@gmail.com> wrote:

> Please find histogram attached.
>
> On Fri, Sep 25, 2015 at 12:20 PM, Ryan Svihla  wrote:
>
>> If everything is in RAM, there could be a number of issues unrelated to
>> Cassandra, such as hardware limitations or contention problems.
>> Otherwise, cell count can deeply impact reads, all-RAM or not; some of
>> this is because of the nature of GC, and some of it is the age of the
>> sstable format (which is due to be revamped in 3.0). Partition size can
>> also matter just because of physics: if one of those is a 1 GB partition,
>> the network interface can only move it back across the wire so quickly,
>> not to mention the GC issues you’d run into.
>>
>> Anyway, this is why I asked for the histograms: I wanted to get cell count
>> and partition size. I’ve seen otherwise very stout hardware get slow on
>> reads of large results because either a bottleneck was hit somewhere, or
>> the CPU got slammed with GC, or other processes running on the machine
>> were contending with Cassandra.
>>
>>
>> On Sep 25, 2015, at 12:45 PM, Jaydeep Chovatia <
>> chovatia.jayd...@gmail.com> wrote:
>>
>> I understand that, but everything is in RAM (my data dir is tmpfs) and my
>> row is not that wide (approx. less than 5 MB in size). So my question is:
>> if everything is in RAM, why does it take 43ms?
>>
>> On Fri, Sep 25, 2015 at 7:54 AM, Ryan Svihla  wrote:
>>
>>> If you run:
>>>
>>> nodetool cfhistograms <keyspace> <table>
>>>
>>> on the given table, it will tell you how wide your rows are getting. At
>>> some point rows get wide enough that just the physics of retrieving them
>>> all takes some time.
>>>
>>>
>>> On Sep 25, 2015, at 9:21 AM, sai krishnam raju potturi <
>>> pskraj...@gmail.com> wrote:
>>>
>>> Jaydeep; since your primary key involves a clustering column, you may
>>> have pretty wide rows. The read would be sequential, and the latency
>>> could be acceptable if the read involves really wide rows.
>>>
>>> If your primary key were just ((a, b)) without the clustering column, it
>>> would be like reading a key-value pair, and 40ms latency would be a
>>> concern.
>>>
>>> Bottom line: the latency depends on how wide the row is.
>>>
>>> On Tue, Sep 22, 2015 at 1:27 PM, sai krishnam raju potturi <
>>> pskraj...@gmail.com> wrote:
>>>
 thanks for the information. Posting the query too would be of help.

 On Tue, Sep 22, 2015 at 11:56 AM, Jaydeep Chovatia <
 chovatia.jayd...@gmail.com> wrote:

> Please find required details here:
>
> -  Number of req/s
>
> 2k reads/s
>
> -  Schema details
>
> create table test (
>     a timeuuid,
>     b bigint,
>     c int,
>     d int static,
>     e int static,
>     f int static,
>     g int static,
>     h int,
>     i text,
>     j text,
>     k text,
>     l text,
>     m set,
>     n bigint,
>     o bigint,
>     p bigint,
>     q bigint,
>     r int,
>     s text,
>     t bigint,
>     u text,
>     v text,
>     w text,
>     x bigint,
>     y bigint,
>     z bigint,
>     primary key ((a, b), c)
> );
>
> -  JVM settings about the heap
>
> Default settings
>
> -  Execution time of the GC
>
> Avg. 400ms. I do not see long pauses of GC anywhere in the log file.
>
> On Tue, Sep 22, 2015 at 5:34 AM, Leleu Eric 
> wrote:
>
>> Hi,
>>
>>
>>
>>
>>
>> Before speaking about tuning, can you provide some additional
>> information ?
>>
>>
>>
>> -  Number of req/s
>>
>> -  Schema details
>>
>> -  JVM settings about the heap
>>
>> -  Execution time of the GC
>>
>>
>>
>> A 43ms read latency may be acceptable depending on the number of
>> requests per second.
>>
>>
>>
>>
>>
>> Eric
>>
>>
>>
>> *From:* Jaydeep Chovatia [mailto:chovatia.jayd...@gmail.com]
>> *Sent:* Tuesday, September 22, 2015 00:07
>> *To:* user@cassandra.apache.org
>> *Subject:* High read latency
>>
>>
>>
>> Hi,
>>
>>
>>
>> My application issues more read requests than writes. I do see that
>> under load, cfstats for one of the tables is quite 

Re: High read latency

2015-09-26 Thread Laing, Michael
Maybe compaction not keeping up - since you are hitting so many sstables?

Read heavy... are you using LCS?

Plenty of resources... tune to increase memtable size?
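
For the record, a rough sketch of the knobs I have in mind (table name and
numbers are placeholders, not recommendations):

    ALTER TABLE my_keyspace.test
      WITH compaction = {'class': 'LeveledCompactionStrategy',
                         'sstable_size_in_mb': 160};

    # cassandra.yaml (2.1-era names), to give memtables more room before flushing:
    # memtable_heap_space_in_mb: 4096
    # memtable_flush_writers: 4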


compaction became super slow after interrupted repair

2015-09-26 Thread Michał Łowicki
Hi,

We're running a C* 2.1.8 cluster in two data centers with 6 nodes each. I've
started running repair sequentially, one node at a time (`nodetool repair
--parallel --in-local-dc`).

While repair is running, the number of SSTables grows rapidly, as do the
pending compaction tasks. That's normally fine, as a node usually recovers
within a couple of hours after finishing repair (
https://www.dropbox.com/s/xzcndf5596mq7rm/Screenshot%202015-09-26%2016.17.44.png?dl=0).
One experiment showed that increasing compaction throughput and the number
of compactors mitigates the problem.
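
For reference, these are the knobs I mean (the values here are examples, not
recommendations):

    nodetool setcompactionthroughput 0    # 0 = unthrottled; set it back afterwards
    # and in cassandra.yaml (restart needed on 2.1):
    # concurrent_compactors: 8
    # compaction_throughput_mb_per_sec: 64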

Unfortunately, one node didn't recover... (
https://www.dropbox.com/s/nphnsaf2rbfm0bq/Screenshot%202015-09-26%2016.20.56.png?dl=0).
I had to interrupt the repair as the node was running out of disk space. I
hoped the node would catch up with compaction within a couple of hours, but
it hasn't happened even after 5 days.

I've tried increasing throughput, disabling throttling, increasing the
number of compactors, disabling binary / thrift / gossip, increasing the
heap size, and restarting, but compaction is still super slow.
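
For completeness, the standard ways to watch compaction progress (nothing
exotic, just nodetool):

    nodetool compactionstats      # pending tasks plus progress of running compactions
    nodetool compactionhistory    # what finished recently and how long it took
    nodetool tpstats              # CompactionExecutor pending/blocked counts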

Tried today to run scrub:

root@db2:~# nodetool scrub sync

Aborted scrubbing atleast one column family in keyspace sync, check server
logs for more information.

error: nodetool failed, check server logs

-- StackTrace --

java.lang.RuntimeException: nodetool failed, check server logs

at
org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:290)

at org.apache.cassandra.tools.NodeTool.main(NodeTool.java:202)

as well as cleanup:

root@db2:~# nodetool cleanup

Aborted cleaning up atleast one column family in keyspace sync, check
server logs for more information.

error: nodetool failed, check server logs

-- StackTrace --

java.lang.RuntimeException: nodetool failed, check server logs

at
org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:290)

at org.apache.cassandra.tools.NodeTool.main(NodeTool.java:202)

Couldn't find anything in logs regarding these runtime exceptions (see log
here - https://www.dropbox.com/s/flmii7fgpyp07q2/db2.lati.system.log?dl=0).

Note that I'm experiencing CASSANDRA-9935 while running repair on each node
in the cluster.

Any help will be much appreciated.

-- 
BR,
Michał Łowicki


Using inline JSON is 2-3x faster than using many columns (>20)

2015-09-26 Thread Kevin Burton
I wanted to share this with the community in the hopes that it might help
someone with their schema design.

I didn't get any red flags early on about limiting the number of columns we
use.  If anything, the community pushes for dynamic schemas because
Cassandra has super nice online ALTER TABLE.

However, in practice we've found that Cassandra started to use a LOT more
CPU than anything else in our stack.

Including Elasticsearch.  ES uses about 8% of our total CPU whereas
Cassandra uses about 70% of it. It's not an apples-to-oranges comparison,
mind you, but Cassandra definitely warrants some attention in this scenario.

I put Cassandra into a profiler (Java Mission Control) to see if anything
weird was happening and didn't see any red flags.

There were some issues with CAS, so I rewrote that to do a query before the
CAS operation: we first check whether the row is already there, then use a
CAS only if it's missing. That was a BIG performance bump; it probably
reduced our C* usage by 40%.
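
Roughly, the pattern looks like this (the table and column names are made up
for illustration, not our actual schema):

    -- cheap existence check first, at normal consistency
    SELECT item_id FROM content_store WHERE bucket = ? AND item_id = ?;

    -- only if nothing came back, pay for the Paxos round
    INSERT INTO content_store (bucket, item_id, data_format, data_blob)
    VALUES (?, ?, 'json', ?)
    IF NOT EXISTS;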

However, I started to speculate that it might be getting overwhelmed with
the raw numbers of rows.

I fired up cassandra-stress to verify, and basically split the test into 10
columns of 150 bytes each versus 150 columns of 10 bytes each.
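
Roughly along these lines (the profile names are mine, and the 10x150 vs
150x10 column layouts live in each profile's columnspec):

    cassandra-stress user profile=cols10x150.yaml "ops(insert=1)" n=500000 -rate threads=50
    cassandra-stress user profile=cols150x10.yaml "ops(insert=1)" n=500000 -rate threads=50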

In this synthetic benchmark C* was actually 5-6x faster for the run with 10
columns.

So this tentatively confirmed my hypothesis.

So I decided to get a bit more aggressive and tried to test it with a less
synthetic benchmark.

I wrote my own benchmark which uses our own schema in two forms.

INLINE_ONLY: 150 columns...
DATA_ONLY: 4 columns (two primary key columns, one data_format column, and
one data_blob column)
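
For the curious, the DATA_ONLY shape is roughly this (data_format/data_blob
are real, the other names and the types are my placeholders):

    CREATE TABLE content_store (
        bucket      bigint,
        item_id     timeuuid,
        data_format text,   -- e.g. 'json'
        data_blob   text,   -- the whole record serialized as JSON
        PRIMARY KEY ((bucket, item_id))
    );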

It creates T threads, writes W rows, then reads R rows.

I set T=50, W=50,000, R=50,000

It does a write pass, then a read pass.  I didn't implement a mixed
workload, though; I don't think those results would matter as much.

The results were similarly impressive, though not as dramatic as the
synthetic benchmark above: it was 2x faster (6 minutes vs. 3 minutes).

In the INLINE_ONLY benchmark, C* spends 70% of the time at high CPU.  In
DATA_ONLY it's about 50/50.

I think we're going to move to this model and re-write all our C* tables
to support this inline JSON.

The second benchmark was under 2.0.16... (our production version).  The
cassandra-stress run was under the 3.0 beta, as I wanted to see if a later
version of Cassandra fixed the problem. It doesn't.

This was done on a 128GB box with two Samsung SSDs in RAID0.  I didn't test
it with any replicas.

This brings up some interesting issues:

- It's still interesting that C* spends as much time as it does under high
CPU load.  I'd like to profile it again.

- Looks like there's room for improvement in the JSON encoder/decoder.  I'm
not sure how much we would see, though, because it's already using the
latest Jackson, which I've tuned significantly.  I might be able to get
some more performance out of it by reducing garbage collection.

- A later C* might improve our CPU usage regardless, so upgrading our
Cassandra might be something we do anyway.



-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile