Re: Ring connection timeouts with 2.2.6

2016-07-23 Thread Mike Heffner
Garo,

No, we didn't notice any change in system load, just the expected spike in
packet counts.

Mike

On Wed, Jul 20, 2016 at 3:49 PM, Juho Mäkinen <juho.maki...@gmail.com>
wrote:

> Just to pick this up: Did you see any system load spikes? I'm tracing a
> problem on 2.2.7 where my cluster sees load spikes up to 20-30, when the
> normal average load is around 3-4. So far I haven't found any good reason,
> but I'm going to try otc_coalescing_strategy: disabled tomorrow.
>
>  - Garo
>
> On Fri, Jul 15, 2016 at 6:16 PM, Mike Heffner <m...@librato.com> wrote:
>
>> Just to follow up on this post with a couple more data points:
>>
>> 1)
>>
>> We upgraded to 2.2.7 and did not see any change in behavior.
>>
>> 2)
>>
>> However, what *has* fixed this issue for us was disabling msg coalescing
>> by setting:
>>
>> otc_coalescing_strategy: DISABLED
>>
>> We were using the default setting before (time horizon I believe).
>>
>> We see periodic timeouts on the ring (once every few hours), but they are
>> brief and don't impact latency. With msg coalescing turned on we would see
>> these timeouts persist consistently after an initial spike. My guess is
>> that something in the coalescing logic is disturbed by the initial timeout
>> spike, which leads to dropping all, or a high percentage of, subsequent
>> traffic.
>>
>> We are planning to continue production use with msg coalescing disabled
>> for now and may run tests in our staging environments to identify where
>> the coalescing breaks down.
>>
>> Mike
>>
>> On Tue, Jul 5, 2016 at 12:14 PM, Mike Heffner <m...@librato.com> wrote:
>>
>>> Jeff,
>>>
>>> Thanks, yeah we updated to the 2.16.4 driver version from source. I
>>> don't believe we've hit the bugs mentioned in earlier driver versions.
>>>
>>> Mike
>>>
>>> On Mon, Jul 4, 2016 at 11:16 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com>
>>> wrote:
>>>
>>>> AWS ubuntu 14.04 AMI ships with buggy enhanced networking driver –
>>>> depending on your instance types / hypervisor choice, you may want to
>>>> ensure you’re not seeing that bug.
>>>>
>>>>
>>>>
>>>> *From: *Mike Heffner <m...@librato.com>
>>>> *Reply-To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
>>>> *Date: *Friday, July 1, 2016 at 1:10 PM
>>>> *To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
>>>> *Cc: *Peter Norton <p...@librato.com>
>>>> *Subject: *Re: Ring connection timeouts with 2.2.6
>>>>
>>>>
>>>>
>>>> Jens,
>>>>
>>>>
>>>>
>>>> We haven't noticed any particular large GC operations or even
>>>> persistently high GC times.
>>>>
>>>>
>>>>
>>>> Mike
>>>>
>>>>
>>>>
>>>> On Thu, Jun 30, 2016 at 3:20 AM, Jens Rantil <jens.ran...@tink.se>
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Could it be garbage collection occurring on nodes that are more heavily
>>>> loaded?
>>>>
>>>> Cheers,
>>>> Jens
>>>>
>>>>
>>>>
>>>> On Sun, 26 Jun 2016 at 05:22, Mike Heffner <m...@librato.com> wrote:
>>>>
>>>> One thing to add: if we do a rolling restart of the ring, the timeouts
>>>> disappear entirely for several hours and performance returns to normal.
>>>> It's as if something is leaking over time, but we haven't seen any
>>>> noticeable change in heap.
>>>>
>>>>
>>>>
>>>> On Thu, Jun 23, 2016 at 10:38 AM, Mike Heffner <m...@librato.com>
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> We have a 12 node 2.2.6 ring running in AWS, single DC with RF=3, that
>>>> is sitting at <25% CPU, doing mostly writes, and not showing any particular
>>>> long GC times/pauses. By all observed metrics the ring is healthy and
>>>> performing well.
>>>>
>>>>
>>>>
>>>> However, we are noticing a pretty consistent number of connection
>>>> timeouts coming from the messaging service between various pairs of nodes
>> in the ring. The "Connection.TotalTimeouts" meter metric shows 100k's of

Re: Ring connection timeouts with 2.2.6

2016-07-15 Thread Mike Heffner
Just to follow up on this post with a couple more data points:

1)

We upgraded to 2.2.7 and did not see any change in behavior.

2)

However, what *has* fixed this issue for us was disabling msg coalescing by
setting:

otc_coalescing_strategy: DISABLED

We were using the default setting before (time horizon I believe).
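
For anyone who wants to try the same change, this is a minimal sketch of the
relevant cassandra.yaml settings (we only changed the strategy; the window
value shown is what we believe is the stock 2.2 default):

# outbound message coalescing: TIMEHORIZON (default), MOVINGAVERAGE, FIXED or DISABLED
otc_coalescing_strategy: DISABLED
# window considered when a coalescing strategy is enabled
otc_coalescing_window_us: 200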

We see periodic timeouts on the ring (once every few hours), but they are
brief and don't impact latency. With msg coalescing turned on we would see
these timeouts persist consistently after an initial spike. My guess is
that something in the coalescing logic is disturbed by the initial timeout
spike, which leads to dropping all, or a high percentage of, subsequent
traffic.

We are planning to continue production use with msg coalescing disabled for
now and may run tests in our staging environments to identify where the
coalescing breaks down.

Mike

On Tue, Jul 5, 2016 at 12:14 PM, Mike Heffner <m...@librato.com> wrote:

> Jeff,
>
> Thanks, yeah we updated to the 2.16.4 driver version from source. I don't
> believe we've hit the bugs mentioned in earlier driver versions.
>
> Mike
>
> On Mon, Jul 4, 2016 at 11:16 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com>
> wrote:
>
>> AWS ubuntu 14.04 AMI ships with buggy enhanced networking driver –
>> depending on your instance types / hypervisor choice, you may want to
>> ensure you’re not seeing that bug.
>>
>>
>>
>> *From: *Mike Heffner <m...@librato.com>
>> *Reply-To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
>> *Date: *Friday, July 1, 2016 at 1:10 PM
>> *To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
>> *Cc: *Peter Norton <p...@librato.com>
>> *Subject: *Re: Ring connection timeouts with 2.2.6
>>
>>
>>
>> Jens,
>>
>>
>>
>> We haven't noticed any particular large GC operations or even
>> persistently high GC times.
>>
>>
>>
>> Mike
>>
>>
>>
>> On Thu, Jun 30, 2016 at 3:20 AM, Jens Rantil <jens.ran...@tink.se> wrote:
>>
>> Hi,
>>
>> Could it be garbage collection occurring on nodes that are more heavily
>> loaded?
>>
>> Cheers,
>> Jens
>>
>>
>>
>> On Sun, 26 Jun 2016 at 05:22, Mike Heffner <m...@librato.com> wrote:
>>
>> One thing to add: if we do a rolling restart of the ring, the timeouts
>> disappear entirely for several hours and performance returns to normal.
>> It's as if something is leaking over time, but we haven't seen any
>> noticeable change in heap.
>>
>>
>>
>> On Thu, Jun 23, 2016 at 10:38 AM, Mike Heffner <m...@librato.com> wrote:
>>
>> Hi,
>>
>>
>>
>> We have a 12 node 2.2.6 ring running in AWS, single DC with RF=3, that is
>> sitting at <25% CPU, doing mostly writes, and not showing any particular
>> long GC times/pauses. By all observed metrics the ring is healthy and
>> performing well.
>>
>>
>>
>> However, we are noticing a pretty consistent number of connection
>> timeouts coming from the messaging service between various pairs of nodes
>> in the ring. The "Connection.TotalTimeouts" meter metric shows 100k's of
>> timeouts per minute, usually between two pairs of nodes. It seems to occur
>> for several hours at a time, then may stop or move to other pairs of nodes
>> in the ring. The metric
>> "Connection.SmallMessageDroppedTasks." will also grow for one pair of
>> the nodes in the TotalTimeouts metric.
>>
>>
>>
>> Looking at the debug log typically shows a large number of messages like
>> the following on one of the nodes:
>>
>>
>>
>> StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 (ttl 0)
>>
>> We have cross node timeouts enabled, but ntp is running on all nodes and
>> no node appears to have time drift.
>>
>>
>>
>> The network appears to be fine between nodes, with iperf tests showing
>> that we have a lot of headroom.
>>
>>
>>
>> Any thoughts on what to look for? Can we increase thread count/pool sizes
>> for the messaging service?
>>
>>
>>
>> Thanks,
>>
>>
>>
>> Mike
>>
>>
>>
>> --
>>
>>
>>   Mike Heffner <m...@librato.com>
>>
>>   Librato, Inc.
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>>
>>   Mike Heffner <m...@librato.com>
>>
>>   Librato, Inc.
>>
>>
>>
>> --
>>
>> Jens Rantil
>> Backend Developer @ Tink
>>
>> Tink AB, Wallingatan 5, 111 60 Stockholm, Sweden
>> For urgent matters you can reach me at +46-708-84 18 32.
>>
>>
>>
>>
>>
>> --
>>
>>
>>   Mike Heffner <m...@librato.com>
>>
>>   Librato, Inc.
>>
>>
>>
>
>
>
> --
>
>   Mike Heffner <m...@librato.com>
>   Librato, Inc.
>
>


-- 

  Mike Heffner <m...@librato.com>
  Librato, Inc.


Re: Ring connection timeouts with 2.2.6

2016-06-25 Thread Mike Heffner
One thing to add: if we do a rolling restart of the ring, the timeouts
disappear entirely for several hours and performance returns to normal.
It's as if something is leaking over time, but we haven't seen any
noticeable change in heap.

On Thu, Jun 23, 2016 at 10:38 AM, Mike Heffner <m...@librato.com> wrote:

> Hi,
>
> We have a 12 node 2.2.6 ring running in AWS, single DC with RF=3, that is
> sitting at <25% CPU, doing mostly writes, and not showing any particular
> long GC times/pauses. By all observed metrics the ring is healthy and
> performing well.
>
> However, we are noticing a pretty consistent number of connection timeouts
> coming from the messaging service between various pairs of nodes in the
> ring. The "Connection.TotalTimeouts" meter metric shows 100k's of timeouts
> per minute, usually between two pairs of nodes. It seems to occur for
> several hours at a time, then may stop or move to other pairs of nodes in
> the ring. The metric
> "Connection.SmallMessageDroppedTasks." will also grow for one pair of
> the nodes in the TotalTimeouts metric.
>
> Looking at the debug log typically shows a large number of messages like
> the following on one of the nodes:
>
> StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 (ttl 0)
>
> We have cross node timeouts enabled, but ntp is running on all nodes and
> no node appears to have time drift.
>
> The network appears to be fine between nodes, with iperf tests showing
> that we have a lot of headroom.
>
> Any thoughts on what to look for? Can we increase thread count/pool sizes
> for the messaging service?
>
> Thanks,
>
> Mike
>
> --
>
>   Mike Heffner <m...@librato.com>
>   Librato, Inc.
>
>


-- 

  Mike Heffner <m...@librato.com>
  Librato, Inc.


Ring connection timeouts with 2.2.6

2016-06-23 Thread Mike Heffner
Hi,

We have a 12 node 2.2.6 ring running in AWS, single DC with RF=3, that is
sitting at <25% CPU, doing mostly writes, and not showing any particular
long GC times/pauses. By all observed metrics the ring is healthy and
performing well.

However, we are noticing a pretty consistent number of connection timeouts
coming from the messaging service between various pairs of nodes in the
ring. The "Connection.TotalTimeouts" meter metric show 100k's of timeouts
per minute, usually between two pairs of nodes for several hours at a time.
It seems to occur for several hours at a time, then may stop or move to
other pairs of nodes in the ring. The metric
"Connection.SmallMessageDroppedTasks." will also grow for one pair of
the nodes in the TotalTimeouts metric.

Looking at the debug log typically shows a large number of messages like
the following on one of the nodes:

StorageProxy.java:1033 - Skipped writing hint for /172.26.33.177 (ttl 0)

We have cross node timeouts enabled, but ntp is running on all nodes and no
node appears to have time drift.

The network appears to be fine between nodes, with iperf tests showing that
we have a lot of headroom.

Any thoughts on what to look for? Can we increase thread count/pool sizes
for the messaging service?

Thanks,

Mike

-- 

  Mike Heffner <m...@librato.com>
  Librato, Inc.


Re: Consistent read timeouts for bursts of reads

2016-03-04 Thread Mike Heffner
Emils,

We believe we've tracked it down to the following issue:
https://issues.apache.org/jira/browse/CASSANDRA-11302, introduced in 2.1.5.

We are running a build of 2.2.5 with that patch and so far have not seen
any more timeouts.

Mike

On Fri, Mar 4, 2016 at 3:14 AM, Emīls Šolmanis <emils.solma...@gmail.com>
wrote:

> Mike,
>
> Is that where you've bisected it to having been introduced?
>
> I'll see what I can do, but doubt it, since we've long since upgraded prod
> to 2.2.4 (and stage before that) and the tests I'm running were for a new
> feature.
>
>
> On Fri, 4 Mar 2016 03:54 Mike Heffner, <m...@librato.com> wrote:
>
>> Emils,
>>
>> I realize this may be a big downgrade, but are your timeouts reproducible
>> under Cassandra 2.1.4?
>>
>> Mike
>>
>> On Thu, Feb 25, 2016 at 10:34 AM, Emīls Šolmanis <
>> emils.solma...@gmail.com> wrote:
>>
>>> Having had a read through the archives, I missed this at first, but this
>>> seems to be *exactly* like what we're experiencing.
>>>
>>> http://www.mail-archive.com/user@cassandra.apache.org/msg46064.html
>>>
>>> Only difference is we're getting this for reads and using CQL, but the
>>> behaviour is identical.
>>>
>>> On Thu, 25 Feb 2016 at 14:55 Emīls Šolmanis <emils.solma...@gmail.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> We're having a problem with concurrent requests. It seems that whenever
>>>> we try resolving more
>>>> than ~ 15 queries at the same time, one or two get a read timeout and
>>>> then succeed on a retry.
>>>>
>>>> We're running Cassandra 2.2.4 accessed via the 2.1.9 Datastax driver on
>>>> AWS.
>>>>
>>>> What we've found while investigating:
>>>>
>>>>  * this is not db-wide. Trying the same pattern against another table
>>>> everything works fine.
>>>>  * it fails 1 or 2 requests regardless of how many are executed in
>>>> parallel, i.e., it's still 1 or 2 when we ramp it up to ~ 120 concurrent
>>>> requests and doesn't seem to scale up.
>>>>  * the problem is consistently reproducible. It happens both under
>>>> heavier load and when just firing off a single batch of requests for
>>>> testing.
>>>>  * tracing the faulty requests says everything is great. An example
>>>> trace: https://gist.github.com/emilssolmanis/41e1e2ecdfd9a0569b1a
>>>>  * the only peculiar thing in the logs is there's no acknowledgement of
>>>> the request being accepted by the server, as seen in
>>>> https://gist.github.com/emilssolmanis/242d9d02a6d8fb91da8a
>>>>  * there's nothing funny in the timed out Cassandra node's logs around
>>>> that time as far as I can tell, not even in the debug logs.
>>>>
>>>> Any ideas about what might be causing this, pointers to server config
>>>> options, or how else we might debug this would be much appreciated.
>>>>
>>>> Kind regards,
>>>> Emils
>>>>
>>>>
>>
>>
>> --
>>
>>   Mike Heffner <m...@librato.com>
>>   Librato, Inc.
>>
>>


-- 

  Mike Heffner <m...@librato.com>
  Librato, Inc.


Re: Consistent read timeouts for bursts of reads

2016-03-03 Thread Mike Heffner
Emils,

I realize this may be a big downgrade, but are your timeouts reproducible
under Cassandra 2.1.4?

Mike

On Thu, Feb 25, 2016 at 10:34 AM, Emīls Šolmanis <emils.solma...@gmail.com>
wrote:

> Having had a read through the archives, I missed this at first, but this
> seems to be *exactly* like what we're experiencing.
>
> http://www.mail-archive.com/user@cassandra.apache.org/msg46064.html
>
> Only difference is we're getting this for reads and using CQL, but the
> behaviour is identical.
>
> On Thu, 25 Feb 2016 at 14:55 Emīls Šolmanis <emils.solma...@gmail.com>
> wrote:
>
>> Hello,
>>
>> We're having a problem with concurrent requests. It seems that whenever
>> we try resolving more
>> than ~ 15 queries at the same time, one or two get a read timeout and
>> then succeed on a retry.
>>
>> We're running Cassandra 2.2.4 accessed via the 2.1.9 Datastax driver on
>> AWS.
>>
>> What we've found while investigating:
>>
>>  * this is not db-wide. Trying the same pattern against another table
>> everything works fine.
>>  * it fails 1 or 2 requests regardless of how many are executed in
>> parallel, i.e., it's still 1 or 2 when we ramp it up to ~ 120 concurrent
>> requests and doesn't seem to scale up.
>>  * the problem is consistently reproducible. It happens both under
>> heavier load and when just firing off a single batch of requests for
>> testing.
>>  * tracing the faulty requests says everything is great. An example
>> trace: https://gist.github.com/emilssolmanis/41e1e2ecdfd9a0569b1a
>>  * the only peculiar thing in the logs is there's no acknowledgement of
>> the request being accepted by the server, as seen in
>> https://gist.github.com/emilssolmanis/242d9d02a6d8fb91da8a
>>  * there's nothing funny in the timed out Cassandra node's logs around
>> that time as far as I can tell, not even in the debug logs.
>>
>> Any ideas about what might be causing this, pointers to server config
>> options, or how else we might debug this would be much appreciated.
>>
>> Kind regards,
>> Emils
>>
>>


-- 

  Mike Heffner <m...@librato.com>
  Librato, Inc.


Re: Debugging write timeouts on Cassandra 2.2.5

2016-02-24 Thread Mike Heffner
Nate,

So we have run several install tests, bisecting the 2.1.x release line, and
we believe that the regression was introduced in version 2.1.5. This is the
first release that clearly hits the timeout for us.

It looks like quite a large release, so our next step will likely be
bisecting the major commits to see if we can narrow it down:
https://github.com/apache/cassandra/blob/3c0a337ebc90b0d99349d0aa152c92b5b3494d8c/CHANGES.txt.
Obviously, any suggestions on potential suspects appreciated.

These are the memtable settings we've configured diff from the defaults
during our testing:

memtable_allocation_type: offheap_objects
memtable_flush_writers: 8
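
For completeness, here is a sketch of the wider set of cassandra.yaml knobs
Nate lists below, with his suggested starting points for a c4.2xl (we have not
adopted all of these values ourselves):

memtable_flush_writers: 2
memtable_heap_space_in_mb: 2048
memtable_offheap_space_in_mb: 2048
memtable_cleanup_threshold: 0.11
memtable_allocation_type: offheap_objects
# also worth setting explicitly, per the native protocol v3 notes below;
# we believe -1 is the shipped default (no limit)
native_transport_max_concurrent_connections: -1
native_transport_max_concurrent_connections_per_ip: -1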


Cheers,

Mike

On Fri, Feb 19, 2016 at 1:46 PM, Nate McCall <n...@thelastpickle.com> wrote:

> The biggest change which *might* explain your behavior has to do with the
> changes in memtable flushing between 2.0 and 2.1:
> https://issues.apache.org/jira/browse/CASSANDRA-5549
>
> However, the tpstats you posted shows no dropped mutations which would
> make me more certain of this as the cause.
>
> What values do you have right now for each of these (my recommendations
> for each on a c4.2xl with stock cassandra-env.sh are in parenthesis):
>
> - memtable_flush_writers (2)
> - memtable_heap_space_in_mb  (2048)
> - memtable_offheap_space_in_mb (2048)
> - memtable_cleanup_threshold (0.11)
> - memtable_allocation_type (offheap_objects)
>
> The biggest win IMO will be moving to offheap_objects. By default,
> everything is on heap. Regardless, spending some time tuning these for your
> workload will pay off.
>
> You may also want to be explicit about
>
> - native_transport_max_concurrent_connections
> - native_transport_max_concurrent_connections_per_ip
>
> Depending on the driver, these may now be allowing 32k streams per
> connection(!) as detailed in v3 of the native protocol:
>
> https://github.com/apache/cassandra/blob/cassandra-2.1/doc/native_protocol_v3.spec#L130-L152
>
>
>
> On Fri, Feb 19, 2016 at 8:48 AM, Mike Heffner <m...@librato.com> wrote:
>
>> Anuj,
>>
>> So we originally started testing with Java8 + G1, however we were able to
>> reproduce the same results with the default CMS settings that ship in the
>> cassandra-env.sh from the Deb pkg. We didn't detect any large GC pauses
>> during the runs.
>>
>> Query pattern during our testing was 100% writes, batching (via Thrift
>> mostly) to 5 tables, between 6-1500 rows per batch.
>>
>> Mike
>>
>> On Thu, Feb 18, 2016 at 12:22 PM, Anuj Wadehra <anujw_2...@yahoo.co.in>
>> wrote:
>>
>>> What's the GC overhead? Can you share your GC collector and settings?
>>>
>>>
>>> What's your query pattern? Do you use secondary indexes, batches, IN
>>> clause etc?
>>>
>>>
>>> Anuj
>>>
>>>
>>> Sent from Yahoo Mail on Android
>>> <https://overview.mail.yahoo.com/mobile/?.src=Android>
>>>
>>> On Thu, 18 Feb, 2016 at 8:45 pm, Mike Heffner
>>> <m...@librato.com> wrote:
>>> Alain,
>>>
>>> Thanks for the suggestions.
>>>
>>> Sure, tpstats are here:
>>> https://gist.github.com/mheffner/a979ae1a0304480b052a. Looking at the
>>> metrics across the ring, there were no blocked tasks nor dropped messages.
>>>
>>> Iowait metrics look fine, so it doesn't appear to be blocking on disk.
>>> Similarly, there are no long GC pauses.
>>>
>>> We haven't noticed latency on any particular table higher than others or
>>> correlated around the occurrence of a timeout. We have noticed with further
>>> testing that running cassandra-stress against the ring, while our workload
>>> is writing to the same ring, will incur similar 10 second timeouts. If our
>>> workload is not writing to the ring, cassandra stress will run without
>>> hitting timeouts. This seems to imply that our workload pattern is causing
>>> something to block cluster-wide, since the stress tool writes to a
>>> different keyspace than our workload.
>>>
>>> I mentioned in another reply that we've tracked it to something between
>>> 2.0.x and 2.1.x, so we are focusing on narrowing which point release it was
>>> introduced in.
>>>
>>> Cheers,
>>>
>>> Mike
>>>
>>> On Thu, Feb 18, 2016 at 3:33 AM, Alain RODRIGUEZ <arodr...@gmail.com>
>>> wrote:
>>>
>>>> Hi Mike,
>>>>
>>>> What about the output of tpstats ? I imagine you have dropped messages
>>>> there. Any blocked threads ? Could you past

Re: Debugging write timeouts on Cassandra 2.2.5

2016-02-19 Thread Mike Heffner
Anuj,

So we originally started testing with Java8 + G1, however we were able to
reproduce the same results with the default CMS settings that ship in the
cassandra-env.sh from the Deb pkg. We didn't detect any large GC pauses
during the runs.

Query pattern during our testing was 100% writes, batching (via Thrift
mostly) to 5 tables, between 6-1500 rows per batch.

Mike

On Thu, Feb 18, 2016 at 12:22 PM, Anuj Wadehra <anujw_2...@yahoo.co.in>
wrote:

> What's the GC overhead? Can you share your GC collector and settings?
>
>
> What's your query pattern? Do you use secondary indexes, batches, IN clause
> etc?
>
>
> Anuj
>
>
> Sent from Yahoo Mail on Android
> <https://overview.mail.yahoo.com/mobile/?.src=Android>
>
> On Thu, 18 Feb, 2016 at 8:45 pm, Mike Heffner
> <m...@librato.com> wrote:
> Alain,
>
> Thanks for the suggestions.
>
> Sure, tpstats are here:
> https://gist.github.com/mheffner/a979ae1a0304480b052a. Looking at the
> metrics across the ring, there were no blocked tasks nor dropped messages.
>
> Iowait metrics look fine, so it doesn't appear to be blocking on disk.
> Similarly, there are no long GC pauses.
>
> We haven't noticed latency on any particular table higher than others or
> correlated around the occurrence of a timeout. We have noticed with further
> testing that running cassandra-stress against the ring, while our workload
> is writing to the same ring, will incur similar 10 second timeouts. If our
> workload is not writing to the ring, cassandra stress will run without
> hitting timeouts. This seems to imply that our workload pattern is causing
> something to block cluster-wide, since the stress tool writes to a
> different keyspace than our workload.
>
> I mentioned in another reply that we've tracked it to something between
> 2.0.x and 2.1.x, so we are focusing on narrowing which point release it was
> introduced in.
>
> Cheers,
>
> Mike
>
> On Thu, Feb 18, 2016 at 3:33 AM, Alain RODRIGUEZ <arodr...@gmail.com>
> wrote:
>
>> Hi Mike,
>>
>> What about the output of tpstats ? I imagine you have dropped messages
>> there. Any blocked threads ? Could you paste this output here ?
>>
>> May this be due to some network hiccup to access the disks as they are
>> EBS ? Can you think of anyway of checking this ? Do you have a lot of GC
>> logs, how long are the pauses (use something like: grep -i 'GCInspector'
>> /var/log/cassandra/system.log) ?
>>
>> Something else you could check is local_writes stats to see if only one
>> table is affected or if this is keyspace / cluster wide. You can use metrics
>> exposed by cassandra or if you have no dashboards I believe a: 'nodetool
>> cfstats  | grep -e 'Table:' -e 'Local'' should give you a rough idea
>> of local latencies.
>>
>> Those are just things I would check, I have not a clue on what is
>> happening here, hope this will help.
>>
>> C*heers,
>> -
>> Alain Rodriguez
>> France
>>
>> The Last Pickle
>> http://www.thelastpickle.com
>>
>> 2016-02-18 5:13 GMT+01:00 Mike Heffner <m...@librato.com>:
>>
>>> Jaydeep,
>>>
>>> No, we don't use any light weight transactions.
>>>
>>> Mike
>>>
>>> On Wed, Feb 17, 2016 at 6:44 PM, Jaydeep Chovatia <
>>> chovatia.jayd...@gmail.com> wrote:
>>>
>>>> Are you guys using light weight transactions in your write path?
>>>>
>>>> On Thu, Feb 11, 2016 at 12:36 AM, Fabrice Facorat <
>>>> fabrice.faco...@gmail.com> wrote:
>>>>
>>>>> Are your commitlog and data on the same disk? If yes, you should put
>>>>> commitlogs on a separate disk which doesn't have a lot of IO.
>>>>>
>>>>> Other IO may have a great impact on your commitlog writing and
>>>>> may even block it.
>>>>>
>>>>> An example of impact IO may have, even for Async writes:
>>>>>
>>>>> https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic
>>>>>
>>>>> 2016-02-11 0:31 GMT+01:00 Mike Heffner <m...@librato.com>:
>>>>> > Jeff,
>>>>> >
>>>>> > We have both commitlog and data on a 4TB EBS with 10k IOPS.
>>>>> >
>>>>> > Mike
>>>>> >
>>>>> > On Wed, Feb 10, 2016 at 5:28 PM, Jeff Jirsa <
>>>>> jeff.ji...@crowdstrike.com>
>>>>> > wrote:
>>>>> >

Re: Debugging write timeouts on Cassandra 2.2.5

2016-02-18 Thread Mike Heffner
Alain,

Thanks for the suggestions.

Sure, tpstats are here:
https://gist.github.com/mheffner/a979ae1a0304480b052a. Looking at the
metrics across the ring, there were no blocked tasks nor dropped messages.

Iowait metrics look fine, so it doesn't appear to be blocking on disk.
Similarly, there are no long GC pauses.

We haven't noticed latency on any particular table higher than others or
correlated around the occurrence of a timeout. We have noticed with further
testing that running cassandra-stress against the ring, while our workload
is writing to the same ring, will incur similar 10 second timeouts. If our
workload is not writing to the ring, cassandra stress will run without
hitting timeouts. This seems to imply that our workload pattern is causing
something to block cluster-wide, since the stress tool writes to a
different keyspace than our workload.

I mentioned in another reply that we've tracked it to something between
2.0.x and 2.1.x, so we are focusing on narrowing which point release it was
introduced in.

Cheers,

Mike

On Thu, Feb 18, 2016 at 3:33 AM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:

> Hi Mike,
>
> What about the output of tpstats ? I imagine you have dropped messages
> there. Any blocked threads ? Could you paste this output here ?
>
> May this be due to some network hiccup to access the disks as they are EBS
> ? Can you think of anyway of checking this ? Do you have a lot of GC logs,
> how long are the pauses (use something like: grep -i 'GCInspector'
> /var/log/cassandra/system.log) ?
>
> Something else you could check is local_writes stats to see if only one
> table is affected or if this is keyspace / cluster wide. You can use metrics
> exposed by cassandra or if you have no dashboards I believe a: 'nodetool
> cfstats  | grep -e 'Table:' -e 'Local'' should give you a rough idea
> of local latencies.
>
> Those are just things I would check, I have not a clue on what is
> happening here, hope this will help.
>
> C*heers,
> -
> Alain Rodriguez
> France
>
> The Last Pickle
> http://www.thelastpickle.com
>
> 2016-02-18 5:13 GMT+01:00 Mike Heffner <m...@librato.com>:
>
>> Jaydeep,
>>
>> No, we don't use any light weight transactions.
>>
>> Mike
>>
>> On Wed, Feb 17, 2016 at 6:44 PM, Jaydeep Chovatia <
>> chovatia.jayd...@gmail.com> wrote:
>>
>>> Are you guys using light weight transactions in your write path?
>>>
>>> On Thu, Feb 11, 2016 at 12:36 AM, Fabrice Facorat <
>>> fabrice.faco...@gmail.com> wrote:
>>>
>>>> Are your commitlog and data on the same disk? If yes, you should put
>>>> commitlogs on a separate disk which doesn't have a lot of IO.
>>>>
>>>> Other IO may have a great impact on your commitlog writing and
>>>> may even block it.
>>>>
>>>> An example of impact IO may have, even for Async writes:
>>>>
>>>> https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic
>>>>
>>>> 2016-02-11 0:31 GMT+01:00 Mike Heffner <m...@librato.com>:
>>>> > Jeff,
>>>> >
>>>> > We have both commitlog and data on a 4TB EBS with 10k IOPS.
>>>> >
>>>> > Mike
>>>> >
>>>> > On Wed, Feb 10, 2016 at 5:28 PM, Jeff Jirsa <
>>>> jeff.ji...@crowdstrike.com>
>>>> > wrote:
>>>> >>
>>>> >> What disk size are you using?
>>>> >>
>>>> >>
>>>> >>
>>>> >> From: Mike Heffner
>>>> >> Reply-To: "user@cassandra.apache.org"
>>>> >> Date: Wednesday, February 10, 2016 at 2:24 PM
>>>> >> To: "user@cassandra.apache.org"
>>>> >> Cc: Peter Norton
>>>> >> Subject: Re: Debugging write timeouts on Cassandra 2.2.5
>>>> >>
>>>> >> Paulo,
>>>> >>
>>>> >> Thanks for the suggestion, we ran some tests against CMS and saw the
>>>> same
>>>> >> timeouts. On that note though, we are going to try doubling the
>>>> instance
>>>> >> sizes and testing with double the heap (even though current usage is
>>>> low).
>>>> >>
>>>> >> Mike
>>>> >>
>>>> >> On Wed, Feb 10, 2016 at 3:40 PM, Paulo Motta <
>>>> pauloricard...@gmail.com>
>>>> >> wrote:
>>>> >>>
>>>> >>> Are you using the same GC se

Re: Debugging write timeouts on Cassandra 2.2.5

2016-02-18 Thread Mike Heffner
Following up from our earlier post...

We have continued to do exhaustive testing and measuring of the numerous
hardware and configuration variables here. What we have uncovered is that
on identical hardware (including the configuration we run in production),
something between versions 2.0.17 and 2.1.13 introduced this write timeout
for our workload. We still aren't any closer to identifying the what or
why, but it is easily reproduced using our workload when we bump to the
2.1.x release line.

At the moment we are going to focus on hardening this new hardware
configuration using the 2.0.17 release and roll it out internally to some
of our production rings. We also want to bisect the 2.1.x release line to
find if there was a particular point release that introduced the timeout.
If anyone has suggestions for particular changes to look out for, we'd be
happy to focus a test on those first.

Thanks,

Mike

On Wed, Feb 10, 2016 at 2:51 PM, Mike Heffner <m...@librato.com> wrote:

> Hi all,
>
> We've recently embarked on a project to update our Cassandra
> infrastructure running on EC2. We are long time users of 2.0.x and are
> testing out a move to version 2.2.5 running on VPC with EBS. Our test setup
> is a 3 node, RF=3 cluster supporting a small write load (mirror of our
> staging load).
>
> We are writing at QUORUM and while p95's look good compared to our staging
> 2.0.x cluster, we are seeing frequent write operations that time out at the
> max write_request_timeout_in_ms (10 seconds). CPU across the cluster is <
> 10% and EBS write load is < 100 IOPS. Cassandra is running with the Oracle
> JDK 8u60 and we're using G1GC and any GC pauses are less than 500ms.
>
> We run on c4.2xl instances with GP2 EBS attached storage for data and
> commitlog directories. The nodes are using EC2 enhanced networking and have
> the latest Intel network driver module. We are running on HVM instances
> using Ubuntu 14.04.2.
>
> Our schema is 5 tables, all with COMPACT STORAGE. Each table is similar to
> the definition here: https://gist.github.com/mheffner/4d80f6b53ccaa24cc20a
>
> This is our cassandra.yaml:
> https://gist.github.com/mheffner/fea80e6e939dd483f94f#file-cassandra-yaml
>
> Like I mentioned we use 8u60 with G1GC and have used many of the GC
> settings in Al Tobey's tuning guide. This is our upstart config with JVM
> and other CPU settings:
> https://gist.github.com/mheffner/dc44613620b25c4fa46d
>
> We've used several of the sysctl settings from Al's guide as well:
> https://gist.github.com/mheffner/ea40d58f58a517028152
>
> Our client application is able to write using either Thrift batches using
> the Astyanax driver or CQL async INSERTs using the Datastax Java driver.
>
> For testing against Thrift (our legacy infra uses this) we write batches
> of anywhere from 6 to 1500 rows at a time. Our p99 for batch execution is
> around 45ms but our maximum (p100) sits less than 150ms except when it
> periodically spikes to the full 10 seconds.
>
> Testing the same write path using CQL writes instead demonstrates similar
> behavior. Low p99s except for periodic full timeouts. We enabled tracing
> for several operations but were unable to get a trace that completed
> successfully -- Cassandra started logging many messages as:
>
> INFO  [ScheduledTasks:1] - MessagingService.java:946 - _TRACE messages
> were dropped in last 5000 ms: 52499 for internal timeout and 0 for cross
> node timeout
>
> And all the traces contained rows with a "null" source_elapsed row:
> https://gist.githubusercontent.com/mheffner/1d68a70449bd6688a010/raw/0327d7d3d94c3a93af02b64212e3b7e7d8f2911b/trace.out
>
>
> We've exhausted as many configuration option permutations as we can
> think of. This cluster does not appear to be under any significant load and
> latencies seem to largely fall in two bands: low normal or max timeout.
> This seems to imply that something is getting stuck and timing out at the
> max write timeout.
>
> Any suggestions on what to look for? We had debug enabled for awhile but
> we didn't see any msg that pointed to something obvious. Happy to provide
> any more information that may help.
>
> We are pretty much at the point of sprinkling debug around the code to
> track down what could be blocking.
>
>
> Thanks,
>
> Mike
>
> --
>
>   Mike Heffner <m...@librato.com>
>   Librato, Inc.
>
>


-- 

  Mike Heffner <m...@librato.com>
  Librato, Inc.


Re: Debugging write timeouts on Cassandra 2.2.5

2016-02-17 Thread Mike Heffner
Jaydeep,

No, we don't use any light weight transactions.

Mike

On Wed, Feb 17, 2016 at 6:44 PM, Jaydeep Chovatia <
chovatia.jayd...@gmail.com> wrote:

> Are you guys using light weight transactions in your write path?
>
> On Thu, Feb 11, 2016 at 12:36 AM, Fabrice Facorat <
> fabrice.faco...@gmail.com> wrote:
>
>> Are your commitlog and data on the same disk? If yes, you should put
>> commitlogs on a separate disk which doesn't have a lot of IO.
>>
>> Other IO may have a great impact on your commitlog writing and
>> may even block it.
>>
>> An example of impact IO may have, even for Async writes:
>>
>> https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic
>>
>> 2016-02-11 0:31 GMT+01:00 Mike Heffner <m...@librato.com>:
>> > Jeff,
>> >
>> > We have both commitlog and data on a 4TB EBS with 10k IOPS.
>> >
>> > Mike
>> >
>> > On Wed, Feb 10, 2016 at 5:28 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com
>> >
>> > wrote:
>> >>
>> >> What disk size are you using?
>> >>
>> >>
>> >>
>> >> From: Mike Heffner
>> >> Reply-To: "user@cassandra.apache.org"
>> >> Date: Wednesday, February 10, 2016 at 2:24 PM
>> >> To: "user@cassandra.apache.org"
>> >> Cc: Peter Norton
>> >> Subject: Re: Debugging write timeouts on Cassandra 2.2.5
>> >>
>> >> Paulo,
>> >>
>> >> Thanks for the suggestion, we ran some tests against CMS and saw the
>> same
>> >> timeouts. On that note though, we are going to try doubling the
>> instance
>> >> sizes and testing with double the heap (even though current usage is
>> low).
>> >>
>> >> Mike
>> >>
>> >> On Wed, Feb 10, 2016 at 3:40 PM, Paulo Motta <pauloricard...@gmail.com
>> >
>> >> wrote:
>> >>>
>> >>> Are you using the same GC settings as the staging 2.0 cluster? If not,
>> >>> could you try using the default GC settings (CMS) and see if that
>> changes
>> >>> anything? This is just a wild guess, but there were reports before of
>> >>> G1-caused instabilities with small heap sizes (< 16GB - see
>> CASSANDRA-10403
>> >>> for more context). Please ignore if you already tried reverting back
>> to CMS.
>> >>>
>> >>> 2016-02-10 16:51 GMT-03:00 Mike Heffner <m...@librato.com>:
>> >>>>
>> >>>> Hi all,
>> >>>>
>> >>>> We've recently embarked on a project to update our Cassandra
>> >>>> infrastructure running on EC2. We are long time users of 2.0.x and
>> are
>> >>>> testing out a move to version 2.2.5 running on VPC with EBS. Our
>> test setup
>> >>>> is a 3 node, RF=3 cluster supporting a small write load (mirror of
>> our
>> >>>> staging load).
>> >>>>
>> >>>> We are writing at QUORUM and while p95's look good compared to our
>> >>>> staging 2.0.x cluster, we are seeing frequent write operations that
>> time out
>> >>>> at the max write_request_timeout_in_ms (10 seconds). CPU across the
>> cluster
>> >>>> is < 10% and EBS write load is < 100 IOPS. Cassandra is running with
>> the
>> >>>> Oracle JDK 8u60 and we're using G1GC and any GC pauses are less than
>> 500ms.
>> >>>>
>> >>>> We run on c4.2xl instances with GP2 EBS attached storage for data and
>> >>>> commitlog directories. The nodes are using EC2 enhanced networking
>> and have
>> >>>> the latest Intel network driver module. We are running on HVM
>> instances
>> >>>> using Ubuntu 14.04.2.
>> >>>>
>> >>>> Our schema is 5 tables, all with COMPACT STORAGE. Each table is
>> similar
>> >>>> to the definition here:
>> >>>> https://gist.github.com/mheffner/4d80f6b53ccaa24cc20a
>> >>>>
>> >>>> This is our cassandra.yaml:
>> >>>>
>> https://gist.github.com/mheffner/fea80e6e939dd483f94f#file-cassandra-yaml
>> >>>>
>> >>>> Like I mentioned we use 8u60 with G1GC and have used many of the GC
>> >>>> settings in Al Tobey's tuning guide. This is our upstart config with

Debugging write timeouts on Cassandra 2.2.5

2016-02-10 Thread Mike Heffner
Hi all,

We've recently embarked on a project to update our Cassandra infrastructure
running on EC2. We are long time users of 2.0.x and are testing out a move
to version 2.2.5 running on VPC with EBS. Our test setup is a 3 node, RF=3
cluster supporting a small write load (mirror of our staging load).

We are writing at QUORUM and while p95's look good compared to our staging
2.0.x cluster, we are seeing frequent write operations that time out at the
max write_request_timeout_in_ms (10 seconds). CPU across the cluster is <
10% and EBS write load is < 100 IOPS. Cassandra is running with the Oracle
JDK 8u60 and we're using G1GC and any GC pauses are less than 500ms.

We run on c4.2xl instances with GP2 EBS attached storage for data and
commitlog directories. The nodes are using EC2 enhanced networking and have
the latest Intel network driver module. We are running on HVM instances
using Ubuntu 14.04.2.

Our schema is 5 tables, all with COMPACT STORAGE. Each table is similar to
the definition here: https://gist.github.com/mheffner/4d80f6b53ccaa24cc20a

This is our cassandra.yaml:
https://gist.github.com/mheffner/fea80e6e939dd483f94f#file-cassandra-yaml

Like I mentioned we use 8u60 with G1GC and have used many of the GC
settings in Al Tobey's tuning guide. This is our upstart config with JVM
and other CPU settings:
https://gist.github.com/mheffner/dc44613620b25c4fa46d

We've used several of the sysctl settings from Al's guide as well:
https://gist.github.com/mheffner/ea40d58f58a517028152

Our client application is able to write using either Thrift batches using
the Astyanax driver or CQL async INSERTs using the Datastax Java driver.

For testing against Thrift (our legacy infra uses this) we write batches of
anywhere from 6 to 1500 rows at a time. Our p99 for batch execution is
around 45ms but our maximum (p100) sits less than 150ms except when it
periodically spikes to the full 10 seconds.
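
For clarity, that 10 second ceiling is just our configured write timeout in
cassandra.yaml (sketch of the relevant line below; the stock default is
2000 ms), so the affected requests appear to hang right up to the limit and
then fail:

# value we run with in this test cluster; Cassandra ships with 2000
write_request_timeout_in_ms: 10000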

Testing the same write path using CQL writes instead demonstrates similar
behavior. Low p99s except for periodic full timeouts. We enabled tracing
for several operations but were unable to get a trace that completed
successfully -- Cassandra started logging many messages as:

INFO  [ScheduledTasks:1] - MessagingService.java:946 - _TRACE messages were
dropped in last 5000 ms: 52499 for internal timeout and 0 for cross node
timeout

And all the traces contained rows with a "null" source_elapsed row:
https://gist.githubusercontent.com/mheffner/1d68a70449bd6688a010/raw/0327d7d3d94c3a93af02b64212e3b7e7d8f2911b/trace.out


We've exhausted as many configuration option permutations as we can think
of. This cluster does not appear to be under any significant load and
latencies seem to largely fall in two bands: low normal or max timeout.
This seems to imply that something is getting stuck and timing out at the
max write timeout.

Any suggestions on what to look for? We had debug enabled for awhile but we
didn't see any msg that pointed to something obvious. Happy to provide any
more information that may help.

We are pretty much at the point of sprinkling debug around the code to
track down what could be blocking.


Thanks,

Mike

-- 

  Mike Heffner <m...@librato.com>
  Librato, Inc.


Re: Debugging write timeouts on Cassandra 2.2.5

2016-02-10 Thread Mike Heffner
Paulo,

Thanks for the suggestion, we ran some tests against CMS and saw the same
timeouts. On that note though, we are going to try doubling the instance
sizes and testing with double the heap (even though current usage is low).

Mike

On Wed, Feb 10, 2016 at 3:40 PM, Paulo Motta <pauloricard...@gmail.com>
wrote:

> Are you using the same GC settings as the staging 2.0 cluster? If not,
> could you try using the default GC settings (CMS) and see if that changes
> anything? This is just a wild guess, but there were reports before of
> G1-caused instabilities with small heap sizes (< 16GB - see CASSANDRA-10403
> for more context). Please ignore if you already tried reverting back to CMS.
>
> 2016-02-10 16:51 GMT-03:00 Mike Heffner <m...@librato.com>:
>
>> Hi all,
>>
>> We've recently embarked on a project to update our Cassandra
>> infrastructure running on EC2. We are long time users of 2.0.x and are
>> testing out a move to version 2.2.5 running on VPC with EBS. Our test setup
>> is a 3 node, RF=3 cluster supporting a small write load (mirror of our
>> staging load).
>>
>> We are writing at QUORUM and while p95's look good compared to our
>> staging 2.0.x cluster, we are seeing frequent write operations that time
>> out at the max write_request_timeout_in_ms (10 seconds). CPU across the
>> cluster is < 10% and EBS write load is < 100 IOPS. Cassandra is running
>> with the Oracle JDK 8u60 and we're using G1GC and any GC pauses are less
>> than 500ms.
>>
>> We run on c4.2xl instances with GP2 EBS attached storage for data and
>> commitlog directories. The nodes are using EC2 enhanced networking and have
>> the latest Intel network driver module. We are running on HVM instances
>> using Ubuntu 14.04.2.
>>
>> Our schema is 5 tables, all with COMPACT STORAGE. Each table is similar
>> to the definition here:
>> https://gist.github.com/mheffner/4d80f6b53ccaa24cc20a
>>
>> This is our cassandra.yaml:
>> https://gist.github.com/mheffner/fea80e6e939dd483f94f#file-cassandra-yaml
>>
>> Like I mentioned we use 8u60 with G1GC and have used many of the GC
>> settings in Al Tobey's tuning guide. This is our upstart config with JVM
>> and other CPU settings:
>> https://gist.github.com/mheffner/dc44613620b25c4fa46d
>>
>> We've used several of the sysctl settings from Al's guide as well:
>> https://gist.github.com/mheffner/ea40d58f58a517028152
>>
>> Our client application is able to write using either Thrift batches using
>> the Astyanax driver or CQL async INSERTs using the Datastax Java driver.
>>
>> For testing against Thrift (our legacy infra uses this) we write batches
>> of anywhere from 6 to 1500 rows at a time. Our p99 for batch execution is
>> around 45ms but our maximum (p100) sits less than 150ms except when it
>> periodically spikes to the full 10 seconds.
>>
>> Testing the same write path using CQL writes instead demonstrates similar
>> behavior. Low p99s except for periodic full timeouts. We enabled tracing
>> for several operations but were unable to get a trace that completed
>> successfully -- Cassandra started logging many messages as:
>>
>> INFO  [ScheduledTasks:1] - MessagingService.java:946 - _TRACE messages
>> were dropped in last 5000 ms: 52499 for internal timeout and 0 for cross
>> node timeout
>>
>> And all the traces contained rows with a "null" source_elapsed row:
>> https://gist.githubusercontent.com/mheffner/1d68a70449bd6688a010/raw/0327d7d3d94c3a93af02b64212e3b7e7d8f2911b/trace.out
>>
>>
>> We've exhausted as many configuration option permutations as we can
>> think of. This cluster does not appear to be under any significant load and
>> latencies seem to largely fall in two bands: low normal or max timeout.
>> This seems to imply that something is getting stuck and timing out at the
>> max write timeout.
>>
>> Any suggestions on what to look for? We had debug enabled for awhile but
>> we didn't see any msg that pointed to something obvious. Happy to provide
>> any more information that may help.
>>
>> We are pretty much at the point of sprinkling debug around the code to
>> track down what could be blocking.
>>
>>
>> Thanks,
>>
>> Mike
>>
>> --
>>
>>   Mike Heffner <m...@librato.com>
>>   Librato, Inc.
>>
>>
>


-- 

  Mike Heffner <m...@librato.com>
  Librato, Inc.


Re: Debugging write timeouts on Cassandra 2.2.5

2016-02-10 Thread Mike Heffner
Jeff,

We have both commitlog and data on a 4TB EBS with 10k IOPS.

Mike

On Wed, Feb 10, 2016 at 5:28 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com>
wrote:

> What disk size are you using?
>
>
>
> From: Mike Heffner
> Reply-To: "user@cassandra.apache.org"
> Date: Wednesday, February 10, 2016 at 2:24 PM
> To: "user@cassandra.apache.org"
> Cc: Peter Norton
> Subject: Re: Debugging write timeouts on Cassandra 2.2.5
>
> Paulo,
>
> Thanks for the suggestion, we ran some tests against CMS and saw the same
> timeouts. On that note though, we are going to try doubling the instance
> sizes and testing with double the heap (even though current usage is low).
>
> Mike
>
> On Wed, Feb 10, 2016 at 3:40 PM, Paulo Motta <pauloricard...@gmail.com>
> wrote:
>
>> Are you using the same GC settings as the staging 2.0 cluster? If not,
>> could you try using the default GC settings (CMS) and see if that changes
>> anything? This is just a wild guess, but there were reports before of
>> G1-caused instabilities with small heap sizes (< 16GB - see CASSANDRA-10403
>> for more context). Please ignore if you already tried reverting back to CMS.
>>
>> 2016-02-10 16:51 GMT-03:00 Mike Heffner <m...@librato.com>:
>>
>>> Hi all,
>>>
>>> We've recently embarked on a project to update our Cassandra
>>> infrastructure running on EC2. We are long time users of 2.0.x and are
>>> testing out a move to version 2.2.5 running on VPC with EBS. Our test setup
>>> is a 3 node, RF=3 cluster supporting a small write load (mirror of our
>>> staging load).
>>>
>>> We are writing at QUORUM and while p95's look good compared to our
>>> staging 2.0.x cluster, we are seeing frequent write operations that time
>>> out at the max write_request_timeout_in_ms (10 seconds). CPU across the
>>> cluster is < 10% and EBS write load is < 100 IOPS. Cassandra is running
>>> with the Oracle JDK 8u60 and we're using G1GC and any GC pauses are less
>>> than 500ms.
>>>
>>> We run on c4.2xl instances with GP2 EBS attached storage for data and
>>> commitlog directories. The nodes are using EC2 enhanced networking and have
>>> the latest Intel network driver module. We are running on HVM instances
>>> using Ubuntu 14.04.2.
>>>
>>> Our schema is 5 tables, all with COMPACT STORAGE. Each table is similar
>>> to the definition here:
>>> https://gist.github.com/mheffner/4d80f6b53ccaa24cc20a
>>>
>>> This is our cassandra.yaml:
>>> https://gist.github.com/mheffner/fea80e6e939dd483f94f#file-cassandra-yaml
>>>
>>> Like I mentioned we use 8u60 with G1GC and have used many of the GC
>>> settings in Al Tobey's tuning guide. This is our upstart config with JVM
>>> and other CPU settings:
>>> https://gist.github.com/mheffner/dc44613620b25c4fa46d
>>>
>>> We've used several of the sysctl settings from Al's guide as well:
>>> https://gist.github.com/mheffner/ea40d58f58a517028152
>>>
>>> Our client application is able to write using either Thrift batches
>>> using the Astyanax driver or CQL async INSERTs using the Datastax Java driver.
>>>
>>> For testing against Thrift (our legacy infra uses this) we write batches
>>> of anywhere from 6 to 1500 rows at a time. Our p99 for batch execution is
>>> around 45ms but our maximum (p100) sits less than 150ms except when it
>>> periodically spikes to the full 10 seconds.
>>>
>>> Testing the same write path using CQL writes instead demonstrates
>>> similar behavior. Low p99s except for periodic full timeouts. We enabled
>>> tracing for several operations but were unable to get a trace that
>>> completed successfully -- Cassandra started logging many messages as:
>>>
>>> INFO  [ScheduledTasks:1] - MessagingService.java:946 - _TRACE messages
>>> were dropped in last 5000 ms: 52499 for internal timeout and 0 for cross
>>> node timeout
>>>
>>> And all the traces contained rows with a "null" source_elapsed row:
>>> https://gist.githubusercontent.com/mheffner/1d68a70449bd6688a010/raw/0327d7d3d94c3a93af02b64212e3b7e7d8f2911b/trace.out
>>>
>>>
>>> We've exhausted as many configuration option permutations as we can
>>> think of. This cluster does not appear to be under any significant load and
>>> latencies seem to largely fall in two bands: low normal or max timeout.
>>> This seems to imply that something is getting stuck and timing out at the
>>> max write timeout.
>>>
>>> Any suggestions on what to look for? We had debug enabled for awhile but
>>> we didn't see any msg that pointed to something obvious. Happy to provide
>>> any more information that may help.
>>>
>>> We are pretty much at the point of sprinkling debug around the code to
>>> track down what could be blocking.
>>>
>>>
>>> Thanks,
>>>
>>> Mike
>>>
>>> --
>>>
>>>   Mike Heffner <m...@librato.com>
>>>   Librato, Inc.
>>>
>>>
>>
>
>
> --
>
>   Mike Heffner <m...@librato.com>
>   Librato, Inc.
>
>


-- 

  Mike Heffner <m...@librato.com>
  Librato, Inc.


Re: Significant drop in storage load after 2.1.6-2.1.8 upgrade

2015-07-19 Thread Mike Heffner
Nate,

Thanks. I dug through the changes a bit more and I believe my original
observation may have been due to:

https://github.com/krummas/cassandra/commit/fbc47e3b950949a8aa191bc7e91eb6cb396fe6a8

from: https://issues.apache.org/jira/browse/CASSANDRA-9572

I had originally passed over it because we are not using DTCS, but it
matches since the upgrade appeared to only drop fully expired sstables.


Mike

On Sat, Jul 18, 2015 at 3:40 PM, Nate McCall n...@thelastpickle.com wrote:

 Perhaps https://issues.apache.org/jira/browse/CASSANDRA-9592 got
 compactions moving forward for you? This would explain the drop.

 However, the discussion on
 https://issues.apache.org/jira/browse/CASSANDRA-9683 seems to be similar
 to what you saw and that is currently being investigated.

 On Fri, Jul 17, 2015 at 10:24 AM, Mike Heffner m...@librato.com wrote:

 Hi all,

 I've been upgrading several of our rings from 2.1.6 to 2.1.8 and I've
 noticed that after the upgrade our storage load drops significantly (I've
 seen up to an 80% drop).

 I believe most of the data that is dropped is tombstoned (via TTL
 expiration) and I haven't detected any data loss yet. However, can someone
 point me to what changed between 2.1.6 and 2.1.8 that would lead to such a
 significant drop in tombstoned data? Looking at the changelog there's
 nothing that jumps out at me. This is a CF definition from one of the CFs
 that had a significant drop:

  describe measures_mid_1;

 CREATE TABLE Metrics.measures_mid_1 (
 key blob,
 c1 int,
 c2 blob,
 c3 blob,
 PRIMARY KEY (key, c1, c2)
 ) WITH COMPACT STORAGE
 AND CLUSTERING ORDER BY (c1 ASC, c2 ASC)
 AND bloom_filter_fp_chance = 0.01
 AND caching = '{keys:ALL, rows_per_partition:NONE}'
 AND comment = ''
 AND compaction = {'class':
 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
 AND compression = {'sstable_compression':
 'org.apache.cassandra.io.compress.LZ4Compressor'}
 AND dclocal_read_repair_chance = 0.1
 AND default_time_to_live = 0
 AND gc_grace_seconds = 0
 AND max_index_interval = 2048
 AND memtable_flush_period_in_ms = 0
 AND min_index_interval = 128
 AND read_repair_chance = 0.0
 AND speculative_retry = '99.0PERCENTILE';

 Thanks,

 Mike

 --

   Mike Heffner m...@librato.com
   Librato, Inc.




 --
 -
 Nate McCall
 Austin, TX
 @zznate

 Co-Founder & Sr. Technical Consultant
 Apache Cassandra Consulting
 http://www.thelastpickle.com




-- 

  Mike Heffner m...@librato.com
  Librato, Inc.


Significant drop in storage load after 2.1.6-2.1.8 upgrade

2015-07-17 Thread Mike Heffner
Hi all,

I've been upgrading several of our rings from 2.1.6 to 2.1.8 and I've
noticed that after the upgrade our storage load drops significantly (I've
seen up to an 80% drop).

I believe most of the data that is dropped is tombstoned (via TTL
expiration) and I haven't detected any data loss yet. However, can someone
point me to what changed between 2.1.6 and 2.1.8 that would lead to such a
significant drop in tombstoned data? Looking at the changelog there's
nothing that jumps out at me. This is a CF definition from one of the CFs
that had a significant drop:

 describe measures_mid_1;

CREATE TABLE Metrics.measures_mid_1 (
key blob,
c1 int,
c2 blob,
c3 blob,
PRIMARY KEY (key, c1, c2)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (c1 ASC, c2 ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class':
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression':
'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 0
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';

Thanks,

Mike

-- 

  Mike Heffner m...@librato.com
  Librato, Inc.


Re: How to column slice with CQL + 1.2

2014-07-18 Thread Mike Heffner
Tyler,

Cool, yes I was actually trying to solve that exact problem of paginating
with LIMIT when it ends up slicing in the middle of a set of composite
columns. (though sounds like automatic ResultSet paging in 2.0.x alleviates
that need).

So to do composite column slicing in 1.2.x the answer is to stick with
Thrift?

Mike


On Thu, Jul 17, 2014 at 8:27 PM, Tyler Hobbs ty...@datastax.com wrote:

 For this type of query, you really want the tuple notation introduced in
 2.0.6 (https://issues.apache.org/jira/browse/CASSANDRA-4851):

 SELECT * FROM CF WHERE key='X' AND (column1, column2, column3) > (1, 3, 4)
 AND (column1) <= (2)


 On Thu, Jul 17, 2014 at 6:01 PM, Mike Heffner m...@librato.com wrote:

 Michael,

 So if I switch to:

 SELECT * FROM CF WHERE key='X' AND column1=1 AND column2=3 AND column3>4

 That doesn't include rows where column1=2, which breaks the original
 slice query.

 Maybe a better way to put it, I would like:

 SELECT * FROM CF WHERE key='X' AND column1>=1 AND column2>=3 AND
 column3>4 AND column1<=2;

 but that is rejected with:

 Bad Request: PRIMARY KEY part column2 cannot be restricted (preceding
 part column1 is either not restricted or by a non-EQ relation)


 Mike



 On Thu, Jul 17, 2014 at 6:37 PM, Michael Dykman mdyk...@gmail.com
 wrote:

 The last term in this query is redundant.  Any time column1 = 1, we
 may reasonably expect that it is also <= 2, as that's where 1 is found.
 If you remove the last term, you eliminate the error and lose none of the
 selection logic.

 SELECT * FROM CF WHERE key='X' AND column1=1 AND column2=3 AND
 column3>4 AND column1<=2;

 On Thu, Jul 17, 2014 at 6:23 PM, Mike Heffner m...@librato.com wrote:
  What is the proper way to perform a column slice using CQL with 1.2?
 
  I have a CF with a primary key X and 3 composite columns (A, B, C).
 I'd like
  to find records at:
 
  key=X
  columns > (A=1, B=3, C=4) AND
 columns <= (A=2)
 
  The Query:
 
  SELECT * FROM CF WHERE key='X' AND column1=1 AND column2=3 AND
 column3>4 AND
  column1<=2;
 
  fails with:
 
  DoGetMeasures: column1 cannot be restricted by both an equal and an
 inequal
  relation
 
  This is against Cassandra 1.2.16.
 
  What is the proper way to perform this query?
 
 
  Cheers,
 
  Mike
 
  --
 
Mike Heffner m...@librato.com
Librato, Inc.
 



 --
  - michael dykman
  - mdyk...@gmail.com

  May the Source be with you.




 --

   Mike Heffner m...@librato.com
   Librato, Inc.




 --
 Tyler Hobbs
 DataStax http://datastax.com/




-- 

  Mike Heffner m...@librato.com
  Librato, Inc.


Re: How to column slice with CQL + 1.2

2014-07-17 Thread Mike Heffner
Michael,

So if I switch to:

SELECT * FROM CF WHERE key='X' AND column1=1 AND column2=3 AND column3>4

That doesn't include rows where column1=2, which breaks the original slice
query.

Maybe a better way to put it, I would like:

SELECT * FROM CF WHERE key='X' AND column1>=1 AND column2=3 AND column3>4
AND column1<=2;

but that is rejected with:

Bad Request: PRIMARY KEY part column2 cannot be restricted (preceding part
column1 is either not restricted or by a non-EQ relation)


Mike



On Thu, Jul 17, 2014 at 6:37 PM, Michael Dykman mdyk...@gmail.com wrote:

 The last term in this query is redundant.  Any time column1 = 1, we
 may reasonably expect that it is also <= 2 as that's where 1 is found.
 If you remove the last term, you eliminate the error and none of the
 selection logic.

 SELECT * FROM CF WHERE key='X' AND column1=1 AND column2=3 AND
 column3>4 AND column1<=2;

 On Thu, Jul 17, 2014 at 6:23 PM, Mike Heffner m...@librato.com wrote:
  What is the proper way to perform a column slice using CQL with 1.2?
 
  I have a CF with a primary key X and 3 composite columns (A, B, C). I'd
 like
  to find records at:
 
  key=X
  columns > (A=1, B=3, C=4) AND
 columns <= (A=2)
 
  The Query:
 
  SELECT * FROM CF WHERE key='X' AND column1=1 AND column2=3 AND column3>4
 AND
  column1<=2;
 
  fails with:
 
  DoGetMeasures: column1 cannot be restricted by both an equal and an
 inequal
  relation
 
  This is against Cassandra 1.2.16.
 
  What is the proper way to perform this query?
 
 
  Cheers,
 
  Mike
 
  --
 
Mike Heffner m...@librato.com
Librato, Inc.
 



 --
  - michael dykman
  - mdyk...@gmail.com

  May the Source be with you.




-- 

  Mike Heffner m...@librato.com
  Librato, Inc.


How to restart bootstrap after a failed streaming due to Broken Pipe (1.2.16)

2014-06-09 Thread Mike Heffner
Hi,

During an attempt to bootstrap a new node into a 1.2.16 ring the new node
saw one of the streaming nodes periodically disappear:

 INFO [GossipTasks:1] 2014-06-10 00:28:52,572 Gossiper.java (line 823)
InetAddress /10.156.1.2 is now DOWN
ERROR [GossipTasks:1] 2014-06-10 00:28:52,574 AbstractStreamSession.java
(line 108) Stream failed because /10.156.1.2 died or was restarted/removed
(streams may still be active in background, but further streams won't be
started)
 WARN [GossipTasks:1] 2014-06-10 00:28:52,574 RangeStreamer.java (line 246)
Streaming from /10.156.1.2 failed
 INFO [HANDSHAKE-/10.156.1.2] 2014-06-10 00:28:57,922
OutboundTcpConnection.java (line 418) Handshaking version with /10.156.1.2
 INFO [GossipStage:1] 2014-06-10 00:28:57,943 Gossiper.java (line 809)
InetAddress /10.156.1.2 is now UP

This brief interruption was enough to kill the streaming from node
10.156.1.2. Node 10.156.1.2 saw a similar broken pipe exception from the
bootstrapping node:

ERROR [Streaming to /10.156.193.1.3] 2014-06-10 01:22:02,345
CassandraDaemon.java (line 191) Exception in thread Thread[Streaming to /
10.156.1.3:1,5,main]
java.lang.RuntimeException: java.io.IOException: Broken pipe
at com.google.common.base.Throwables.propagate(Throwables.java:160)
at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
at
sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:420)
at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:552)
at
org.apache.cassandra.streaming.compress.CompressedFileStreamTask.stream(CompressedFileStreamTask.java:93)
at
org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:91)
at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)


During bootstrapping we notice a significant spike in CPU and latency
across the board on the ring (CPU 50-85% and write latencies 60ms -
250ms). It seems likely that this persistent high load led to the hiccup
that caused the gossiper to see the streaming node as briefly down.

What is the proper way to recover from this? The original estimate was
almost 24 hours to stream all the data required to bootstrap this single
node (streaming set to unlimited) and this occurred 6 hours into the
bootstrap. With such high load from streaming it seems that simply
restarting will inevitably hit this problem again.
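
For reference, the obvious mitigations before retrying would be something
like the following; the specific values are illustrative guesses on my part,
not tested numbers:

  # throttle outbound streaming on the existing nodes so bootstrap doesn't saturate them
  nodetool -h 10.156.1.2 setstreamthroughput 100

  # cassandra.yaml on all nodes: make the gossip failure detector more
  # tolerant of brief pauses (the default is 8)
  phi_convict_threshold: 12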


Cheers,

Mike

-- 

  Mike Heffner m...@librato.com
  Librato, Inc.


Re: Failed decommission

2013-08-25 Thread Mike Heffner
Janne,

We ran into this too. Appears it's a bug in 1.2.8 that is fixed in the
upcoming 1.2.9. I added the steps I took to finally remove the node here:
https://issues.apache.org/jira/browse/CASSANDRA-5857?focusedCommentId=13748998page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13748998
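
Before going down that path, it's worth confirming what gossip still thinks
about the stuck node; something like the following should show whether it is
still advertised as leaving (IP taken from Janne's example):

  nodetool gossipinfo | grep -A 5 10.0.0.3
  nodetool removenode status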


Cheers,

Mike


 On Sun, Aug 25, 2013 at 4:06 AM, Janne Jalkanen janne.jalka...@ecyrd.com wrote:

 This on cass 1.2.8

 Ring state before decommission

 --  Address    Load      Owns    Host ID                               Token                                    Rack
 UN  10.0.0.1   38.82 GB  33.3%   21a98502-dc74-4ad0-9689-0880aa110409  1                                        1a
 UN  10.0.0.2   33.5 GB   33.3%   cba6b27a-4982-4f04-854d-cc73155d5f69  56713727820156407428984779325531226110  1b
 UN  10.0.0.3   37.41 GB  0.0%    6ba2c7d4-713e-4c14-8df8-f861fb211b0d  56713727820156407428984779325531226111  1b
 UN  10.0.0.4   35.7 GB   33.3%   bf3d4792-f3e0-4062-afe3-be292bc85ed7  11342745564031281485796955865106245     1c

 Trying to decommission the node

 ubuntu@10.0.0.3:~$ nodetool decommission
 Exception in thread main java.lang.NumberFormatException: For input
 string: 56713727820156407428984779325531226111
 at
 java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
 at java.lang.Long.parseLong(Long.java:444)
 at java.lang.Long.parseLong(Long.java:483)
 at
 org.apache.cassandra.service.StorageService.extractExpireTime(StorageService.java:1660)
 at
 org.apache.cassandra.service.StorageService.handleStateLeft(StorageService.java:1515)
 at
 org.apache.cassandra.service.StorageService.onChange(StorageService.java:1234)
 at
 org.apache.cassandra.gms.Gossiper.doNotifications(Gossiper.java:949)
 at
 org.apache.cassandra.gms.Gossiper.addLocalApplicationState(Gossiper.java:1116)
 at
 org.apache.cassandra.service.StorageService.leaveRing(StorageService.java:2817)
 at
 org.apache.cassandra.service.StorageService.unbootstrap(StorageService.java:2861)
 at
 org.apache.cassandra.service.StorageService.decommission(StorageService.java:2808)

 Now I'm in a state where the machine is still up but leaving but I
 can't seem to get it out of the ring.  For example:

 % nodetool removenode 6ba2c7d4-713e-4c14-8df8-f861fb211b0d
 Exception in thread main java.lang.UnsupportedOperationException: Node /
 10.0.0.3 is alive and owns this ID. Use decommission command to remove it
 from the ring

 Any ideas?

 /Janne




-- 

  Mike Heffner m...@librato.com
  Librato, Inc.


Re: Decommission faster than bootstrap

2013-08-22 Thread Mike Heffner
We've also noticed fairly poor streaming performance during a bootstrap
operation, albeit with 1.2.x. Streaming takes much longer than the physical
hardware capacity, even with the limits set high or off:
https://issues.apache.org/jira/browse/CASSANDRA-5726
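
For concreteness, "limits set high or off" here means the runtime throttles,
roughly as below; a value of 0 disables the cap:

  nodetool setstreamthroughput 0
  nodetool setcompactionthroughput 0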


On Sun, Aug 18, 2013 at 6:19 PM, Rodrigo Felix 
rodrigofelixdealme...@gmail.com wrote:

 Hi,

    I've noticed that, at least in my environment (Cassandra 1.1.12 running
 on Amazon EC2), decommission operations take about 3-4 minutes while
 bootstrap can take more than 20 minutes.
    What is the reason for this time difference? For both operations,
 what is time-consuming is the data streaming from (or to) the other node, right?
Thanks in advance.

 Att.

 *Rodrigo Felix de Almeida*
 LSBD - Universidade Federal do Ceará
 Project Manager
 MBA, CSM, CSPO, SCJP




-- 

  Mike Heffner m...@librato.com
  Librato, Inc.


Re: High performance hardware with lot of data per node - Global learning about configuration

2013-07-11 Thread Mike Heffner
 seems coherent ? Right now, performance is
 correct, latency < 5ms almost all the time. What can I do to handle more
 data per node and keep these performances or get even better ones ?
 
  I know this is a long message but if you have any comment or insight
 even on part of it, don't hesitate to share it. I guess this kind of
 comment on configuration is usable by the entire community.
 
  Alain
 




-- 

  Mike Heffner m...@librato.com
  Librato, Inc.


Re: High performance hardware with lot of data per node - Global learning about configuration

2013-07-11 Thread Mike Heffner
Aiman,

I believe that is one of the cases we added a check for:

https://github.com/librato/tablesnap/blob/master/tablesnap#L203-L207
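
Roughly, the idea is to treat a file that vanished between the inotify event
and the upload attempt as a non-fatal skip instead of letting the bare except
kill the process. Paraphrasing the shape of the check from memory rather than
quoting the exact lines:

  import os

  def upload_sstable(self, bucket, keyname, filename):
      # sstables can legitimately disappear (e.g. after a compaction) before
      # we get to them, so skip this file rather than abort the whole daemon
      if not os.path.exists(filename):
          self.log.info('%s vanished before upload, skipping' % filename)
          return
      # ... continue with the normal S3 upload path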


Mike


 On Thu, Jul 11, 2013 at 1:54 PM, Aiman Parvaiz ai...@grapheffect.com wrote:

 Thanks for the info Mike, we ran in to a race condition which was killing
 table snap, I want to share the problem and the solution/ work around and
 may be someone can throw some light on the effects of the solution.

 tablesnap was getting killed with this error message:

 Failed uploading %s. Aborting.\n%s

 Looking at the code it took me to the following:

 def worker(self):
 bucket = self.get_bucket()

 while True:
 f = self.fileq.get()
 keyname = self.build_keyname(f)
 try:
 self.upload_sstable(bucket, keyname, f)
 except:
 self.log.critical("Failed uploading %s. Aborting.\n%s" %
  (f, format_exc()))
 # Brute force kill self
 os.kill(os.getpid(), signal.SIGKILL)

 self.fileq.task_done()


 It builds the filename and then before it could upload it, the file
 disappears (which is possible), I simply commented out the line which kills
 tablesnap if the file is not found, it fixes the issue we were having but I
 would appreciate if some one has any insights on any ill effects this might
 have on backup or restoration process.

 Thanks


 On Jul 11, 2013, at 7:03 AM, Mike Heffner m...@librato.com wrote:

 We've also noticed very good read and write latencies with the hi1.4xls
 compared to our previous instance classes. We actually ran a mixed cluster
 of hi1.4xls and m2.4xls to watch side-by-side comparison.

 Despite the significant improvement in underlying hardware, we've noticed
 that streaming performance with 1.2.6+vnodes is a lot slower than we would
 expect. Bootstrapping a node into a ring with large storage loads can take
 6+ hours. We have a JIRA open that describes our current config:
 https://issues.apache.org/jira/browse/CASSANDRA-5726

 Aiman: We also use tablesnap for our backups. We're using a slightly
 modified version [1]. We currently backup every sst as soon as they hit
 disk (tablesnap's inotify), but we're considering moving to a periodic
 snapshot approach as the sst churn after going from 24 nodes - 6 nodes is
 quite high.

 Mike


 [1]: https://github.com/librato/tablesnap


  On Thu, Jul 11, 2013 at 7:33 AM, Aiman Parvaiz ai...@grapheffect.com wrote:

 Hi,
 We also recently migrated to 3 hi.4xlarge boxes(Raid0 SSD) and the disk
 IO performance is definitely better than the earlier non SSD servers, we
 are serving up to 14k reads/s with a latency of 3-3.5 ms/op.
 I wanted to share our config options and ask about the data back up
 strategy for Raid0.

 We are using C* 1.2.6 with

 key_chache and row_cache of 300MB
 I have not changed/ modified any other parameter except for going with
 multithreaded GC. I will be playing around with other factors and update
 everyone if I find something interesting.

 Also, just wanted to share backup strategy and see if I can get something
 useful from how others are taking backup of their raid0. I am using
 tablesnap to upload SSTables to s3 and I have attached a separate EBS
 volume to every box and have set up rsync to mirror Cassandra data from
 Raid0 to EBS. I would really appreciate if you guys can share how you
 taking backups.

 Thanks


 On Jul 9, 2013, at 7:11 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

  Hi,
 
  Using C*1.2.2.
 
  We recently dropped our 18 m1.xLarge (4CPU, 15GB RAM, 4 Raid-0 Disks)
 servers to get 3 hi1.4xLarge (16CPU, 60GB RAM, 2 Raid-0 SSD) servers
 instead, for about the same price.
 
  We tried it after reading some benchmark published by Netflix.
 
  It is awesome and I recommend it to anyone who is using more than 18
 xLarge server or can afford these high cost / high performance EC2
 instances. SSD gives a very good throughput with an awesome latency.
 
  Yet, we had about 200 GB data per server and now about 1 TB.
 
  To alleviate memory pressure inside the heap I had to reduce the index
 sampling. I changed the index_interval value from 128 to 512, with no
 visible impact on latency, but a great improvement inside the heap which
 doesn't complain about any pressure anymore.
 
  Is there some more tuning I could use, more tricks that could be useful
 while using big servers, with a lot of data per node and relatively high
 throughput ?
 
  SSD are at 20-40 % of their throughput capacity (according to
 OpsCenter), CPU almost never reach a bigger load than 5 or 6 (with 16 CPU),
 15 GB RAM used out of 60GB.
 
  At this point I have kept my previous configuration, which is almost
 the default one from the Datastax community AMI. There is a part of it, you
 can consider that any property that is not in here is configured as default
 :
 
  cassandra.yaml
 
  key_cache_size_in_mb: (empty) - so default - 100MB (hit rate between 88

Re: High performance hardware with lot of data per node - Global learning about configuration

2013-07-09 Thread Mike Heffner
I'm curious because we are experimenting with a very similar configuration:
what basis did you use for expanding the index_interval to that value? Do
you have before and after numbers or was it simply reduction of the heap
pressure warnings that you looked for?
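
For our own comparison I don't have anything fancier in mind than a crude
before/after of settled heap usage, e.g.:

  nodetool info | grep -i heap
  # change index_interval in cassandra.yaml, restart, let the node settle, re-check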

thanks,

Mike


On Tue, Jul 9, 2013 at 10:11 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

 Hi,

 Using C*1.2.2.

 We recently dropped our 18 m1.xLarge (4CPU, 15GB RAM, 4 Raid-0 Disks)
 servers to get 3 hi1.4xLarge (16CPU, 60GB RAM, 2 Raid-0 SSD) servers
 instead, for about the same price.

 We tried it after reading some benchmark published by Netflix.

 It is awesome and I recommend it to anyone who is using more than 18
 xLarge server or can afford these high cost / high performance EC2
 instances. SSD gives a very good throughput with an awesome latency.

 Yet, we had about 200 GB data per server and now about 1 TB.

 To alleviate memory pressure inside the heap I had to reduce the index
 sampling. I changed the index_interval value from 128 to 512, with no
 visible impact on latency, but a great improvement inside the heap which
 doesn't complain about any pressure anymore.

 Is there some more tuning I could use, more tricks that could be useful
 while using big servers, with a lot of data per node and relatively high
 throughput ?

 SSD are at 20-40 % of their throughput capacity (according to OpsCenter),
 CPU almost never reach a bigger load than 5 or 6 (with 16 CPU), 15 GB RAM
 used out of 60GB.

 At this point I have kept my previous configuration, which is almost the
 default one from the Datastax community AMI. There is a part of it, you can
 consider that any property that is not in here is configured as default :

 cassandra.yaml

 key_cache_size_in_mb: (empty) - so default - 100MB (hit rate between 88 %
 and 92 %, good enough ?)
 row_cache_size_in_mb: 0 (not usable in our use case, a lot of different
 and random reads)
 flush_largest_memtables_at: 0.80
 reduce_cache_sizes_at: 0.90

 concurrent_reads: 32 (I am thinking to increase this to 64 or more since I
 have just a few servers to handle more concurrence)
 concurrent_writes: 32 (I am thinking to increase this to 64 or more too)
 memtable_total_space_in_mb: 1024 (to avoid having a full heap, should I use a
  bigger value, and why ?)

 rpc_server_type: sync (I tried hsha and had the "ERROR 12:02:18,971 Read
  an invalid frame size of 0. Are you using TFramedTransport on the client
  side?" error). No idea how to fix this, and I use 5 different clients for
 different purpose  (Hector, Cassie, phpCassa, Astyanax, Helenus)...

 multithreaded_compaction: false (Should I try enabling this since I now
 use SSD ?)
 compaction_throughput_mb_per_sec: 16 (I will definitely up this to 32 or
 even more)

 cross_node_timeout: true
 endpoint_snitch: Ec2MultiRegionSnitch

 index_interval: 512

 cassandra-env.sh

 I am not sure about how to tune the heap, so I mainly use defaults

 MAX_HEAP_SIZE=8G
 HEAP_NEWSIZE=400M (I tried with higher values, and it produced bigger GC
  times (1600 ms instead of < 200 ms now with 400M))

 -XX:+UseParNewGC
 -XX:+UseConcMarkSweepGC
 -XX:+CMSParallelRemarkEnabled
 -XX:SurvivorRatio=8
 -XX:MaxTenuringThreshold=1
 -XX:CMSInitiatingOccupancyFraction=70
 -XX:+UseCMSInitiatingOccupancyOnly

 Does this configuration seem coherent ? Right now, performance is
 correct, latency < 5ms almost all the time. What can I do to handle more
 data per node and keep these performances or get even better ones ?

 I know this is a long message but if you have any comment or insight even
 on part of it, don't hesitate to share it. I guess this kind of comment on
 configuration is usable by the entire community.

 Alain




-- 

  Mike Heffner m...@librato.com
  Librato, Inc.


Re: Streaming performance with 1.2.6

2013-07-02 Thread Mike Heffner
On Mon, Jul 1, 2013 at 10:06 PM, Mike Heffner m...@librato.com wrote:


 The only changes we've made to the config (aside from dirs/hosts) are:


Forgot to include we've changed this as well:

-partitioner: org.apache.cassandra.dht.Murmur3Partitioner
+partitioner: org.apache.cassandra.dht.RandomPartitioner


Cheers,

Mike
-- 

  Mike Heffner m...@librato.com
  Librato, Inc.


Re: Streaming performance with 1.2.6

2013-07-02 Thread Mike Heffner
Sankalp,

Parallel sstableloader streaming would definitely be valuable.

However, this ring is currently using vnodes and I was surprised to see
that a bootstrapping node only streamed from one node in the ring. My
understanding was that a bootstrapping node would stream from multiple
nodes in the ring.

We started with a 3 node/3 AZ, RF=3 ring. We then increased that to 6
nodes, adding one per AZ. The 4th, 5th and 6th nodes only streamed from the
node in their own AZ/rack which led to the serial sstable streaming. Is
this the correct behavior for the snitch? Is there an option to stream from
multiple replicas across the az/rack configuration?

Mike


 On Tue, Jul 2, 2013 at 1:53 PM, sankalp kohli kohlisank...@gmail.com wrote:

 This was a problem pre vnodes. I had several JIRA for that but some of
 them were voted down saying the performance will improve with vnodes.
 The main problem is that it streams one sstable at a time and not in
 parallel.

 Jira 4784 can speed up the bootstrap performance. You can also do a zero
 copy and not touch the caches of the nodes which are contributing in the
 build.


 https://issues.apache.org/jira/browse/CASSANDRA-4663
 https://issues.apache.org/jira/browse/CASSANDRA-4784


 On Tue, Jul 2, 2013 at 7:35 AM, Mike Heffner m...@librato.com wrote:


 On Mon, Jul 1, 2013 at 10:06 PM, Mike Heffner m...@librato.com wrote:


 The only changes we've made to the config (aside from dirs/hosts) are:


 Forgot to include we've changed this as well:

 -partitioner: org.apache.cassandra.dht.Murmur3Partitioner
 +partitioner: org.apache.cassandra.dht.RandomPartitioner


 Cheers,

 Mike
 --

   Mike Heffner m...@librato.com
   Librato, Inc.





-- 

  Mike Heffner m...@librato.com
  Librato, Inc.


Re: Streaming performance with 1.2.6

2013-07-02 Thread Mike Heffner
As a test, adding a 7th node in the first AZ will stream from both the two
existing nodes in the same AZ.

Aggregate streaming bandwidth at the 7th node is approximately 12 MB/sec
when all limits are set at 800 MB/sec, or about double what I saw streaming
from a single node. This would seem to indicate that the sending node is
limiting our streaming rate.
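
For anyone trying to reproduce the numbers, per-session progress is visible
with something like the following on both the sending and receiving nodes:

  nodetool netstats | grep -A 20 -i streaming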

Mike


On Tue, Jul 2, 2013 at 3:00 PM, Mike Heffner m...@librato.com wrote:

 Sankalp,

 Parallel sstableloader streaming would definitely be valuable.

 However, this ring is currently using vnodes and I was surprised to see
 that a bootstrapping node only streamed from one node in the ring. My
 understanding was that a bootstrapping node would stream from multiple
 nodes in the ring.

 We started with a 3 node/3 AZ, RF=3 ring. We then increased that to 6
 nodes, adding one per AZ. The 4th, 5th and 6th nodes only streamed from the
 node in their own AZ/rack which led to the serial sstable streaming. Is
 this the correct behavior for the snitch? Is there an option to stream from
 multiple replicas across the az/rack configuration?

 Mike


  On Tue, Jul 2, 2013 at 1:53 PM, sankalp kohli kohlisank...@gmail.com wrote:

 This was a problem pre vnodes. I had several JIRA for that but some of
 them were voted down saying the performance will improve with vnodes.
 The main problem is that it streams one sstable at a time and not in
 parallel.

 Jira 4784 can speed up the bootstrap performance. You can also do a zero
 copy and not touch the caches of the nodes which are contributing in the
 build.


 https://issues.apache.org/jira/browse/CASSANDRA-4663
 https://issues.apache.org/jira/browse/CASSANDRA-4784


 On Tue, Jul 2, 2013 at 7:35 AM, Mike Heffner m...@librato.com wrote:


 On Mon, Jul 1, 2013 at 10:06 PM, Mike Heffner m...@librato.com wrote:


 The only changes we've made to the config (aside from dirs/hosts) are:


 Forgot to include we've changed this as well:

 -partitioner: org.apache.cassandra.dht.Murmur3Partitioner
 +partitioner: org.apache.cassandra.dht.RandomPartitioner


 Cheers,

 Mike
 --

   Mike Heffner m...@librato.com
   Librato, Inc.





 --

   Mike Heffner m...@librato.com
   Librato, Inc.




-- 

  Mike Heffner m...@librato.com
  Librato, Inc.


Streaming performance with 1.2.6

2013-07-01 Thread Mike Heffner
Hi,

We've recently been testing some of the higher performance instance classes
on EC2, specifically the hi1.4xlarge, with Cassandra. For those that are
not familiar with them, they have two SSD disks and 10 gige.

While we have observed much improved raw performance over our current
instances, we are seeing a fairly large gap between Cassandra and raw
performance. We have particularly noticed a gap in the streaming
performance when bootstrapping a new node. I wanted to ensure that we have
configured these instances correctly to get the best performance out of
Cassandra.

When bootstrapping a new node into a small ring with a 35GB streaming
payload, we see a 5-8 MB/sec max streaming rate joining the new node to the
ring. We are using 1.2.6 with 256 token vnode support. In our tests the
ring is small enough so all streaming occurs from a single node.

To test hardware performance for this use case, we ran an rsync of the
sstables from one node to the next (to/from the same file systems) and
observed a consistent rate of 115 MB/sec.

The only changes we've made to the config (aside from dirs/hosts) are:

-concurrent_reads: 32
-concurrent_writes: 32
+concurrent_reads: 128 # 32
+concurrent_writes: 128 # 32

-rpc_server_type: sync
+rpc_server_type: hsha # sync

-compaction_throughput_mb_per_sec: 16
+compaction_throughput_mb_per_sec: 256 # 16

-read_request_timeout_in_ms: 1
+read_request_timeout_in_ms: 6000 # 1

-endpoint_snitch: SimpleSnitch
+endpoint_snitch: Ec2Snitch # SimpleSnitch

-internode_compression: all
+internode_compression: none

We use a 10G heap with a 2G new size. We are using the Oracle 1.7.0_25 JVM.

I've adjusted our streaming throughput limit from 200MB/sec up to 800MB/sec
on both the sending and receiving streaming nodes, but that doesn't appear
to make a difference.
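
For clarity, the limit here is the outbound stream throughput cap, which can
be set either in cassandra.yaml or at runtime; note the yaml parameter is
expressed in megabits, so treat the exact value below as illustrative:

  # cassandra.yaml
  stream_throughput_outbound_megabits_per_sec: 800

  # or on a live node, no restart needed
  nodetool setstreamthroughput 800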

The disks are raid0 (2 * 1T SSD) with 512 read ahead, XFS.

The nodes in the ring are running about 23% CPU on average, with spikes up
to a maximum of 45% CPU.

As I mentioned, on the same boxes with the same workloads, I've seen up to
115 MB/sec transfers with rsync.


Any suggestions for what to adjust to see better streaming performance? 5%
of what a single rsync can do seems somewhat limited.


Thanks,

Mike


-- 

  Mike Heffner m...@librato.com
  Librato, Inc.


Re: Upgrade 1.1.2 - 1.1.6

2012-11-20 Thread Mike Heffner
Alain,

My understanding is that drain ensures that all memtables are flushed, so
that there is no data in the commitlog that isn't in an sstable. A
marker is saved that indicates the commit logs should not be replayed.
Commitlogs are only removed from disk periodically
(after commitlog_total_space_in_mb is exceeded?).

With 1.1.5/6, all nanotime commitlogs are replayed on startup regardless of
whether they've been flushed. So in our case manually removing all the
commitlogs after a drain was the only way to prevent their replay.

Mike




On Tue, Nov 20, 2012 at 5:19 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

 @Mike

 I am glad to see I am not the only one with this issue (even if I am sorry
 it happened to you of course.).

 Isn't drain supposed to clear the commit logs ? Did removing them worked
 properly ?

  In his warning to C* users, Jonathan Ellis said that a drain would avoid
  this issue. It seems like it doesn't.

 @Rob

 You understood precisely the 2 issues I met during the upgrade. I am sad
 to see none of them is yet resolved and probably wont.


 2012/11/20 Mike Heffner m...@librato.com

 Alain,

 We performed a 1.1.3 - 1.1.6 upgrade and found that all the logs
 replayed regardless of the drain. After noticing this on the first node, we
 did the following:

 * nodetool flush
 * nodetool drain
 * service cassandra stop
 * mv /path/to/logs/*.log /backup/
 * apt-get install cassandra
 restarts automatically

 I also agree that starting C* after an upgrade/install seems quite broken
 if it was already stopped before the install. However annoying, I have
 found this to be the default for most Ubuntu daemon packages.

 Mike


 On Thu, Nov 15, 2012 at 9:21 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

 We had an issue with counters over-counting even using the nodetool
 drain command before upgrading...

 Here is my bash history

69  cp /etc/cassandra/cassandra.yaml /etc/cassandra/cassandra.yaml.bak
70  cp /etc/cassandra/cassandra-env.sh
 /etc/cassandra/cassandra-env.sh.bak
71  sudo apt-get install cassandra
72  nodetool disablethrift
73  nodetool drain
74  service cassandra stop
75  cat /etc/cassandra/cassandra-env.sh
 /etc/cassandra/cassandra-env.sh.bak
76  vim /etc/cassandra/cassandra-env.sh
77  cat /etc/cassandra/cassandra.yaml
 /etc/cassandra/cassandra.yaml.bak
78  vim /etc/cassandra/cassandra.yaml
79  service cassandra start

 So I think I followed these steps
 http://www.datastax.com/docs/1.1/install/upgrading#upgrade-steps

 I merged my conf files with an external tool so consider I merged my
 conf files on steps 76 and 78.

 I saw that the sudo apt-get install cassandra stop the server and
 restart it automatically. So it updated without draining and restart before
 I had the time to reconfigure the conf files. Is this normal ? Is there a
 way to avoid it ?

 So for the second node I decided to try to stop C*before the upgrade.

   125  cp /etc/cassandra/cassandra.yaml /etc/cassandra/cassandra.yaml.bak
   126  cp /etc/cassandra/cassandra-env.sh
 /etc/cassandra/cassandra-env.sh.bak
   127  nodetool disablegossip
   128  nodetool disablethrift
   129  nodetool drain
   130  service cassandra stop
   131  sudo apt-get install cassandra

 //131 : This restarted cassandra

   132  nodetool disablethrift
   133  nodetool disablegossip
   134  nodetool drain
   135  service cassandra stop
   136  cat /etc/cassandra/cassandra-env.sh
 /etc/cassandra/cassandra-env.sh.bak
   137  cim /etc/cassandra/cassandra-env.sh
   138  vim /etc/cassandra/cassandra-env.sh
   139  cat /etc/cassandra/cassandra.yaml
 /etc/cassandra/cassandra.yaml.bak
   140  vim /etc/cassandra/cassandra.yaml
   141  service cassandra start

 After both of these updates I saw my current counters increase without
 any reason.

 Did I do anything wrong ?

 Alain




 --

   Mike Heffner m...@librato.com
   Librato, Inc.






-- 

  Mike Heffner m...@librato.com
  Librato, Inc.


Re: Upgrade 1.1.2 - 1.1.6

2012-11-20 Thread Mike Heffner
On Tue, Nov 20, 2012 at 2:49 PM, Rob Coli rc...@palominodb.com wrote:

 On Mon, Nov 19, 2012 at 7:18 PM, Mike Heffner m...@librato.com wrote:
  We performed a 1.1.3 - 1.1.6 upgrade and found that all the logs
 replayed
  regardless of the drain.

 Your experience and desire for different (expected) behavior is welcomed
 on :

 https://issues.apache.org/jira/browse/CASSANDRA-4446

 nodetool drain sometimes doesn't mark commitlog fully flushed

 If every production operator who experiences this issue shares their
 experience on this bug, perhaps the project will acknowledge and
 address it.


Well in this case I think our issue was that upgrading from nanotime -> epoch
seconds, by definition, replays all commit logs. That's not due to any
specific problem with nodetool drain not marking commitlog's flushed, but a
safety to ensure data is not lost due to buggy nanotime implementations.

For us, it was that the upgrade instructions for pre-1.1.5 -> 1.1.6 didn't
mention that CL's should be removed if successfully drained. On the other
hand, we do not use counters so replaying them was merely a much longer
MTT-Return after restarting with 1.1.6.

Mike

-- 

  Mike Heffner m...@librato.com
  Librato, Inc.


Re: Upgrade 1.1.2 - 1.1.6

2012-11-19 Thread Mike Heffner
Alain,

We performed a 1.1.3 - 1.1.6 upgrade and found that all the logs replayed
regardless of the drain. After noticing this on the first node, we did the
following:

* nodetool flush
* nodetool drain
* service cassandra stop
* mv /path/to/logs/*.log /backup/
* apt-get install cassandra
restarts automatically

I also agree that starting C* after an upgrade/install seems quite broken
if it was already stopped before the install. However annoying, I have
found this to be the default for most Ubuntu daemon packages.
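
One possible workaround, offered as a sketch rather than something we've
verified in this upgrade, is the stock Debian policy-rc.d mechanism, which
blocks invoke-rc.d from starting services while the package installs:

  printf '#!/bin/sh\nexit 101\n' | sudo tee /usr/sbin/policy-rc.d
  sudo chmod +x /usr/sbin/policy-rc.d
  sudo apt-get install cassandra   # postinst can no longer start the service
  sudo rm /usr/sbin/policy-rc.d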

Mike


On Thu, Nov 15, 2012 at 9:21 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

 We had an issue with counters over-counting even using the nodetool drain
 command before upgrading...

 Here is my bash history

69  cp /etc/cassandra/cassandra.yaml /etc/cassandra/cassandra.yaml.bak
70  cp /etc/cassandra/cassandra-env.sh
 /etc/cassandra/cassandra-env.sh.bak
71  sudo apt-get install cassandra
72  nodetool disablethrift
73  nodetool drain
74  service cassandra stop
75  cat /etc/cassandra/cassandra-env.sh
 /etc/cassandra/cassandra-env.sh.bak
76  vim /etc/cassandra/cassandra-env.sh
77  cat /etc/cassandra/cassandra.yaml /etc/cassandra/cassandra.yaml.bak
78  vim /etc/cassandra/cassandra.yaml
79  service cassandra start

 So I think I followed these steps
 http://www.datastax.com/docs/1.1/install/upgrading#upgrade-steps

 I merged my conf files with an external tool so consider I merged my conf
 files on steps 76 and 78.

 I saw that the sudo apt-get install cassandra stop the server and
 restart it automatically. So it updated without draining and restart before
 I had the time to reconfigure the conf files. Is this normal ? Is there a
 way to avoid it ?

 So for the second node I decided to try to stop C*before the upgrade.

   125  cp /etc/cassandra/cassandra.yaml /etc/cassandra/cassandra.yaml.bak
   126  cp /etc/cassandra/cassandra-env.sh
 /etc/cassandra/cassandra-env.sh.bak
   127  nodetool disablegossip
   128  nodetool disablethrift
   129  nodetool drain
   130  service cassandra stop
   131  sudo apt-get install cassandra

 //131 : This restarted cassandra

   132  nodetool disablethrift
   133  nodetool disablegossip
   134  nodetool drain
   135  service cassandra stop
   136  cat /etc/cassandra/cassandra-env.sh
 /etc/cassandra/cassandra-env.sh.bak
   137  cim /etc/cassandra/cassandra-env.sh
   138  vim /etc/cassandra/cassandra-env.sh
   139  cat /etc/cassandra/cassandra.yaml /etc/cassandra/cassandra.yaml.bak
   140  vim /etc/cassandra/cassandra.yaml
   141  service cassandra start

 After both of these updates I saw my current counters increase without any
 reason.

 Did I do anything wrong ?

 Alain




-- 

  Mike Heffner m...@librato.com
  Librato, Inc.


Re: Hinted Handoff runs every ten minutes

2012-11-08 Thread Mike Heffner
Is there a ticket open for this for 1.1.6?

We also noticed this after upgrading from 1.1.3 to 1.1.6. Every node runs a
0 row hinted handoff every 10 minutes. N-1 nodes hint to the same node,
while that node hints to another node.
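
Per the suggestion quoted below, the quick check for a leftover tombstoned
row in the hints CF would be something like this in cassandra-cli:

  use system;
  list HintsColumnFamily;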


On Tue, Oct 30, 2012 at 1:35 PM, Vegard Berget p...@fantasista.no wrote:

 Hi,

 I have the exact same problem with 1.1.6.  HintsColumnFamily consists of
 one row (Rowkey 00, nothing more).   The problem started after upgrading
 from 1.1.4 to 1.1.6.  Every ten minutes HintedHandoffManager starts and
 finishes  after sending 0 rows.

 .vegard,



 - Original Message -
 From:
 user@cassandra.apache.org

 To:
 user@cassandra.apache.org
 Cc:

 Sent:
 Mon, 29 Oct 2012 23:45:30 +0100

 Subject:
 Re: Hinted Handoff runs every ten minutes


 On 29.10.2012 23:24, Stephen Pierce wrote:
  I'm running 1.1.5; the bug says it's fixed in 1.0.9/1.1.0.
 
  How can I check to see why it keeps running HintedHandoff?
you have a tombstone in system.HintsColumnFamily; use the list command in
cassandra-cli to check




-- 

  Mike Heffner m...@librato.com
  Librato, Inc.


Re: Migrating data from a 0.8.8 - 1.1.2 ring

2012-07-24 Thread Mike Heffner
On Mon, Jul 23, 2012 at 1:25 PM, Mike Heffner m...@librato.com wrote:

 Hi,

 We are migrating from a 0.8.8 ring to a 1.1.2 ring and we are noticing
 missing data post-migration. We use pre-built/configured AMIs so our
 preferred route is to leave our existing production 0.8.8 untouched and
 bring up a parallel 1.1.2 ring and migrate data into it. Data is written to
 the rings via batch processes so we can easily assure that both the
 existing and new rings will have the same data post migration.

 snip


 The steps we are taking are:

 1. Bring up a 1.1.2 ring in the same AZ/data center configuration with
 tokens matching the corresponding nodes in the 0.8.8 ring.
 2. Create the same keyspace on 1.1.2.
 3. Create each CF in the keyspace on 1.1.2.
 4. Flush each node of the 0.8.8 ring.
 5. Rsync each non-compacted sstable from 0.8.8 to the corresponding node
 in 1.1.2.
 6. Move each 0.8.8 sstable into the 1.1.2 directory structure by renaming
 the file to the  /cassandra/data/keyspace/cf/keyspace-cf... format.
 For example, for the keyspace Metrics and CF epochs_60 we get:
 cassandra/data/Metrics/epochs_60/Metrics-epochs_60-g-941-Data.db.
 7. On each 1.1.2 node run `nodetool -h localhost refresh Metrics CF` for
 each CF in the keyspace. We notice that storage load jumps accordingly.
 8. On each 1.1.2 node run `nodetool -h localhost upgradesstables`. This
 takes awhile but appears to correctly rewrite each sstable in the new 1.1.x
 format. Storage load drops as sstables are compressed.


So, after some further testing we've observed that the `upgradesstables`
command is removing data from the sstables, leading to our missing data.
We've repeated the steps above with several variations:

WORKS refresh -> scrub
WORKS refresh -> scrub -> major compaction

FAILS refresh -> upgradesstables
FAILS refresh -> scrub -> upgradesstables
FAILS refresh -> scrub -> major compaction -> upgradesstables

So, we are able to migrate our test CFs from a 0.8.8 ring to a 1.1.2 ring
when we use scrub. However, whenever we run an upgradesstables command the
sstables are shrunk significantly and our tests show missing data:

 INFO [CompactionExecutor:4] 2012-07-24 04:27:36,837 CompactionTask.java
(line 109) Compacting
[SSTableReader(path='/raid0/cassandra/data/Metrics/metrics_900/Metrics-metrics_900-hd-51-Data.db')]
 INFO [CompactionExecutor:4] 2012-07-24 04:27:51,090 CompactionTask.java
(line 221) Compacted to
[/raid0/cassandra/data/Metrics/metrics_900/Metrics-metrics_900-hd-58-Data.db,].
 60,449,155 to 2,578,102 (~4% of original) bytes for 4,002 keys at
0.172562MB/s.  Time: 14,248ms.

Is there a scenario where upgradesstables would remove data that a scrub
command wouldn't? According to the documentation, it would appear that the
scrub command is actually more destructive than upgradesstables in terms of
removing data. On 1.1.x, upgradesstables is the documented upgrade command
over a scrub.

The keyspace is defined as:

Keyspace: Metrics:
  Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
  Durable Writes: true
Options: [us-east:3]

And the column family above defined as:

ColumnFamily: metrics_900
  Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type
  Default column value validator:
org.apache.cassandra.db.marshal.BytesType
  Columns sorted by:
org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type)
  GC grace seconds: 0
  Compaction min/max thresholds: 4/32
  Read repair chance: 0.1
  DC Local Read repair chance: 0.0
  Replicate on write: true
  Caching: KEYS_ONLY
  Bloom Filter FP chance: default
  Built indexes: []
  Compaction Strategy:
org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
  Compression Options:
sstable_compression:
org.apache.cassandra.io.compress.SnappyCompressor

All rows have a TTL of 30 days, so it's possible that, along with the
gc_grace=0, a small number would be removed during a
compaction/scrub/upgradesstables step. However, the majority should still
be kept as their TTL has not expired yet.
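
A rough way to sanity-check that would be to dump a few rows from one of the
shrunken sstables and eyeball the expiration metadata, e.g. with the
sstable2json tool that ships with 1.1.x (file name taken from the compaction
log above):

  sstable2json /raid0/cassandra/data/Metrics/metrics_900/Metrics-metrics_900-hd-58-Data.db | head -c 2000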

We are still experimenting to see under what conditions this happens, but I
thought I'd send out some more info in case there is something clearly
wrong we're doing here.


Thanks,

Mike
-- 

  Mike Heffner m...@librato.com
  Librato, Inc.


Migrating data from a 0.8.8 - 1.1.2 ring

2012-07-23 Thread Mike Heffner
Hi,

We are migrating from a 0.8.8 ring to a 1.1.2 ring and we are noticing
missing data post-migration. We use pre-built/configured AMIs so our
preferred route is to leave our existing production 0.8.8 untouched and
bring up a parallel 1.1.2 ring and migrate data into it. Data is written to
the rings via batch processes so we can easily assure that both the
existing and new rings will have the same data post migration.

The ring we are migrating from is:

  * 12 nodes
  * single data-center, 3 AZs
  * 0.8.8

The ring we are migrating to is the same except 1.1.2.

The steps we are taking are:

1. Bring up a 1.1.2 ring in the same AZ/data center configuration with
tokens matching the corresponding nodes in the 0.8.8 ring.
2. Create the same keyspace on 1.1.2.
3. Create each CF in the keyspace on 1.1.2.
4. Flush each node of the 0.8.8 ring.
5. Rsync each non-compacted sstable from 0.8.8 to the corresponding node in
1.1.2.
6. Move each 0.8.8 sstable into the 1.1.2 directory structure by renaming
the file to the  /cassandra/data/keyspace/cf/keyspace-cf... format.
For example, for the keyspace Metrics and CF epochs_60 we get:
cassandra/data/Metrics/epochs_60/Metrics-epochs_60-g-941-Data.db.
7. On each 1.1.2 node run `nodetool -h localhost refresh Metrics CF` for
each CF in the keyspace. We notice that storage load jumps accordingly.
8. On each 1.1.2 node run `nodetool -h localhost upgradesstables`. This
takes awhile but appears to correctly rewrite each sstable in the new 1.1.x
format. Storage load drops as sstables are compressed.
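
In shell form, steps 4 through 8 work out to roughly the following per node
(hostnames and paths are illustrative, using epochs_60 as the example CF):

  # step 4: flush on the 0.8.8 node
  nodetool -h old-088-node flush

  # step 5: copy the flushed sstables over (run on the 1.1.2 node)
  rsync -av old-088-node:/cassandra/data/Metrics/ /tmp/migrate/Metrics/

  # step 6: prefix the 0.8.8 file names with the keyspace and drop them into the per-CF dir
  for f in /tmp/migrate/Metrics/epochs_60-*; do
    mv "$f" "/cassandra/data/Metrics/epochs_60/Metrics-$(basename "$f")"
  done

  # steps 7 and 8
  nodetool -h localhost refresh Metrics epochs_60
  nodetool -h localhost upgradesstables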

After these steps we run a script that validates data on the new ring. What
we've noticed is that large portions of the data that was on the 0.8.8 is
not available on the 1.1.2 ring. We've tried reading at both quorum and
ONE, but the resulting data appears missing in both cases.

We have fewer than 143 million row keys in the CFs we're testing and none
of the *-Filter.db files are > 10MB, so I don't believe this is our
problem: https://issues.apache.org/jira/browse/CASSANDRA-3820

Anything else to test or verify? Are the steps above correct for this type of
upgrade? Is this type of upgrade/migration supported?

We have also tried running a repair across the cluster after step #8. While
it took a few retries due to
https://issues.apache.org/jira/browse/CASSANDRA-4456, we still had missing
data afterwards.

Any assistance would be appreciated.


Thanks!

Mike

-- 

  Mike Heffner m...@librato.com
  Librato, Inc.


Wildcard character for CF in access.properties?

2011-04-14 Thread Mike Heffner

Is there a wildcard for the COLUMNFAMILY field in `access.properties`?
I'd like to split read-write and read-only access between my backend and 
frontend users, respectively, however the full list of CFs is not known 
a priori.


I'm using 0.7.4.


Cheers,


Mike


--

  Mike Heffner m...@librato.com
  Librato, Inc.