Re: ReadStage filling up and leading to Read Timeouts

2019-02-05 Thread Rajsekhar Mallick
Thank you Jeff for the link.
Please do comment on the G1GC settings, if they are OK for the cluster.
Also, please comment on reducing concurrent reads to 32 on all nodes in the
cluster, as the current setting has earlier led to reads getting dropped.
Will adding nodes to the cluster be helpful?

Thanks,
Rajsekhar Mallick



On Wed, 6 Feb, 2019, 1:12 PM Jeff Jirsa 
> https://docs.datastax.com/en/developer/java-driver/3.2/manual/paging/
>
>
> --
> Jeff Jirsa
>
>
> On Feb 5, 2019, at 11:33 PM, Rajsekhar Mallick 
> wrote:
>
> Hello Jeff,
>
> Thanks for the reply.
> We do have GC logs enabled.
> We do observe GC pauses of up to 2 seconds, but quite often we see this issue
> even when the GC log looks clean.
>
> JVM Flags related to G1GC:
>
> -Xms48G
> -Xmx48G
> -XX:MaxGCPauseMillis=200
> -XX:ParallelGCThreads=32
> -XX:ConcGCThreads=10
> -XX:InitiatingHeapOccupancyPercent=50
>
> You talked about dropping application page size. Please do elaborate on
> how to change the same.
> Reducing the concurrent reads to 32 does help as we have tried the
> same...the cpu load average remains under threshold...but read timeout
> keeps on happening.
>
> We will definitely try increasing the key cache sizes after verifying the
> current max heap usage in the cluster.
>
> Thanks,
> Rajsekhar Mallick
>
> On Wed, 6 Feb, 2019, 11:17 AM Jeff Jirsa 
>> What you're potentially seeing is the GC impact of reading a large
>> partition - do you have GC logs or StatusLogger output indicating you're
>> pausing? What are your actual JVM flags?
>>
>> Given your heap size, the easiest mitigation may be significantly
>> increasing your key cache size (up to a gigabyte or two, if needed).
>>
>> Yes, when you read data, it's materialized in memory (iterators from each
>> sstable are merged and sent to the client), so reading lots of rows from a
>> wide partition can cause GC pressure just from materializing the responses.
>> Dropping your application's paging size could help if this is the problem.
>>
>> You may be able to drop concurrent reads from 64 to something lower
>> (potentially 48 or 32, given your core count) to mitigate GC impact from
>> lots of objects when you have a lot of concurrent reads, or consider
>> upgrading to 3.11.4 (when it's out) to take advantage of CASSANDRA-11206
>> (which made reading wide partitions less expensive). STCS especially won't
>> help here - a large partition may be larger than you think, if it's
>> spanning a lot of sstables.
>>
>>
>>
>>
>> On Tue, Feb 5, 2019 at 9:30 PM Rajsekhar Mallick 
>> wrote:
>>
>>> Hello Team,
>>>
>>> Cluster Details:
>>> 1. Number of Nodes in cluster : 7
>>> 2. Number of CPU cores: 48
>>> 3. Swap is enabled on all nodes
>>> 4. Memory available on all nodes : 120GB
>>> 5. Disk space available : 745GB
>>> 6. Cassandra version: 2.1
>>> 7. Active tables are using size-tiered compaction strategy
>>> 8. Read Throughput: 6000 reads/s on each node (42000 reads/s cluster
>>> wide)
>>> 9. Read latency 99%: 300 ms
>>> 10. Write Throughput : 1800 writes/s
>>> 11. Write Latency 99%: 50 ms
>>> 12. Known issues in the cluster ( Large Partitions(upto 560MB, observed
>>> when they get compacted), tombstones)
>>> 13. To reduce the impact of tombstones, gc_grace_seconds set to 0 for
>>> the active tables
>>> 14. Heap size: 48 GB G1GC
>>> 15. Read timeout : 5000ms , Write timeouts: 2000ms
>>> 16. Number of concurrent reads: 64
>>> 17. Number of connections from clients on port 9042 stays almost
>>> constant (close to 1800)
>>> 18. Cassandra thread count also stays almost constant (close to 2000)
>>>
>>> Problem Statement:
>>> 1. ReadStage often gets full (reaches max size 64) on 2 to 3 nodes and
>>> pending reads go upto 4000.
>>> 2. When the above happens Native-Transport-Stage gets full on
>>> neighbouring nodes(1024 max) and pending threads are also observed.
>>> 3. During this time, CPU load average rises, user % for Cassandra
>>> process reaches 90%
>>> 4. We see Read getting dropped, org.apache.cassandra.transport package
>>> errors of reads getting timeout is seen.
>>> 5. Read latency 99% reached 5seconds, client starts seeing impact.
>>> 6. No IOwait observed on any of the virtual cores, sjk ttop command
>>> shows max us% being used by “Worker Threads”
>>>
>>> I have been trying hard to zero in on the exact issue.
>>> What I make of the above observations is that there might be some slow
>>> queries which get stuck on a few nodes.
>>> Then there is a cascading effect wherein other queries get queued up.
>>> I have been unable to identify any such slow queries so far.
>>> As I mentioned, there are large partitions. We are using the size-tiered
>>> compaction strategy, so a large partition might be spread across
>>> multiple sstables.
>>> Can this lead to slow queries? I also understand that data in sstables is
>>> stored in serialized form and is deserialized when read into memory. This
>>> results in a large object in memory which then
>>> needs to be transferred across the wire to the client.

Re: ReadStage filling up and leading to Read Timeouts

2019-02-05 Thread Jeff Jirsa

https://docs.datastax.com/en/developer/java-driver/3.2/manual/paging/
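
For example, with the 3.x Java driver the page size can be set globally via QueryOptions or per statement; a minimal sketch (the contact point, keyspace, table, and page sizes below are illustrative):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class PagingExample {
    public static void main(String[] args) {
        // Cluster-wide default page size (the driver default is 5000 rows per page).
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.1")
                .withQueryOptions(new QueryOptions().setFetchSize(500))
                .build();
        Session session = cluster.connect("my_keyspace");

        // Smaller page size for a query known to hit a wide partition.
        Statement stmt = new SimpleStatement(
                "SELECT * FROM my_table WHERE id = ?", "some-id")
                .setFetchSize(100);

        ResultSet rs = session.execute(stmt);
        for (Row row : rs) {
            // The driver fetches the next page transparently while iterating,
            // so only about one page of rows is materialized at a time.
        }
        cluster.close();
    }
}

A smaller page size means more round trips, but smaller responses to materialize on both the coordinator and the client.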


-- 
Jeff Jirsa


> On Feb 5, 2019, at 11:33 PM, Rajsekhar Mallick  
> wrote:
> 
> Hello Jeff,
> 
> Thanks for the reply.
> We do have GC logs enabled.
> We do observe GC pauses of up to 2 seconds, but quite often we see this issue even
> when the GC log looks clean.
> 
> JVM Flags related to G1GC:
> 
> -Xms48G
> -Xmx48G
> -XX:MaxGCPauseMillis=200
> -XX:ParallelGCThreads=32
> -XX:ConcGCThreads=10
> -XX:InitiatingHeapOccupancyPercent=50
> 
> You talked about dropping application page size. Please do elaborate on how 
> to change the same.
> Reducing the concurrent reads to 32 does help as we have tried the same...the 
> cpu load average remains under threshold...but read timeout keeps on 
> happening.
> 
> We will definitely try increasing the key cache sizes after verifying the 
> current max heap usage in the cluster.
> 
> Thanks,
> Rajsekhar Mallick
> 
>> On Wed, 6 Feb, 2019, 11:17 AM Jeff Jirsa wrote:
>> What you're potentially seeing is the GC impact of reading a large partition
>> - do you have GC logs or StatusLogger output indicating you're pausing? What
>> are your actual JVM flags?
>> 
>> Given your heap size, the easiest mitigation may be significantly increasing 
>> your key cache size (up to a gigabyte or two, if needed).
>> 
>> Yes, when you read data, it's materialized in memory (iterators from each 
>> sstable are merged and sent to the client), so reading lots of rows from a 
>> wide partition can cause GC pressure just from materializing the responses. 
>> Dropping your application's paging size could help if this is the problem. 
>> 
>> You may be able to drop concurrent reads from 64 to something lower 
>> (potentially 48 or 32, given your core count) to mitigate GC impact from 
>> lots of objects when you have a lot of concurrent reads, or consider 
>> upgrading to 3.11.4 (when it's out) to take advantage of CASSANDRA-11206 
>> (which made reading wide partitions less expensive). STCS especially won't
>> help here - a large partition may be larger than you think, if it's spanning 
>> a lot of sstables. 
>> 
>> 
>> 
>> 
>>> On Tue, Feb 5, 2019 at 9:30 PM Rajsekhar Mallick  
>>> wrote:
>>> Hello Team,
>>> 
>>> Cluster Details:
>>> 1. Number of Nodes in cluster : 7
>>> 2. Number of CPU cores: 48
>>> 3. Swap is enabled on all nodes
>>> 4. Memory available on all nodes : 120GB 
>>> 5. Disk space available : 745GB
>>> 6. Cassandra version: 2.1
>>> 7. Active tables are using size-tiered compaction strategy
>>> 8. Read Throughput: 6000 reads/s on each node (42000 reads/s cluster wide)
>>> 9. Read latency 99%: 300 ms
>>> 10. Write Throughput : 1800 writes/s
>>> 11. Write Latency 99%: 50 ms
>>> 12. Known issues in the cluster ( Large Partitions(upto 560MB, observed 
>>> when they get compacted), tombstones)
>>> 13. To reduce the impact of tombstones, gc_grace_seconds set to 0 for the 
>>> active tables
>>> 14. Heap size: 48 GB G1GC
>>> 15. Read timeout : 5000ms , Write timeouts: 2000ms
>>> 16. Number of concurrent reads: 64
>>> 17. Number of connections from clients on port 9042 stays almost constant 
>>> (close to 1800)
>>> 18. Cassandra thread count also stays almost constant (close to 2000)
>>> 
>>> Problem Statement:
>>> 1. ReadStage often gets full (reaches max size 64) on 2 to 3 nodes and 
>>> pending reads go upto 4000.
>>> 2. When the above happens Native-Transport-Stage gets full on neighbouring 
>>> nodes(1024 max) and pending threads are also observed.
>>> 3. During this time, CPU load average rises, user % for Cassandra process 
>>> reaches 90%
>>> 4. We see Read getting dropped, org.apache.cassandra.transport package 
>>> errors of reads getting timeout is seen.
>>> 5. Read latency 99% reached 5seconds, client starts seeing impact.
>>> 6. No IOwait observed on any of the virtual cores, sjk ttop command shows 
>>> max us% being used by “Worker Threads”
>>> 
>>> I have been trying hard to zero in on the exact issue.
>>> What I make of the above observations is that there might be some slow
>>> queries which get stuck on a few nodes.
>>> Then there is a cascading effect wherein other queries get queued up.
>>> I have been unable to identify any such slow queries so far.
>>> As I mentioned, there are large partitions. We are using the size-tiered compaction
>>> strategy, so a large partition might be spread across multiple sstables.
>>> Can this lead to slow queries? I also understand that data in sstables is
>>> stored in serialized form and is deserialized when read into memory. This
>>> results in a large object in memory which then needs
>>> to be transferred across the wire to the client.
>>> 
>>> Not sure what might be the reason. Kindly help on helping me understand 
>>> what might be the impact on read performance when we have large partitions.
>>> Kindly Suggest ways to catch these slow queries.
>>> Also do add if you see any other issues from the above details
>>> We are now 

Re: ReadStage filling up and leading to Read Timeouts

2019-02-05 Thread Rajsekhar Mallick
Hello Jeff,

Thanks for the reply.
We do have GC logs enabled.
We do observe GC pauses of up to 2 seconds, but quite often we see this issue
even when the GC log looks clean.

JVM Flags related to G1GC:

-Xms48G
-Xmx48G
-XX:MaxGCPauseMillis=200
-XX:ParallelGCThreads=32
-XX:ConcGCThreads=10
-XX:InitiatingHeapOccupancyPercent=50

You talked about dropping the application page size. Please elaborate on how
to change it.
Reducing concurrent reads to 32 does help, as we have tried it: the CPU load
average stays under the threshold, but read timeouts keep happening.

We will definitely try increasing the key cache sizes after verifying the
current max heap usage in the cluster.

Thanks,
Rajsekhar Mallick

On Wed, 6 Feb, 2019, 11:17 AM Jeff Jirsa wrote:
> What you're potentially seeing is the GC impact of reading a large
> partition - do you have GC logs or StatusLogger output indicating you're
> pausing? What are your actual JVM flags?
>
> Given your heap size, the easiest mitigation may be significantly
> increasing your key cache size (up to a gigabyte or two, if needed).
>
> Yes, when you read data, it's materialized in memory (iterators from each
> sstable are merged and sent to the client), so reading lots of rows from a
> wide partition can cause GC pressure just from materializing the responses.
> Dropping your application's paging size could help if this is the problem.
>
> You may be able to drop concurrent reads from 64 to something lower
> (potentially 48 or 32, given your core count) to mitigate GC impact from
> lots of objects when you have a lot of concurrent reads, or consider
> upgrading to 3.11.4 (when it's out) to take advantage of CASSANDRA-11206
> (which made reading wide partitions less expensive). STCS especially won't
> help here - a large partition may be larger than you think, if it's
> spanning a lot of sstables.
>
>
>
>
> On Tue, Feb 5, 2019 at 9:30 PM Rajsekhar Mallick 
> wrote:
>
>> Hello Team,
>>
>> Cluster Details:
>> 1. Number of Nodes in cluster : 7
>> 2. Number of CPU cores: 48
>> 3. Swap is enabled on all nodes
>> 4. Memory available on all nodes : 120GB
>> 5. Disk space available : 745GB
>> 6. Cassandra version: 2.1
>> 7. Active tables are using size-tiered compaction strategy
>> 8. Read Throughput: 6000 reads/s on each node (42000 reads/s cluster wide)
>> 9. Read latency 99%: 300 ms
>> 10. Write Throughput : 1800 writes/s
>> 11. Write Latency 99%: 50 ms
>> 12. Known issues in the cluster ( Large Partitions(upto 560MB, observed
>> when they get compacted), tombstones)
>> 13. To reduce the impact of tombstones, gc_grace_seconds set to 0 for the
>> active tables
>> 14. Heap size: 48 GB G1GC
>> 15. Read timeout : 5000ms , Write timeouts: 2000ms
>> 16. Number of concurrent reads: 64
>> 17. Number of connections from clients on port 9042 stays almost constant
>> (close to 1800)
>> 18. Cassandra thread count also stays almost constant (close to 2000)
>>
>> Problem Statement:
>> 1. ReadStage often gets full (reaches max size 64) on 2 to 3 nodes and
>> pending reads go upto 4000.
>> 2. When the above happens Native-Transport-Stage gets full on
>> neighbouring nodes(1024 max) and pending threads are also observed.
>> 3. During this time, CPU load average rises, user % for Cassandra process
>> reaches 90%
>> 4. We see Read getting dropped, org.apache.cassandra.transport package
>> errors of reads getting timeout is seen.
>> 5. Read latency 99% reached 5seconds, client starts seeing impact.
>> 6. No IOwait observed on any of the virtual cores, sjk ttop command shows
>> max us% being used by “Worker Threads”
>>
>> I have been trying hard to zero in on the exact issue.
>> What I make of the above observations is that there might be some slow
>> queries which get stuck on a few nodes.
>> Then there is a cascading effect wherein other queries get queued up.
>> I have been unable to identify any such slow queries so far.
>> As I mentioned, there are large partitions. We are using the size-tiered
>> compaction strategy, so a large partition might be spread across
>> multiple sstables.
>> Can this lead to slow queries? I also understand that data in sstables is
>> stored in serialized form and is deserialized when read into memory. This
>> results in a large object in memory which then needs
>> to be transferred across the wire to the client.
>>
>> Not sure what the reason might be. Kindly help me understand
>> what the impact on read performance might be when we have large partitions.
>> Kindly suggest ways to catch these slow queries.
>> Also do add if you see any other issues in the above details.
>> We are now considering expanding our cluster. Is the cluster under-sized?
>> Will adding nodes help resolve the issue?
>>
>> Thanks,
>> Rajsekhar Mallick
>>
>>
>>
>>
>>

Re: ReadStage filling up and leading to Read Timeouts

2019-02-05 Thread Jeff Jirsa
What you're potentially seeing is the GC impact of reading a large
partition - do you have GC logs or StatusLogger output indicating you're
pausing? What are your actual JVM flags?

Given your heap size, the easiest mitigation may be significantly
increasing your key cache size (up to a gigabyte or two, if needed).

Yes, when you read data, it's materialized in memory (iterators from each
sstable are merged and sent to the client), so reading lots of rows from a
wide partition can cause GC pressure just from materializing the responses.
Dropping your application's paging size could help if this is the problem.

You may be able to drop concurrent reads from 64 to something lower
(potentially 48 or 32, given your core count) to mitigate GC impact from
lots of objects when you have a lot of concurrent reads, or consider
upgrading to 3.11.4 (when it's out) to take advantage of CASSANDRA-11206
(which made reading wide partitions less expensive). STCS especially won't
help here - a large partition may be larger than you think, if it's
spanning a lot of sstables.




On Tue, Feb 5, 2019 at 9:30 PM Rajsekhar Mallick 
wrote:

> Hello Team,
>
> Cluster Details:
> 1. Number of Nodes in cluster : 7
> 2. Number of CPU cores: 48
> 3. Swap is enabled on all nodes
> 4. Memory available on all nodes : 120GB
> 5. Disk space available : 745GB
> 6. Cassandra version: 2.1
> 7. Active tables are using size-tiered compaction strategy
> 8. Read Throughput: 6000 reads/s on each node (42000 reads/s cluster wide)
> 9. Read latency 99%: 300 ms
> 10. Write Throughput : 1800 writes/s
> 11. Write Latency 99%: 50 ms
> 12. Known issues in the cluster ( Large Partitions(upto 560MB, observed
> when they get compacted), tombstones)
> 13. To reduce the impact of tombstones, gc_grace_seconds set to 0 for the
> active tables
> 14. Heap size: 48 GB G1GC
> 15. Read timeout : 5000ms , Write timeouts: 2000ms
> 16. Number of concurrent reads: 64
> 17. Number of connections from clients on port 9042 stays almost constant
> (close to 1800)
> 18. Cassandra thread count also stays almost constant (close to 2000)
>
> Problem Statement:
> 1. ReadStage often gets full (reaches max size 64) on 2 to 3 nodes and
> pending reads go upto 4000.
> 2. When the above happens Native-Transport-Stage gets full on neighbouring
> nodes(1024 max) and pending threads are also observed.
> 3. During this time, CPU load average rises, user % for Cassandra process
> reaches 90%
> 4. We see Read getting dropped, org.apache.cassandra.transport package
> errors of reads getting timeout is seen.
> 5. Read latency 99% reached 5seconds, client starts seeing impact.
> 6. No IOwait observed on any of the virtual cores, sjk ttop command shows
> max us% being used by “Worker Threads”
>
> I have been trying hard to zero in on the exact issue.
> What I make of the above observations is that there might be some slow
> queries which get stuck on a few nodes.
> Then there is a cascading effect wherein other queries get queued up.
> I have been unable to identify any such slow queries so far.
> As I mentioned, there are large partitions. We are using the size-tiered
> compaction strategy, so a large partition might be spread across
> multiple sstables.
> Can this lead to slow queries? I also understand that data in sstables is
> stored in serialized form and is deserialized when read into memory. This
> results in a large object in memory which then needs
> to be transferred across the wire to the client.
>
> Not sure what the reason might be. Kindly help me understand
> what the impact on read performance might be when we have large partitions.
> Kindly suggest ways to catch these slow queries.
> Also do add if you see any other issues in the above details.
> We are now considering expanding our cluster. Is the cluster under-sized?
> Will adding nodes help resolve the issue?
>
> Thanks,
> Rajsekhar Mallick
>
>
>
>
>
>
>


ReadStage filling up and leading to Read Timeouts

2019-02-05 Thread Rajsekhar Mallick
Hello Team,

Cluster Details:
1. Number of Nodes in cluster : 7
2. Number of CPU cores: 48
3. Swap is enabled on all nodes
4. Memory available on all nodes : 120GB 
5. Disk space available : 745GB
6. Cassandra version: 2.1
7. Active tables are using size-tiered compaction strategy
8. Read Throughput: 6000 reads/s on each node (42000 reads/s cluster wide)
9. Read latency 99%: 300 ms
10. Write Throughput : 1800 writes/s
11. Write Latency 99%: 50 ms
12. Known issues in the cluster: large partitions (up to 560 MB, observed when
they get compacted) and tombstones
13. To reduce the impact of tombstones, gc_grace_seconds is set to 0 for the
active tables
14. Heap size: 48 GB G1GC
15. Read timeout : 5000ms , Write timeouts: 2000ms
16. Number of concurrent reads: 64
17. Number of connections from clients on port 9042 stays almost constant 
(close to 1800)
18. Cassandra thread count also stays almost constant (close to 2000)

Problem Statement:
1. ReadStage often gets full (reaches max size 64) on 2 to 3 nodes, and pending
reads go up to 4000.
2. When the above happens, Native-Transport-Stage gets full on neighbouring
nodes (1024 max) and pending threads are also observed.
3. During this time, the CPU load average rises and the user % for the Cassandra
process reaches 90%.
4. We see reads getting dropped, and errors from the
org.apache.cassandra.transport package about reads timing out.
5. Read latency 99% reaches 5 seconds, and clients start seeing the impact.
6. No IO wait is observed on any of the virtual cores; the sjk ttop command shows
the max us% being used by “Worker Threads”.

I have been trying hard to zero in on the exact issue.
What I make of the above observations is that there might be some slow
queries which get stuck on a few nodes.
Then there is a cascading effect wherein other queries get queued up.
I have been unable to identify any such slow queries so far.
As I mentioned, there are large partitions. We are using the size-tiered compaction
strategy, so a large partition might be spread across multiple sstables.
Can this lead to slow queries? I also understand that data in sstables is
stored in serialized form and is deserialized when read into memory.
This results in a large object in memory which then needs to be
transferred across the wire to the client.

Not sure what the reason might be. Kindly help me understand what
the impact on read performance might be when we have large partitions.
Kindly suggest ways to catch these slow queries (one application-side option is
sketched below).
Also do add if you see any other issues in the above details.
We are now considering expanding our cluster. Is the cluster under-sized? Will
adding nodes help resolve the issue?
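
One way to inspect an individual suspect query from the application side is the Java driver's per-statement tracing; a rough sketch (the contact point, table, and key below are made up):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.QueryTrace;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class TraceOneRead {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
        Session session = cluster.connect("my_keyspace");

        // Enable tracing on a single suspect read (only do this for a small
        // sample of requests; tracing every query adds overhead).
        Statement stmt = new SimpleStatement(
                "SELECT * FROM my_table WHERE id = 'some-id'").enableTracing();

        ResultSet rs = session.execute(stmt);
        QueryTrace trace = rs.getExecutionInfo().getQueryTrace();

        // Coordinator-side duration plus the per-step breakdown
        // (sstables touched, tombstones scanned, merge steps, ...).
        System.out.println("Duration (us): " + trace.getDurationMicros());
        for (QueryTrace.Event e : trace.getEvents()) {
            System.out.println(e.getSourceElapsedMicros() + " us  "
                    + e.getSource() + "  " + e.getDescription());
        }
        cluster.close();
    }
}

Server side, 'nodetool settraceprobability' can sample a small fraction of requests into the system_traces keyspace, which is another way to surface slow partitions.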

Thanks,
Rajsekhar Mallick








Re: Read timeouts when performing rolling restart

2018-09-18 Thread Riccardo Ferrari
 can be multiple things, but having
> an interactive view of the pending requests might lead you to the root
> cause of the issue.
>
> C*heers,
> ---
> Alain Rodriguez - @arodream - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On Thu, 13 Sep 2018 at 09:50, Riccardo Ferrari wrote:
>
>> Hi Shalom,
>>
>> It happens almost at every restart, either a single node or a rolling
>> one. I do agree with you that it is good, at least on my setup, to wait few
>> minutes to let the rebooted node to cool down before moving to the next.
>> The more I look at it the more I think is something coming from hint
>> dispatching, maybe I should try  something around hints throttling.
>>
>> Thanks!
>>
>> On Thu, Sep 13, 2018 at 8:55 AM, shalom sagges 
>> wrote:
>>
>>> Hi Riccardo,
>>>
>>> Does this issue occur when performing a single restart or after several
>>> restarts during a rolling restart (as mentioned in your original post)?
>>> We have a cluster that when performing a rolling restart, we prefer to
>>> wait ~10-15 minutes between each restart because we see an increase of GC
>>> for a few minutes.
>>> If we keep restarting the nodes quickly one after the other, the
>>> applications experience timeouts (probably due to GC and hints).
>>>
>>> Hope this helps!
>>>
>>> On Thu, Sep 13, 2018 at 2:20 AM Riccardo Ferrari 
>>> wrote:
>>>
>>>> A little update on the progress.
>>>>
>>>> First:
>>>> Thank you Thomas. I checked the code in the patch and briefly skimmed
>>>> through the 3.0.6 code. Yup it should be fixed.
>>>> Thank you Surbhi. At the moment we don't need authentication as the
>>>> instances are locked down.
>>>>
>>>> Now:
>>>> - Unfortunately the start_transport_native trick does not always work.
>>>> On some nodes works on other don't. What do I mean? I still experience
>>>> timeouts and dropped messages during startup.
>>>> - I realized that cutting the concurrent_compactors to 1 was not really
>>>> a good idea, minimum vlaue should be 2, currently testing 4 (that is the
>>>> min(n_cores, n_disks))
>>>> - After rising the compactors to 4 I still see some dropped messages
>>>> for HINT and MUTATIONS. This happens during startup. Reason is "for
>>>> internal timeout". Maybe too many compactors?
>>>>
>>>> Thanks!
>>>>
>>>>
>>>> On Wed, Sep 12, 2018 at 7:09 PM, Surbhi Gupta >>> > wrote:
>>>>
>>>>> Another thing to notice is :
>>>>>
>>>>> system_auth WITH replication = {'class': 'SimpleStrategy',
>>>>> 'replication_factor': '1'}
>>>>>
>>>>> system_auth has a replication factor of 1 and even if one node is down
>>>>> it may impact the system because of the replication factor.
>>>>>
>>>>>
>>>>>
>>>>> On Wed, 12 Sep 2018 at 09:46, Steinmaurer, Thomas <
>>>>> thomas.steinmau...@dynatrace.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>>
>>>>>> I remember something that a client using the native protocol gets
>>>>>> notified too early by Cassandra being ready due to the following issue:
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-8236
>>>>>>
>>>>>>
>>>>>>
>>>>>> which looks similar, but above was marked as fixed in 2.2.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thomas
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* Riccardo Ferrari 
>>>>>> *Sent:* Mittwoch, 12. September 2018 18:25
>>>>>> *To:* user@cassandra.apache.org
>>>>>> *Subject:* Re: Read timeouts when performing rolling restart
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi Alain,
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thank you for chiming in!
>>>>>>
>>>>>>
>>>>>>
>>>>>> I was thinking to perform the 'start_native_transp

Re: Read timeouts when performing rolling restart

2018-09-14 Thread Alain RODRIGUEZ
 What do I mean? I still experience
>>> timeouts and dropped messages during startup.
>>> - I realized that cutting the concurrent_compactors to 1 was not really
>>> a good idea, minimum vlaue should be 2, currently testing 4 (that is the
>>> min(n_cores, n_disks))
>>> - After rising the compactors to 4 I still see some dropped messages for
>>> HINT and MUTATIONS. This happens during startup. Reason is "for internal
>>> timeout". Maybe too many compactors?
>>>
>>> Thanks!
>>>
>>>
>>> On Wed, Sep 12, 2018 at 7:09 PM, Surbhi Gupta 
>>> wrote:
>>>
>>>> Another thing to notice is :
>>>>
>>>> system_auth WITH replication = {'class': 'SimpleStrategy',
>>>> 'replication_factor': '1'}
>>>>
>>>> system_auth has a replication factor of 1 and even if one node is down
>>>> it may impact the system because of the replication factor.
>>>>
>>>>
>>>>
>>>> On Wed, 12 Sep 2018 at 09:46, Steinmaurer, Thomas <
>>>> thomas.steinmau...@dynatrace.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>>
>>>>> I remember something that a client using the native protocol gets
>>>>> notified too early by Cassandra being ready due to the following issue:
>>>>>
>>>>> https://issues.apache.org/jira/browse/CASSANDRA-8236
>>>>>
>>>>>
>>>>>
>>>>> which looks similar, but above was marked as fixed in 2.2.
>>>>>
>>>>>
>>>>>
>>>>> Thomas
>>>>>
>>>>>
>>>>>
>>>>> *From:* Riccardo Ferrari 
>>>>> *Sent:* Mittwoch, 12. September 2018 18:25
>>>>> *To:* user@cassandra.apache.org
>>>>> *Subject:* Re: Read timeouts when performing rolling restart
>>>>>
>>>>>
>>>>>
>>>>> Hi Alain,
>>>>>
>>>>>
>>>>>
>>>>> Thank you for chiming in!
>>>>>
>>>>>
>>>>>
>>>>> I was thinking to perform the 'start_native_transport=false' test as
>>>>> well and indeed the issue is not showing up. Starting the/a node with
>>>>> native transport disabled and letting it cool down lead to no timeout
>>>>> exceptions no dropped messages, simply a crystal clean startup. Agreed it
>>>>> is a workaround
>>>>>
>>>>>
>>>>>
>>>>> # About upgrading:
>>>>>
>>>>> Yes, I desperately want to upgrade despite is a long and slow task.
>>>>> Just reviewing all the changes from 3.0.6 to 3.0.17
>>>>> is going to be a huge pain, top of your head, any breaking change I
>>>>> should absolutely take care of reviewing ?
>>>>>
>>>>>
>>>>>
>>>>> # describecluster output: YES they agree on the same schema version
>>>>>
>>>>>
>>>>>
>>>>> # keyspaces:
>>>>>
>>>>> system WITH replication = {'class': 'LocalStrategy'}
>>>>>
>>>>> system_schema WITH replication = {'class': 'LocalStrategy'}
>>>>>
>>>>> system_auth WITH replication = {'class': 'SimpleStrategy',
>>>>> 'replication_factor': '1'}
>>>>>
>>>>> system_distributed WITH replication = {'class': 'SimpleStrategy',
>>>>> 'replication_factor': '3'}
>>>>>
>>>>> system_traces WITH replication = {'class': 'SimpleStrategy',
>>>>> 'replication_factor': '2'}
>>>>>
>>>>>
>>>>>
>>>>>  WITH replication = {'class': 'SimpleStrategy',
>>>>> 'replication_factor': '3'}
>>>>>
>>>>>   WITH replication = {'class': 'SimpleStrategy',
>>>>> 'replication_factor': '3'}
>>>>>
>>>>>
>>>>>
>>>>> # Snitch
>>>>>
>>>>> Ec2Snitch
>>>>>
>>>>>
>>>>>
>>>>> ## About Snitch and replication:
>>>>>
>>>>> - We have the default DC and all nodes are in the same RACK
>>>>>
>>>>> - We are planning to move to GossipingPropertyFileSnitch configuring
>>>>> the cassandra

Re: Read timeouts when performing rolling restart

2018-09-13 Thread Riccardo Ferrari
Hi Shalom,

It happens at almost every restart, whether a single node or a rolling one.
I do agree with you that it is good, at least on my setup, to wait a few
minutes to let the rebooted node cool down before moving to the next.
The more I look at it, the more I think it is something coming from hint
dispatching; maybe I should try something around hint throttling.

Thanks!

On Thu, Sep 13, 2018 at 8:55 AM, shalom sagges 
wrote:

> Hi Riccardo,
>
> Does this issue occur when performing a single restart or after several
> restarts during a rolling restart (as mentioned in your original post)?
> We have a cluster that when performing a rolling restart, we prefer to
> wait ~10-15 minutes between each restart because we see an increase of GC
> for a few minutes.
> If we keep restarting the nodes quickly one after the other, the
> applications experience timeouts (probably due to GC and hints).
>
> Hope this helps!
>
> On Thu, Sep 13, 2018 at 2:20 AM Riccardo Ferrari 
> wrote:
>
>> A little update on the progress.
>>
>> First:
>> Thank you Thomas. I checked the code in the patch and briefly skimmed
>> through the 3.0.6 code. Yup it should be fixed.
>> Thank you Surbhi. At the moment we don't need authentication as the
>> instances are locked down.
>>
>> Now:
>> - Unfortunately the start_transport_native trick does not always work. On
>> some nodes works on other don't. What do I mean? I still experience
>> timeouts and dropped messages during startup.
>> - I realized that cutting the concurrent_compactors to 1 was not really a
>> good idea, minimum vlaue should be 2, currently testing 4 (that is the
>> min(n_cores, n_disks))
>> - After rising the compactors to 4 I still see some dropped messages for
>> HINT and MUTATIONS. This happens during startup. Reason is "for internal
>> timeout". Maybe too many compactors?
>>
>> Thanks!
>>
>>
>> On Wed, Sep 12, 2018 at 7:09 PM, Surbhi Gupta 
>> wrote:
>>
>>> Another thing to notice is :
>>>
>>> system_auth WITH replication = {'class': 'SimpleStrategy',
>>> 'replication_factor': '1'}
>>>
>>> system_auth has a replication factor of 1 and even if one node is down
>>> it may impact the system because of the replication factor.
>>>
>>>
>>>
>>> On Wed, 12 Sep 2018 at 09:46, Steinmaurer, Thomas <
>>> thomas.steinmau...@dynatrace.com> wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> I remember something that a client using the native protocol gets
>>>> notified too early by Cassandra being ready due to the following issue:
>>>>
>>>> https://issues.apache.org/jira/browse/CASSANDRA-8236
>>>>
>>>>
>>>>
>>>> which looks similar, but above was marked as fixed in 2.2.
>>>>
>>>>
>>>>
>>>> Thomas
>>>>
>>>>
>>>>
>>>> *From:* Riccardo Ferrari 
>>>> *Sent:* Mittwoch, 12. September 2018 18:25
>>>> *To:* user@cassandra.apache.org
>>>> *Subject:* Re: Read timeouts when performing rolling restart
>>>>
>>>>
>>>>
>>>> Hi Alain,
>>>>
>>>>
>>>>
>>>> Thank you for chiming in!
>>>>
>>>>
>>>>
>>>> I was thinking to perform the 'start_native_transport=false' test as
>>>> well and indeed the issue is not showing up. Starting the/a node with
>>>> native transport disabled and letting it cool down lead to no timeout
>>>> exceptions no dropped messages, simply a crystal clean startup. Agreed it
>>>> is a workaround
>>>>
>>>>
>>>>
>>>> # About upgrading:
>>>>
>>>> Yes, I desperately want to upgrade despite is a long and slow task.
>>>> Just reviewing all the changes from 3.0.6 to 3.0.17
>>>> is going to be a huge pain, top of your head, any breaking change I
>>>> should absolutely take care of reviewing ?
>>>>
>>>>
>>>>
>>>> # describecluster output: YES they agree on the same schema version
>>>>
>>>>
>>>>
>>>> # keyspaces:
>>>>
>>>> system WITH replication = {'class': 'LocalStrategy'}
>>>>
>>>> system_schema WITH replication = {'class': 'LocalStrategy'}
>>>>
>>>> system_auth WITH replication = {'class': 'SimpleStrategy',
>>>> 'replication_f

Re: Read timeouts when performing rolling restart

2018-09-13 Thread shalom sagges
Hi Riccardo,

Does this issue occur when performing a single restart or after several
restarts during a rolling restart (as mentioned in your original post)?
We have a cluster where, when performing a rolling restart, we prefer to wait
~10-15 minutes between restarts because we see an increase in GC for a
few minutes.
If we keep restarting the nodes quickly one after the other, the
applications experience timeouts (probably due to GC and hints).

Hope this helps!

On Thu, Sep 13, 2018 at 2:20 AM Riccardo Ferrari  wrote:

> A little update on the progress.
>
> First:
> Thank you Thomas. I checked the code in the patch and briefly skimmed
> through the 3.0.6 code. Yup it should be fixed.
> Thank you Surbhi. At the moment we don't need authentication as the
> instances are locked down.
>
> Now:
> - Unfortunately the start_transport_native trick does not always work. On
> some nodes works on other don't. What do I mean? I still experience
> timeouts and dropped messages during startup.
> - I realized that cutting the concurrent_compactors to 1 was not really a
> good idea, minimum vlaue should be 2, currently testing 4 (that is the
> min(n_cores, n_disks))
> - After rising the compactors to 4 I still see some dropped messages for
> HINT and MUTATIONS. This happens during startup. Reason is "for internal
> timeout". Maybe too many compactors?
>
> Thanks!
>
>
> On Wed, Sep 12, 2018 at 7:09 PM, Surbhi Gupta 
> wrote:
>
>> Another thing to notice is :
>>
>> system_auth WITH replication = {'class': 'SimpleStrategy',
>> 'replication_factor': '1'}
>>
>> system_auth has a replication factor of 1 and even if one node is down it
>> may impact the system because of the replication factor.
>>
>>
>>
>> On Wed, 12 Sep 2018 at 09:46, Steinmaurer, Thomas <
>> thomas.steinmau...@dynatrace.com> wrote:
>>
>>> Hi,
>>>
>>>
>>>
>>> I remember something that a client using the native protocol gets
>>> notified too early by Cassandra being ready due to the following issue:
>>>
>>> https://issues.apache.org/jira/browse/CASSANDRA-8236
>>>
>>>
>>>
>>> which looks similar, but above was marked as fixed in 2.2.
>>>
>>>
>>>
>>> Thomas
>>>
>>>
>>>
>>> *From:* Riccardo Ferrari 
>>> *Sent:* Mittwoch, 12. September 2018 18:25
>>> *To:* user@cassandra.apache.org
>>> *Subject:* Re: Read timeouts when performing rolling restart
>>>
>>>
>>>
>>> Hi Alain,
>>>
>>>
>>>
>>> Thank you for chiming in!
>>>
>>>
>>>
>>> I was thinking to perform the 'start_native_transport=false' test as
>>> well and indeed the issue is not showing up. Starting the/a node with
>>> native transport disabled and letting it cool down lead to no timeout
>>> exceptions no dropped messages, simply a crystal clean startup. Agreed it
>>> is a workaround
>>>
>>>
>>>
>>> # About upgrading:
>>>
>>> Yes, I desperately want to upgrade despite is a long and slow task. Just
>>> reviewing all the changes from 3.0.6 to 3.0.17
>>> is going to be a huge pain, top of your head, any breaking change I
>>> should absolutely take care of reviewing ?
>>>
>>>
>>>
>>> # describecluster output: YES they agree on the same schema version
>>>
>>>
>>>
>>> # keyspaces:
>>>
>>> system WITH replication = {'class': 'LocalStrategy'}
>>>
>>> system_schema WITH replication = {'class': 'LocalStrategy'}
>>>
>>> system_auth WITH replication = {'class': 'SimpleStrategy',
>>> 'replication_factor': '1'}
>>>
>>> system_distributed WITH replication = {'class': 'SimpleStrategy',
>>> 'replication_factor': '3'}
>>>
>>> system_traces WITH replication = {'class': 'SimpleStrategy',
>>> 'replication_factor': '2'}
>>>
>>>
>>>
>>>  WITH replication = {'class': 'SimpleStrategy',
>>> 'replication_factor': '3'}
>>>
>>>   WITH replication = {'class': 'SimpleStrategy',
>>> 'replication_factor': '3'}
>>>
>>>
>>>
>>> # Snitch
>>>
>>> Ec2Snitch
>>>
>>>
>>>
>>> ## About Snitch and replication:
>>>
>>> - We have the default DC and all nodes are in the same RACK
>>>
>>> - We are planning to move to GossipingPropertyFileSnitch configuring the
>>> cassand

Re: Read timeouts when performing rolling restart

2018-09-12 Thread Riccardo Ferrari
A little update on the progress.

First:
Thank you Thomas. I checked the code in the patch and briefly skimmed
through the 3.0.6 code. Yup it should be fixed.
Thank you Surbhi. At the moment we don't need authentication as the
instances are locked down.

Now:
- Unfortunately the start_native_transport trick does not always work. On
some nodes it works, on others it doesn't. What do I mean? I still experience
timeouts and dropped messages during startup.
- I realized that cutting concurrent_compactors to 1 was not really a
good idea; the minimum value should be 2. I am currently testing 4, that is
min(n_cores, n_disks).
- After raising the compactors to 4 I still see some dropped messages for
HINT and MUTATIONS. This happens during startup. The reason is "for internal
timeout". Maybe too many compactors?

Thanks!


On Wed, Sep 12, 2018 at 7:09 PM, Surbhi Gupta 
wrote:

> Another thing to notice is :
>
> system_auth WITH replication = {'class': 'SimpleStrategy',
> 'replication_factor': '1'}
>
> system_auth has a replication factor of 1 and even if one node is down it
> may impact the system because of the replication factor.
>
>
>
> On Wed, 12 Sep 2018 at 09:46, Steinmaurer, Thomas <
> thomas.steinmau...@dynatrace.com> wrote:
>
>> Hi,
>>
>>
>>
>> I remember something that a client using the native protocol gets
>> notified too early by Cassandra being ready due to the following issue:
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-8236
>>
>>
>>
>> which looks similar, but above was marked as fixed in 2.2.
>>
>>
>>
>> Thomas
>>
>>
>>
>> *From:* Riccardo Ferrari 
>> *Sent:* Mittwoch, 12. September 2018 18:25
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Read timeouts when performing rolling restart
>>
>>
>>
>> Hi Alain,
>>
>>
>>
>> Thank you for chiming in!
>>
>>
>>
>> I was thinking to perform the 'start_native_transport=false' test as well
>> and indeed the issue is not showing up. Starting the/a node with native
>> transport disabled and letting it cool down lead to no timeout exceptions
>> no dropped messages, simply a crystal clean startup. Agreed it is a
>> workaround
>>
>>
>>
>> # About upgrading:
>>
>> Yes, I desperately want to upgrade despite is a long and slow task. Just
>> reviewing all the changes from 3.0.6 to 3.0.17
>> is going to be a huge pain, top of your head, any breaking change I
>> should absolutely take care of reviewing ?
>>
>>
>>
>> # describecluster output: YES they agree on the same schema version
>>
>>
>>
>> # keyspaces:
>>
>> system WITH replication = {'class': 'LocalStrategy'}
>>
>> system_schema WITH replication = {'class': 'LocalStrategy'}
>>
>> system_auth WITH replication = {'class': 'SimpleStrategy',
>> 'replication_factor': '1'}
>>
>> system_distributed WITH replication = {'class': 'SimpleStrategy',
>> 'replication_factor': '3'}
>>
>> system_traces WITH replication = {'class': 'SimpleStrategy',
>> 'replication_factor': '2'}
>>
>>
>>
>>  WITH replication = {'class': 'SimpleStrategy',
>> 'replication_factor': '3'}
>>
>>   WITH replication = {'class': 'SimpleStrategy',
>> 'replication_factor': '3'}
>>
>>
>>
>> # Snitch
>>
>> Ec2Snitch
>>
>>
>>
>> ## About Snitch and replication:
>>
>> - We have the default DC and all nodes are in the same RACK
>>
>> - We are planning to move to GossipingPropertyFileSnitch configuring the
>> cassandra-rackdc accortingly.
>>
>> -- This should be a transparent change, correct?
>>
>>
>>
>> - Once switched to GPFS, we plan to move to 'NetworkTopologyStrategy'
>> with 'us-' DC and replica counts as before
>>
>> - Then adding a new DC inside the VPC, but this is another story...
>>
>>
>>
>> Any concerns here ?
>>
>>
>>
>> # nodetool status 
>>
>> --  Address Load   Tokens   Owns (effective)  Host
>> ID   Rack
>> UN  10.x.x.a  177 GB 256  50.3%
>> d8bfe4ad-8138-41fe-89a4-ee9a043095b5  rr
>> UN  10.x.x.b152.46 GB  256  51.8%
>> 7888c077-346b-4e09-96b0-9f6376b8594f  rr
>> UN  10.x.x.c   159.59 GB  256  49.0%
>> 329b288e-c5b5-4b55-b75e-fbe9243e75fa  rr
>> UN  10.x.x.d  162.44 GB  256  49.3%
>> 07038c11-d200-46a0-9f6a-6e2465580fb1  rr
>> UN  10.x.x.e174.9 GB   256  50.5%
>> c35b5d51-2d14

Re: Read timeouts when performing rolling restart

2018-09-12 Thread Surbhi Gupta
Another thing to notice is :

system_auth WITH replication = {'class': 'SimpleStrategy',
'replication_factor': '1'}

system_auth has a replication factor of 1, so even a single node being down
may impact the system.



On Wed, 12 Sep 2018 at 09:46, Steinmaurer, Thomas <
thomas.steinmau...@dynatrace.com> wrote:

> Hi,
>
>
>
> I remember something that a client using the native protocol gets notified
> too early by Cassandra being ready due to the following issue:
>
> https://issues.apache.org/jira/browse/CASSANDRA-8236
>
>
>
> which looks similar, but above was marked as fixed in 2.2.
>
>
>
> Thomas
>
>
>
> *From:* Riccardo Ferrari 
> *Sent:* Mittwoch, 12. September 2018 18:25
> *To:* user@cassandra.apache.org
> *Subject:* Re: Read timeouts when performing rolling restart
>
>
>
> Hi Alain,
>
>
>
> Thank you for chiming in!
>
>
>
> I was thinking to perform the 'start_native_transport=false' test as well
> and indeed the issue is not showing up. Starting the/a node with native
> transport disabled and letting it cool down lead to no timeout exceptions
> no dropped messages, simply a crystal clean startup. Agreed it is a
> workaround
>
>
>
> # About upgrading:
>
> Yes, I desperately want to upgrade despite is a long and slow task. Just
> reviewing all the changes from 3.0.6 to 3.0.17
> is going to be a huge pain, top of your head, any breaking change I should
> absolutely take care of reviewing ?
>
>
>
> # describecluster output: YES they agree on the same schema version
>
>
>
> # keyspaces:
>
> system WITH replication = {'class': 'LocalStrategy'}
>
> system_schema WITH replication = {'class': 'LocalStrategy'}
>
> system_auth WITH replication = {'class': 'SimpleStrategy',
> 'replication_factor': '1'}
>
> system_distributed WITH replication = {'class': 'SimpleStrategy',
> 'replication_factor': '3'}
>
> system_traces WITH replication = {'class': 'SimpleStrategy',
> 'replication_factor': '2'}
>
>
>
>  WITH replication = {'class': 'SimpleStrategy',
> 'replication_factor': '3'}
>
>   WITH replication = {'class': 'SimpleStrategy',
> 'replication_factor': '3'}
>
>
>
> # Snitch
>
> Ec2Snitch
>
>
>
> ## About Snitch and replication:
>
> - We have the default DC and all nodes are in the same RACK
>
> - We are planning to move to GossipingPropertyFileSnitch configuring the
> cassandra-rackdc accortingly.
>
> -- This should be a transparent change, correct?
>
>
>
> - Once switched to GPFS, we plan to move to 'NetworkTopologyStrategy' with
> 'us-' DC and replica counts as before
>
> - Then adding a new DC inside the VPC, but this is another story...
>
>
>
> Any concerns here ?
>
>
>
> # nodetool status 
>
> --  Address Load   Tokens   Owns (effective)  Host
> ID   Rack
> UN  10.x.x.a  177 GB 256  50.3%
> d8bfe4ad-8138-41fe-89a4-ee9a043095b5  rr
> UN  10.x.x.b152.46 GB  256  51.8%
> 7888c077-346b-4e09-96b0-9f6376b8594f  rr
> UN  10.x.x.c   159.59 GB  256  49.0%
> 329b288e-c5b5-4b55-b75e-fbe9243e75fa  rr
> UN  10.x.x.d  162.44 GB  256  49.3%
> 07038c11-d200-46a0-9f6a-6e2465580fb1  rr
> UN  10.x.x.e174.9 GB   256  50.5%
> c35b5d51-2d14-4334-9ffc-726f9dd8a214  rr
> UN  10.x.x.f  194.71 GB  256  49.2%
> f20f7a87-d5d2-4f38-a963-21e24167b8ac  rr
>
>
>
> # gossipinfo
>
> /10.x.x.a
>   STATUS:827:NORMAL,-1350078789194251746
>   LOAD:289986:1.90078037902E11
>   SCHEMA:281088:af4461c3-d269-39bc-9d03-3566031c1e0a
>   DC:6:
>   RACK:8:rr
>   RELEASE_VERSION:4:3.0.6
>   SEVERITY:290040:0.5934718251228333
>   NET_VERSION:1:10
>   HOST_ID:2:d8bfe4ad-8138-41fe-89a4-ee9a043095b5
>   RPC_READY:868:true
>   TOKENS:826:
> /10.x.x.b
>   STATUS:16:NORMAL,-1023229528754013265
>   LOAD:7113:1.63730480619E11
>   SCHEMA:10:af4461c3-d269-39bc-9d03-3566031c1e0a
>   DC:6:
>   RACK:8:rr
>   RELEASE_VERSION:4:3.0.6
>   SEVERITY:7274:0.5988024473190308
>   NET_VERSION:1:10
>   HOST_ID:2:7888c077-346b-4e09-96b0-9f6376b8594f
>   TOKENS:15:
> /10.x.x.c
>   STATUS:732:NORMAL,-111717275923547
>   LOAD:245839:1.71409806942E11
>   SCHEMA:237168:af4461c3-d269-39bc-9d03-3566031c1e0a
>   DC:6:
>   RACK:8:rr
>   RELEASE_VERSION:4:3.0.6
>   SEVERITY:245989:0.0
>   NET_VERSION:1:10
>   HOST_ID:2:329b288e-c5b5-4b55-b75e-fbe9243e75fa
>   RPC_READY:763:true
>   TOKENS:731:
> /10.x.x.d
>   STATUS:14:NORMAL,-1004942496246544417
>   LOAD:313125:1.74447964917E11
>   SCHEMA:304268:af4461c3-d269-39bc-9d03-3566031c1e0a
>   DC:6:

RE: Read timeouts when performing rolling restart

2018-09-12 Thread Steinmaurer, Thomas
Hi,

I remember an issue where a client using the native protocol gets notified too
early that Cassandra is ready:
https://issues.apache.org/jira/browse/CASSANDRA-8236

which looks similar, but above was marked as fixed in 2.2.

Thomas

From: Riccardo Ferrari 
Sent: Mittwoch, 12. September 2018 18:25
To: user@cassandra.apache.org
Subject: Re: Read timeouts when performing rolling restart

Hi Alain,

Thank you for chiming in!

I was thinking to perform the 'start_native_transport=false' test as well and 
indeed the issue is not showing up. Starting the/a node with native transport 
disabled and letting it cool down lead to no timeout exceptions no dropped 
messages, simply a crystal clean startup. Agreed it is a workaround

# About upgrading:
Yes, I desperately want to upgrade despite is a long and slow task. Just 
reviewing all the changes from 3.0.6 to 3.0.17
is going to be a huge pain, top of your head, any breaking change I should 
absolutely take care of reviewing ?

# describecluster output: YES they agree on the same schema version

# keyspaces:
system WITH replication = {'class': 'LocalStrategy'}
system_schema WITH replication = {'class': 'LocalStrategy'}
system_auth WITH replication = {'class': 'SimpleStrategy', 
'replication_factor': '1'}
system_distributed WITH replication = {'class': 'SimpleStrategy', 
'replication_factor': '3'}
system_traces WITH replication = {'class': 'SimpleStrategy', 
'replication_factor': '2'}

 WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 
'3'}
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 
'3'}

# Snitch
Ec2Snitch

## About Snitch and replication:
- We have the default DC and all nodes are in the same RACK
- We are planning to move to GossipingPropertyFileSnitch configuring the 
cassandra-rackdc accortingly.
-- This should be a transparent change, correct?

- Once switched to GPFS, we plan to move to 'NetworkTopologyStrategy' with 
'us-' DC and replica counts as before
- Then adding a new DC inside the VPC, but this is another story...

Any concerns here ?

# nodetool status 
--  Address Load   Tokens   Owns (effective)  Host ID   
Rack
UN  10.x.x.a  177 GB 256  50.3% 
d8bfe4ad-8138-41fe-89a4-ee9a043095b5  rr
UN  10.x.x.b152.46 GB  256  51.8% 
7888c077-346b-4e09-96b0-9f6376b8594f  rr
UN  10.x.x.c   159.59 GB  256  49.0% 
329b288e-c5b5-4b55-b75e-fbe9243e75fa  rr
UN  10.x.x.d  162.44 GB  256  49.3% 
07038c11-d200-46a0-9f6a-6e2465580fb1  rr
UN  10.x.x.e174.9 GB   256  50.5% 
c35b5d51-2d14-4334-9ffc-726f9dd8a214  rr
UN  10.x.x.f  194.71 GB  256  49.2% 
f20f7a87-d5d2-4f38-a963-21e24167b8ac  rr

# gossipinfo
/10.x.x.a
  STATUS:827:NORMAL,-1350078789194251746
  LOAD:289986:1.90078037902E11
  SCHEMA:281088:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:290040:0.5934718251228333
  NET_VERSION:1:10
  HOST_ID:2:d8bfe4ad-8138-41fe-89a4-ee9a043095b5
  RPC_READY:868:true
  TOKENS:826:
/10.x.x.b
  STATUS:16:NORMAL,-1023229528754013265
  LOAD:7113:1.63730480619E11
  SCHEMA:10:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:7274:0.5988024473190308
  NET_VERSION:1:10
  HOST_ID:2:7888c077-346b-4e09-96b0-9f6376b8594f
  TOKENS:15:
/10.x.x.c
  STATUS:732:NORMAL,-111717275923547
  LOAD:245839:1.71409806942E11
  SCHEMA:237168:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:245989:0.0
  NET_VERSION:1:10
  HOST_ID:2:329b288e-c5b5-4b55-b75e-fbe9243e75fa
  RPC_READY:763:true
  TOKENS:731:
/10.x.x.d
  STATUS:14:NORMAL,-1004942496246544417
  LOAD:313125:1.74447964917E11
  SCHEMA:304268:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:313215:0.25641027092933655
  NET_VERSION:1:10
  HOST_ID:2:07038c11-d200-46a0-9f6a-6e2465580fb1
  RPC_READY:56:true
  TOKENS:13:
/10.x.x.e
  STATUS:520:NORMAL,-1058809960483771749
  LOAD:276118:1.87831573032E11
  SCHEMA:267327:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:276217:0.32786884903907776
  NET_VERSION:1:10
  HOST_ID:2:c35b5d51-2d14-4334-9ffc-726f9dd8a214
  RPC_READY:550:true
  TOKENS:519:
/10.x.x.f
  STATUS:1081:NORMAL,-1039671799603495012
  LOAD:239114:2.09082017545E11
  SCHEMA:230229:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:239180:0.5665722489356995
  NET_VERSION:1:10
  HOST_ID:2:f20f7a87-d5d2-4f38-a963-21e24167b8ac
  RPC_READY:1118:true
  TOKENS:1080:

## About load and tokens:
- While load is pretty even this does not apply to tokens, I guess we have some 
table with uneven distribution. This should not be the case for high load 
tabels as partition keys are are build with some 'id + '
- I was not able to find some

Re: Read timeouts when performing rolling restart

2018-09-12 Thread Riccardo Ferrari
Hi Alain,

Thank you for chiming in!

I was thinking of performing the 'start_native_transport=false' test as well,
and indeed the issue is not showing up. Starting a node with native
transport disabled and letting it cool down leads to no timeout exceptions and
no dropped messages, simply a crystal-clean startup. Agreed, it is a
workaround.

# About upgrading:
Yes, I desperately want to upgrade, despite it being a long and slow task. Just
reviewing all the changes from 3.0.6 to 3.0.17
is going to be a huge pain. Off the top of your head, is there any breaking
change I should absolutely take care of reviewing?

# describecluster output: YES they agree on the same schema version

# keyspaces:
system WITH replication = {'class': 'LocalStrategy'}
system_schema WITH replication = {'class': 'LocalStrategy'}
system_auth WITH replication = {'class': 'SimpleStrategy',
'replication_factor': '1'}
system_distributed WITH replication = {'class': 'SimpleStrategy',
'replication_factor': '3'}
system_traces WITH replication = {'class': 'SimpleStrategy',
'replication_factor': '2'}

 WITH replication = {'class': 'SimpleStrategy',
'replication_factor': '3'}
  WITH replication = {'class': 'SimpleStrategy',
'replication_factor': '3'}

# Snitch
Ec2Snitch

## About Snitch and replication:
- We have the default DC and all nodes are in the same RACK
- We are planning to move to GossipingPropertyFileSnitch, configuring
cassandra-rackdc.properties accordingly.
-- This should be a transparent change, correct?

- Once switched to GPFS, we plan to move to 'NetworkTopologyStrategy' with
'us-' DC and replica counts as before
- Then adding a new DC inside the VPC, but this is another story...

Any concerns here ?

# nodetool status 
--  Address Load   Tokens   Owns (effective)  Host
ID   Rack
UN  10.x.x.a  177 GB 256  50.3%
d8bfe4ad-8138-41fe-89a4-ee9a043095b5  rr
UN  10.x.x.b152.46 GB  256  51.8%
7888c077-346b-4e09-96b0-9f6376b8594f  rr
UN  10.x.x.c   159.59 GB  256  49.0%
329b288e-c5b5-4b55-b75e-fbe9243e75fa  rr
UN  10.x.x.d  162.44 GB  256  49.3%
07038c11-d200-46a0-9f6a-6e2465580fb1  rr
UN  10.x.x.e174.9 GB   256  50.5%
c35b5d51-2d14-4334-9ffc-726f9dd8a214  rr
UN  10.x.x.f  194.71 GB  256  49.2%
f20f7a87-d5d2-4f38-a963-21e24167b8ac  rr

# gossipinfo
/10.x.x.a
  STATUS:827:NORMAL,-1350078789194251746
  LOAD:289986:1.90078037902E11
  SCHEMA:281088:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:290040:0.5934718251228333
  NET_VERSION:1:10
  HOST_ID:2:d8bfe4ad-8138-41fe-89a4-ee9a043095b5
  RPC_READY:868:true
  TOKENS:826:
/10.x.x.b
  STATUS:16:NORMAL,-1023229528754013265
  LOAD:7113:1.63730480619E11
  SCHEMA:10:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:7274:0.5988024473190308
  NET_VERSION:1:10
  HOST_ID:2:7888c077-346b-4e09-96b0-9f6376b8594f
  TOKENS:15:
/10.x.x.c
  STATUS:732:NORMAL,-111717275923547
  LOAD:245839:1.71409806942E11
  SCHEMA:237168:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:245989:0.0
  NET_VERSION:1:10
  HOST_ID:2:329b288e-c5b5-4b55-b75e-fbe9243e75fa
  RPC_READY:763:true
  TOKENS:731:
/10.x.x.d
  STATUS:14:NORMAL,-1004942496246544417
  LOAD:313125:1.74447964917E11
  SCHEMA:304268:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:313215:0.25641027092933655
  NET_VERSION:1:10
  HOST_ID:2:07038c11-d200-46a0-9f6a-6e2465580fb1
  RPC_READY:56:true
  TOKENS:13:
/10.x.x.e
  STATUS:520:NORMAL,-1058809960483771749
  LOAD:276118:1.87831573032E11
  SCHEMA:267327:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:276217:0.32786884903907776
  NET_VERSION:1:10
  HOST_ID:2:c35b5d51-2d14-4334-9ffc-726f9dd8a214
  RPC_READY:550:true
  TOKENS:519:
/10.x.x.f
  STATUS:1081:NORMAL,-1039671799603495012
  LOAD:239114:2.09082017545E11
  SCHEMA:230229:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:239180:0.5665722489356995
  NET_VERSION:1:10
  HOST_ID:2:f20f7a87-d5d2-4f38-a963-21e24167b8ac
  RPC_READY:1118:true
  TOKENS:1080:

## About load and tokens:
- While the load is pretty even, this does not apply to tokens; I guess we have
some table with uneven distribution. This should not be the case for the high
load tables, as partition keys are built with some 'id + '
- I was not able to find any documentation about the numbers printed next
to LOAD, SCHEMA, SEVERITY, RPC_READY ... Is there any doc around?

# Tombstones
No ERRORs, only WARNs about a very specific table that we are aware of. It
is an append-only table read by Spark from a batch job. (I guess it is a
read_repair_chance or DTCS misconfiguration.)

## Closing note!
We are on very old m1.xlarge instances: 4 vCPUs and RAID 0 (stripe) across the 4
spinning drives, with some changes to cassandra.yaml:

- dynamic_snitch: false
- concurrent_reads: 48
- concurrent_compactors: 

Re: Read timeouts when performing rolling restart

2018-09-12 Thread Alain RODRIGUEZ
Hello Riccardo

How come that a single node is impacting the whole cluster?
>

It sounds weird indeed.

Is there a way to further delay the native transport startup?


You can configure 'start_native_transport: false' in 'cassandra.yaml'. (
https://github.com/apache/cassandra/blob/cassandra-3.0.6/conf/cassandra.yaml#L496
)
Then 'nodetool enablebinary' (
http://cassandra.apache.org/doc/latest/tools/nodetool/enablebinary.html)
when you are ready for it.

But I would consider this as a workaround, and it might not even work, I
hope it does though :).

Any hint on troubleshooting it further?
>

The version of Cassandra is quite an early Cassandra 3+. It's probably
worth considering a move to 3.0.17, if not to solve this issue, then to avoid
other issues that have been fixed since then.
To know if that would really help you, you can go through
https://github.com/apache/cassandra/blob/cassandra-3.0.17/CHANGES.txt

I am not too sure about what is going on, but here are some other things I
would look at to try to understand this:

Are all the nodes agreeing on the schema?
'nodetool describecluster'

Are all the keyspaces using the 'NetworkTopologyStrategy' and a replication
factor of 2+?
'cqlsh -e "DESCRIBE KEYSPACES;" '

What snitch are you using (in cassandra.yaml)?

What does ownership look like?
'nodetool status '

What about gossip?
'nodetool gossipinfo' or 'nodetool gossipinfo | grep STATUS' maybe.

A tombstone issue?
https://support.datastax.com/hc/en-us/articles/204612559-ReadTimeoutException-seen-when-using-the-java-driver-caused-by-excessive-tombstones

Any ERROR or WARN in the logs after the restart on this node and on other
nodes (you would see the tombstone issue here)?
'grep -e "WARN" -e "ERROR" /var/log/cassandra/system.log'

I hope one of those will help, let us know if you need help to interpret
some of the outputs,

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


On Wed, 12 Sept 2018 at 10:59, Riccardo Ferrari  wrote:

> Hi list,
>
> We are seeing the following behaviour when performing a rolling restart:
>
> On the node I need to restart:
> *  I run the 'nodetool drain'
> * Then 'service cassandra restart'
>
> so far so good. The load increase on the other 5 nodes is negligible.
> The node is generally out of service just for the time of the restart (i.e.
> a cassandra.yaml update).
>
> When the node comes back up and switches on the native transport, I start to
> see lots of read timeouts in our various services:
>
> com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra
> timeout during read query at consistency LOCAL_ONE (1 responses were
> required but only 0 replica responded)
>
> Indeed the restarting node has a huge peak in system load, because of
> hints and compactions; nevertheless I don't notice a load increase on the
> other 5 nodes.
>
> Specs:
> 6 nodes cluster on Cassandra 3.0.6
> - keyspace RF=3
>
> Java driver 3.5.1:
> - DefaultRetryPolicy
> - default LoadBalancingPolicy (that should be DCAwareRoundRobinPolicy)
>
> QUESTIONS:
> How come that a single node is impacting the whole cluster?
> Is there a way to further delay the native transport startup?
> Any hint on troubleshooting it further?
>
> Thanks
>


Read timeouts when performing rolling restart

2018-09-12 Thread Riccardo Ferrari
Hi list,

We are seeing the following behaviour when performing a rolling restart:

On the node I need to restart:
*  I run the 'nodetool drain'
* Then 'service cassandra restart'

so far so good. The load increase on the other 5 nodes is negligible.
The node is generally out of service just for the time of the restart (i.e.
a cassandra.yaml update).

When the node comes back up and switches on the native transport, I start to
see lots of read timeouts in our various services:

com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout
during read query at consistency LOCAL_ONE (1 responses were required but
only 0 replica responded)

Indeed the restarting node has a huge peak in system load, because of
hints and compactions; nevertheless I don't notice a load increase on the
other 5 nodes.

Specs:
6 nodes cluster on Cassandra 3.0.6
- keyspace RF=3

Java driver 3.5.1:
- DefaultRetryPolicy
- default LoadBalancingPolicy (that should be DCAwareRoundRobinPolicy)

QUESTIONS:
How come that a single node is impacting the whole cluster?
Is there a way to further delay the native transport startup?
Any hint on troubleshooting it further?

Thanks


Re: Read timeouts

2017-05-16 Thread Nitan Kainth
Thank you Jeff.

We are at Cassandra 3.0.10

Will look forward to upgrade or enable driver logging.

> On May 16, 2017, at 11:44 AM, Jeff Jirsa <jji...@apache.org> wrote:
> 
> 
> 
> On 2017-05-16 08:53 (-0700), Nitan Kainth <ni...@bamlabs.com> wrote: 
>> Hi,
>> 
>> We see read timeouts intermittently, mostly noticed after they have occurred.
>> The timeouts are not consistent and do not occur in the hundreds at a time.
>> 
>> 1. Is a read timeout considered a Dropped Mutation?
> 
> No, a dropped mutation is a failed write, not a failed read.
> 
>> 2. What is the best way to nail down the exact cause of these scattered timeouts?
>> 
> 
> First, be aware that tombstone overwhelming exceptions also get propagated as 
> read timeouts - you should check your logs for warnings about tombstone 
> problems.
> 
> Second, you need to identify the slow queries somehow. You have a few options:
> 
> 1) If you happen to be running 3.10 or newer , turn on the slow query log ( 
> https://issues.apache.org/jira/browse/CASSANDRA-12403 ) . 3.10 is the newest 
> release, and may not be fully stable, so you probably don't want to upgrade 
> to 3.10 JUST to get this feature. But if you're already on that version, 
> definitely use that tool.
> 
> 2) Some drivers have a log-slow-queries feature. Consider turning that on, 
> and let the application side log the slow queries. It's possible that you 
> have a bad partition or two, and you may see patterns there.
> 
> 3) Probabilistic tracing - you can tell cassandra to trace 1% of your 
> queries, and hope you catch a timeout. It'll be unpleasant to track alone - 
> this is really a last-resort type option, because you'll need to dig through 
> that trace table to find the outliers after the fact.
> 
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
> 



Re: Read timeouts

2017-05-16 Thread Jeff Jirsa


On 2017-05-16 08:53 (-0700), Nitan Kainth <ni...@bamlabs.com> wrote: 
> Hi,
> 
> We see read timeouts intermittently, mostly noticed after they have occurred.
> The timeouts are not consistent and do not occur in the hundreds at a time.
> 
> 1. Is a read timeout considered a Dropped Mutation?

No, a dropped mutation is a failed write, not a failed read.

> 2. What is the best way to nail down the exact cause of these scattered timeouts?
> 

First, be aware that tombstone overwhelming exceptions also get propagated as 
read timeouts - you should check your logs for warnings about tombstone 
problems.
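
For example (log path assumed, adjust to your install):

    grep -i -e "tombstone" /var/log/cassandra/system.log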

Second, you need to identify the slow queries somehow. You have a few options:

1) If you happen to be running 3.10 or newer , turn on the slow query log ( 
https://issues.apache.org/jira/browse/CASSANDRA-12403 ) . 3.10 is the newest 
release, and may not be fully stable, so you probably don't want to upgrade to 
3.10 JUST to get this feature. But if you're already on that version, 
definitely use that tool.

2) Some drivers have a log-slow-queries feature. Consider turning that on, and 
let the application side log the slow queries. It's possible that you have a 
bad partition or two, and you may see patterns there.

3) Probabilistic tracing - you can tell cassandra to trace 1% of your queries, 
and hope you catch a timeout. It'll be unpleasant to track alone - this is 
really a last-resort type option, because you'll need to dig through that trace 
table to find the outliers after the fact.
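
Rough sketches of options 1 and 3 (the yaml key is the 3.10+ setting added by
CASSANDRA-12403; the threshold and probability values here are just illustrative):

    # option 1: cassandra.yaml on 3.10+, log queries slower than 500 ms
    slow_query_log_timeout_in_ms: 500

    # option 3: trace ~1% of queries, then look for outliers afterwards
    nodetool settraceprobability 0.01
    cqlsh -e "SELECT session_id, duration, request FROM system_traces.sessions LIMIT 20;"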



-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Read timeouts

2017-05-16 Thread Nitan Kainth
Hi,

We see read timeouts intermittently, mostly noticed after they have occurred.
The timeouts are not consistent and do not occur in the hundreds at a time.

1. Is a read timeout considered a Dropped Mutation?
2. What is the best way to nail down the exact cause of these scattered timeouts?

Thank you.
-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Read timeouts on primary key queries

2016-09-15 Thread Joseph Tech
>> We have seen read timeout issues in Cassandra due to a high droppable
>> tombstone ratio for a repository.
>>
>> Please check for a high droppable tombstone ratio for your repo.
>>
>> On Mon, Sep 5, 2016 at 8:11 PM, Romain Hardouin <romainh...@yahoo.fr>
>> wrote:
>>
>> Yes dclocal_read_repair_chance will reduce the cross-DC traffic and
>> latency, so you can swap the values ( https://issues.apache.org/ji
>> ra/browse/CASSANDRA-7320
>> <https://issues.apache.org/jira/browse/CASSANDRA-7320> ). I guess the
>> sstable_size_in_mb was set to 50 because back in the day (C* 1.0) the
>> default size was way too small: 5 MB. So maybe someone in your company
>> tried "10 * the default" i.e. 50 MB. Now the default is 160 MB. I don't say
>> to change the value but just keep in mind that you're using a small value
>> here, it could help you someday.
>>
>> Regarding the cells, the histograms shows an *estimation* of the min,
>> p50, ..., p99, max of cells based on SSTables metadata. On your screenshot,
>> the Max is 4768. So you have a partition key with ~ 4768 cells. The p99 is
>> 1109, so 99% of your partition keys have less than (or equal to) 1109
>> cells.
>> You can see these data of a given sstable with the tool sstablemetadata.
>>
>> Best,
>>
>> Romain
>>
>>
>>
>> Le Lundi 5 septembre 2016 15h17, Joseph Tech <jaalex.t...@gmail.com> a
>> écrit :
>>
>>
>> Thanks, Romain . We will try to enable the DEBUG logging (assuming it
>> won't clog the logs much) . Regarding the table configs, read_repair_chance
>> must be carried over from older versions - mostly defaults. I think 
>> sstable_size_in_mb
>> was set to limit the max SSTable size, though i am not sure on the reason
>> for the 50 MB value.
>>
>> Does setting dclocal_read_repair_chance help in reducing cross-DC
>> traffic (haven't looked into this parameter, just going by the name).
>>
>> By the cell count definition : is it incremented based on the number of
>> writes for a given name(key?) and value. This table is heavy on reads and
>> writes. If so, the value should be much higher?
>>
>> On Mon, Sep 5, 2016 at 7:35 AM, Romain Hardouin <romainh...@yahoo.fr>
>> wrote:
>>
>> Hi,
>>
>> Try to put org.apache.cassandra.db. ConsistencyLevel at DEBUG level, it
>> could help to find a regular pattern. By the way, I see that you have set a
>> global read repair chance:
>> read_repair_chance = 0.1
>> And not the local read repair:
>> dclocal_read_repair_chance = 0.0
>> Is there any reason to do that or is it just the old (pre 2.0.9) default
>> configuration?
>>
>> The cell count is the number of triplets: (name, value, timestamp)
>>
>> Also, I see that you have set sstable_size_in_mb at 50 MB. What is the
>> rational behind this? (Yes I'm curious :-) ). Anyway your "SSTables per
>> read" are good.
>>
>> Best,
>>
>> Romain
>>
>> Le Lundi 5 septembre 2016 13h32, Joseph Tech <jaalex.t...@gmail.com> a
>> écrit :
>>
>>
>> Hi Ryan,
>>
>> Attached are the cfhistograms run within few mins of each other. On the
>> surface, don't see anything which indicates too much skewing (assuming
>> skewing ==keys spread across many SSTables) . Please confirm. Related to
>> this, what does the "cell count" metric indicate ; didn't find a clear
>> explanation in the documents.
>>
>> Thanks,
>> Joseph
>>
>>
>> On Thu, Sep 1, 2016 at 6:30 PM, Ryan Svihla <r...@foundev.pro> wrote:
>>
>> Have you looked at cfhistograms/tablehistograms your data maybe just
>> skewed (most likely explanation is probably the correct one here)
>>
>> Regard,
>>
>> Ryan Svihla
>>
>> _
>> From: Joseph Tech <jaalex.t...@gmail.com>
>> Sent: Wednesday, August 31, 2016 11:16 PM
>> Subject: Re: Read timeouts on primary key queries
>> To: <user@cassandra.apache.org>
>>
>>
>>
>> Patrick,
>>
>> The desc table is below (only col names changed) :
>>
>> CREATE TABLE db.tbl (
>> id1 text,
>> id2 text,
>> id3 text,
>> id4 text,
>> f1 text,
>> f2 map<text, text>,
>> f3 map<text, text>,
>> created timestamp,
>> updated timestamp,
>> PRIMARY KEY (id1, id2, id3, id4)
>> ) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC)
>> AND bloom_filter_fp_ch

Re: Read timeouts on primary key queries

2016-09-07 Thread Romain Hardouin
 be 
carried over from older versions - mostly defaults. I think sstable_size_in_mb 
was set to limit the max SSTable size, though i am not sure on the reason for 
the 50 MB value.
Does setting dclocal_read_repair_chance help in reducing cross-DC traffic 
(haven't looked into this parameter, just going by the name).

By the cell count definition : is it incremented based on the number of writes 
for a given name(key?) and value. This table is heavy on reads and writes. If 
so, the value should be much higher?
On Mon, Sep 5, 2016 at 7:35 AM, Romain Hardouin <romainh...@yahoo.fr> wrote:

Hi,
Try to put org.apache.cassandra.db.ConsistencyLevel at DEBUG level, it could
help to find a regular pattern. By the way, I see that you have set a global
read repair chance:
    read_repair_chance = 0.1
And not the local read repair:
    dclocal_read_repair_chance = 0.0
Is there any reason to do that or is it just the old (pre 2.0.9) default
configuration?
The cell count is the number of triplets: (name, value, timestamp)
Also, I see that you have set sstable_size_in_mb at 50 MB. What is the rationale
behind this? (Yes I'm curious :-) ). Anyway your "SSTables per read" are good.
Best,
Romain
Le Lundi 5 septembre 2016 13h32, Joseph Tech <jaalex.t...@gmail.com> a 
écrit :
 

 Hi Ryan,
Attached are the cfhistograms run within few mins of each other. On the 
surface, don't see anything which indicates too much skewing (assuming skewing 
==keys spread across many SSTables) . Please confirm. Related to this, what 
does the "cell count" metric indicate ; didn't find a clear explanation in the 
documents.
Thanks,Joseph

On Thu, Sep 1, 2016 at 6:30 PM, Ryan Svihla <r...@foundev.pro> wrote:

 Have you looked at cfhistograms/tablehistograms your data maybe just skewed 
(most likely explanation is probably the correct one here)

Regard,
Ryan Svihla
 _
From: Joseph Tech <jaalex.t...@gmail.com>
Sent: Wednesday, August 31, 2016 11:16 PM
Subject: Re: Read timeouts on primary key queries
To: <user@cassandra.apache.org>


Patrick,
The desc table is below (only col names changed) : 
CREATE TABLE db.tbl (
    id1 text,
    id2 text,
    id3 text,
    id4 text,
    f1 text,
    f2 map<text, text>,
    f3 map<text, text>,
    created timestamp,
    updated timestamp,
    PRIMARY KEY (id1, id2, id3, id4)
) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'sstable_size_in_mb': '50', 'class':
'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
    AND compression = {'sstable_compression':
'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.1
    AND speculative_retry = '99.0PERCENTILE';

and the query is select * from tbl where id1=? and id2=? and id3=? and id4=?

The timeouts happen within ~2s to ~5s, while the successful calls have avg of
8ms and p99 of 15s. These times are seen from app side, the actual query times
would be slightly lower.

Is there a way to capture traces only when queries take longer than a specified
duration? We can't enable tracing in production given the volume of traffic.
We see that the same query which timed out works fine later, so not sure if the
trace of a successful run would help.

Thanks,
Joseph

On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin <pmcfa...@gmail.com> wrote:

If you are getting a timeout on one table, then a mismatch of RF and node count 
doesn't seem as likely. 
Time to look at your query. You said it was a 'select * from table where key=?' 
type query. I would next use the trace facility in cqlsh to investigate 
further. That's a good way to find hard to find issues. You should be looking 
for clear ledge where you go from single digit ms to 4 or 5 digit ms times. 
The other place to look is your data model for that table if you want to post 
the output from a desc table.
Patrick


On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech <jaalex.t...@gmail.com> wrote:

On further analysis, this issue happens only on 1 table in the KS which has the 
max reads. 
@Atul, I will look at system health, but didnt see anything standing out from 
GC logs. (using JDK 1.8_92 with G1GC). 
@Patrick , could you please elaborate the "mismatch on node count + RF" part.
On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha <atul.sar...@snapdeal.com> wrote:

There could be many reasons for this if it is intermittent. CPU usage + I/O 
wait status. As read are I/O intensive, your IOPS requi

Re: Read timeouts on primary key queries

2016-09-07 Thread Joseph Tech
 99% of your partition keys have less than (or equal to) 1109
> cells.
> You can see these data of a given sstable with the tool sstablemetadata.
>
> Best,
>
> Romain
>
>
>
> Le Lundi 5 septembre 2016 15h17, Joseph Tech <jaalex.t...@gmail.com> a
> écrit :
>
>
> Thanks, Romain . We will try to enable the DEBUG logging (assuming it
> won't clog the logs much) . Regarding the table configs, read_repair_chance
> must be carried over from older versions - mostly defaults. I think 
> sstable_size_in_mb
> was set to limit the max SSTable size, though i am not sure on the reason
> for the 50 MB value.
>
> Does setting dclocal_read_repair_chance help in reducing cross-DC traffic
> (haven't looked into this parameter, just going by the name).
>
> By the cell count definition : is it incremented based on the number of
> writes for a given name(key?) and value. This table is heavy on reads and
> writes. If so, the value should be much higher?
>
> On Mon, Sep 5, 2016 at 7:35 AM, Romain Hardouin <romainh...@yahoo.fr>
> wrote:
>
> Hi,
>
> Try to put org.apache.cassandra.db. ConsistencyLevel at DEBUG level, it
> could help to find a regular pattern. By the way, I see that you have set a
> global read repair chance:
> read_repair_chance = 0.1
> And not the local read repair:
> dclocal_read_repair_chance = 0.0
> Is there any reason to do that or is it just the old (pre 2.0.9) default
> configuration?
>
> The cell count is the number of triplets: (name, value, timestamp)
>
> Also, I see that you have set sstable_size_in_mb at 50 MB. What is the
> rational behind this? (Yes I'm curious :-) ). Anyway your "SSTables per
> read" are good.
>
> Best,
>
> Romain
>
> Le Lundi 5 septembre 2016 13h32, Joseph Tech <jaalex.t...@gmail.com> a
> écrit :
>
>
> Hi Ryan,
>
> Attached are the cfhistograms run within few mins of each other. On the
> surface, don't see anything which indicates too much skewing (assuming
> skewing ==keys spread across many SSTables) . Please confirm. Related to
> this, what does the "cell count" metric indicate ; didn't find a clear
> explanation in the documents.
>
> Thanks,
> Joseph
>
>
> On Thu, Sep 1, 2016 at 6:30 PM, Ryan Svihla <r...@foundev.pro> wrote:
>
> Have you looked at cfhistograms/tablehistograms your data maybe just
> skewed (most likely explanation is probably the correct one here)
>
> Regard,
>
> Ryan Svihla
>
> _
> From: Joseph Tech <jaalex.t...@gmail.com>
> Sent: Wednesday, August 31, 2016 11:16 PM
> Subject: Re: Read timeouts on primary key queries
> To: <user@cassandra.apache.org>
>
>
>
> Patrick,
>
> The desc table is below (only col names changed) :
>
> CREATE TABLE db.tbl (
> id1 text,
> id2 text,
> id3 text,
> id4 text,
> f1 text,
> f2 map<text, text>,
> f3 map<text, text>,
> created timestamp,
> updated timestamp,
> PRIMARY KEY (id1, id2, id3, id4)
> ) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC)
> AND bloom_filter_fp_chance = 0.01
> AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
> AND comment = ''
> AND compaction = {'sstable_size_in_mb': '50', 'class':
> 'org.apache.cassandra.db. compaction. LeveledCompactionStrategy'}
> AND compression = {'sstable_compression': 'org.apache.cassandra.io.
> compress.LZ4Compressor'}
> AND dclocal_read_repair_chance = 0.0
> AND default_time_to_live = 0
> AND gc_grace_seconds = 864000
> AND max_index_interval = 2048
> AND memtable_flush_period_in_ms = 0
> AND min_index_interval = 128
> AND read_repair_chance = 0.1
> AND speculative_retry = '99.0PERCENTILE';
>
> and the query is select * from tbl where id1=? and id2=? and id3=? and
> id4=?
>
> The timeouts happen within ~2s to ~5s, while the successful calls have avg
> of 8ms and p99 of 15s. These times are seen from app side, the actual query
> times would be slightly lower.
>
> Is there a way to capture traces only when queries take longer than a
> specified duration? . We can't enable tracing in production given the
> volume of traffic. We see that the same query which timed out works fine
> later, so not sure if the trace of a successful run would help.
>
> Thanks,
> Joseph
>
>
> On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin <pmcfa...@gmail.com>
> wrote:
>
> If you are getting a timeout on one table, then a mismatch of RF and node
> count doesn't seem as likely.
>
> Time to look at your query. You said it was a 'select * from table wh

Re: Read timeouts on primary key queries

2016-09-06 Thread Romain Hardouin
ou have set a global 
read repair chance:    read_repair_chance = 0.1And not the local read repair:   
 dclocal_read_repair_chance = 0.0 Is there any reason to do that or is it just 
the old (pre 2.0.9) default configuration? 
The cell count is the number of triplets: (name, value, timestamp)
Also, I see that you have set sstable_size_in_mb at 50 MB. What is the rational 
behind this? (Yes I'm curious :-) ). Anyway your "SSTables per read" are good.
Best,
Romain
Le Lundi 5 septembre 2016 13h32, Joseph Tech <jaalex.t...@gmail.com> a 
écrit :
 

 Hi Ryan,
Attached are the cfhistograms run within few mins of each other. On the 
surface, don't see anything which indicates too much skewing (assuming skewing 
==keys spread across many SSTables) . Please confirm. Related to this, what 
does the "cell count" metric indicate ; didn't find a clear explanation in the 
documents.
Thanks,Joseph

On Thu, Sep 1, 2016 at 6:30 PM, Ryan Svihla <r...@foundev.pro> wrote:

 Have you looked at cfhistograms/tablehistograms your data maybe just skewed 
(most likely explanation is probably the correct one here)

Regard,
Ryan Svihla
 _
From: Joseph Tech <jaalex.t...@gmail.com>
Sent: Wednesday, August 31, 2016 11:16 PM
Subject: Re: Read timeouts on primary key queries
To: <user@cassandra.apache.org>


Patrick,
The desc table is below (only col names changed) : 
CREATE TABLE db.tbl (    id1 text,    id2 text,    id3 text,    id4 text,    f1 
text,    f2 map<text, text>,    f3 map<text, text>,    created timestamp,    
updated timestamp,    PRIMARY KEY (id1, id2, id3, id4)) WITH CLUSTERING ORDER 
BY (id2 ASC, id3 ASC, id4 ASC)    AND bloom_filter_fp_chance = 0.01    AND 
caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'    AND comment = ''    
AND compaction = {'sstable_size_in_mb': '50', 'class': 
'org.apache.cassandra.db. compaction. LeveledCompactionStrategy'}    AND 
compression = {'sstable_compression': 'org.apache.cassandra.io. 
compress.LZ4Compressor'}    AND dclocal_read_repair_chance = 0.0    AND 
default_time_to_live = 0    AND gc_grace_seconds = 864000    AND 
max_index_interval = 2048    AND memtable_flush_period_in_ms = 0    AND 
min_index_interval = 128    AND read_repair_chance = 0.1    AND 
speculative_retry = '99.0PERCENTILE';
and the query is select * from tbl where id1=? and id2=? and id3=? and id4=?
The timeouts happen within ~2s to ~5s, while the successful calls have avg of 
8ms and p99 of 15s. These times are seen from app side, the actual query times 
would be slightly lower. 
Is there a way to capture traces only when queries take longer than a specified 
duration? . We can't enable tracing in production given the volume of traffic. 
We see that the same query which timed out works fine later, so not sure if the 
trace of a successful run would help.
Thanks,Joseph

On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin <pmcfa...@gmail.com> wrote:

If you are getting a timeout on one table, then a mismatch of RF and node count 
doesn't seem as likely. 
Time to look at your query. You said it was a 'select * from table where key=?' 
type query. I would next use the trace facility in cqlsh to investigate 
further. That's a good way to find hard to find issues. You should be looking 
for clear ledge where you go from single digit ms to 4 or 5 digit ms times. 
The other place to look is your data model for that table if you want to post 
the output from a desc table.
Patrick


On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech <jaalex.t...@gmail.com> wrote:

On further analysis, this issue happens only on 1 table in the KS which has the 
max reads. 
@Atul, I will look at system health, but didnt see anything standing out from 
GC logs. (using JDK 1.8_92 with G1GC). 
@Patrick , could you please elaborate the "mismatch on node count + RF" part.
On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha <atul.sar...@snapdeal.com> wrote:

There could be many reasons for this if it is intermittent. CPU usage + I/O 
wait status. As read are I/O intensive, your IOPS requirement should be met 
that time load. Heap issue if CPU is busy for GC only. Network health could be 
the reason. So better to look system health during that time when it comes.

-- -- 
-- ---
Atul Saroha
Lead Software Engineer
M: +91 8447784271 T: +91 124-415-6069 EXT: 12369
Plot # 362, ASF Centre - Tower A, Udyog Vihar,
 Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA
On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech <jaalex.t...@gmail.com> wrote:

Hi Patrick,
The nodetool status shows all nodes up and normal now. From OpsCenter "Event 
Log" , there are some nodes reported as being down/up etc. during the timeframe 
of timeout, but these are Search workload nodes from the remote (non-local) D

Re: Read timeouts on primary key queries

2016-09-05 Thread Joseph Tech
Attached are the sstablemeta outputs from 2 SSTables of size 28 MB and 52
MB (out2). The records are inserted with different TTLs based on their
nature ; test records with 1 day, typeA records with 6 months, typeB
records with 1 year etc. There are also explicit DELETEs from this table,
though it's much lower than the rate of inserts.

I am not sure how to interpret this output, or if it's the right SSTables
that were picked. Please advise. Is there a way to get the SSTables
corresponding to the keys that timed out, even though they are accessible later?

On Mon, Sep 5, 2016 at 10:58 PM, Anshu Vajpayee <anshu.vajpa...@gmail.com>
wrote:

> We have seen read time out issue in cassandra due to high droppable
> tombstone ratio for repository.
>
> Please check for high droppable tombstone ratio for your repo.
>
> On Mon, Sep 5, 2016 at 8:11 PM, Romain Hardouin <romainh...@yahoo.fr>
> wrote:
>
>> Yes dclocal_read_repair_chance will reduce the cross-DC traffic and
>> latency, so you can swap the values ( https://issues.apache.org/ji
>> ra/browse/CASSANDRA-7320 ). I guess the sstable_size_in_mb was set to 50
>> because back in the day (C* 1.0) the default size was way too small: 5 MB.
>> So maybe someone in your company tried "10 * the default" i.e. 50 MB. Now
>> the default is 160 MB. I don't say to change the value but just keep in
>> mind that you're using a small value here, it could help you someday.
>>
>> Regarding the cells, the histograms shows an *estimation* of the min,
>> p50, ..., p99, max of cells based on SSTables metadata. On your screenshot,
>> the Max is 4768. So you have a partition key with ~ 4768 cells. The p99 is
>> 1109, so 99% of your partition keys have less than (or equal to) 1109
>> cells.
>> You can see these data of a given sstable with the tool sstablemetadata.
>>
>> Best,
>>
>> Romain
>>
>>
>>
>> Le Lundi 5 septembre 2016 15h17, Joseph Tech <jaalex.t...@gmail.com> a
>> écrit :
>>
>>
>> Thanks, Romain . We will try to enable the DEBUG logging (assuming it
>> won't clog the logs much) . Regarding the table configs, read_repair_chance
>> must be carried over from older versions - mostly defaults. I think 
>> sstable_size_in_mb
>> was set to limit the max SSTable size, though i am not sure on the reason
>> for the 50 MB value.
>>
>> Does setting dclocal_read_repair_chance help in reducing cross-DC
>> traffic (haven't looked into this parameter, just going by the name).
>>
>> By the cell count definition : is it incremented based on the number of
>> writes for a given name(key?) and value. This table is heavy on reads and
>> writes. If so, the value should be much higher?
>>
>> On Mon, Sep 5, 2016 at 7:35 AM, Romain Hardouin <romainh...@yahoo.fr>
>> wrote:
>>
>> Hi,
>>
>> Try to put org.apache.cassandra.db. ConsistencyLevel at DEBUG level, it
>> could help to find a regular pattern. By the way, I see that you have set a
>> global read repair chance:
>> read_repair_chance = 0.1
>> And not the local read repair:
>> dclocal_read_repair_chance = 0.0
>> Is there any reason to do that or is it just the old (pre 2.0.9) default
>> configuration?
>>
>> The cell count is the number of triplets: (name, value, timestamp)
>>
>> Also, I see that you have set sstable_size_in_mb at 50 MB. What is the
>> rational behind this? (Yes I'm curious :-) ). Anyway your "SSTables per
>> read" are good.
>>
>> Best,
>>
>> Romain
>>
>> Le Lundi 5 septembre 2016 13h32, Joseph Tech <jaalex.t...@gmail.com> a
>> écrit :
>>
>>
>> Hi Ryan,
>>
>> Attached are the cfhistograms run within few mins of each other. On the
>> surface, don't see anything which indicates too much skewing (assuming
>> skewing ==keys spread across many SSTables) . Please confirm. Related to
>> this, what does the "cell count" metric indicate ; didn't find a clear
>> explanation in the documents.
>>
>> Thanks,
>> Joseph
>>
>>
>> On Thu, Sep 1, 2016 at 6:30 PM, Ryan Svihla <r...@foundev.pro> wrote:
>>
>> Have you looked at cfhistograms/tablehistograms your data maybe just
>> skewed (most likely explanation is probably the correct one here)
>>
>> Regard,
>>
>> Ryan Svihla
>>
>> _
>> From: Joseph Tech <jaalex.t...@gmail.com>
>> Sent: Wednesday, August 31, 2016 11:16 PM
>> Subject: Re: Read timeouts on primary key queries
>> To: <user@cassandra.apache.org>
>>
>>
&

Re: Read timeouts on primary key queries

2016-09-05 Thread Anshu Vajpayee
We have seen read timeout issues in Cassandra due to a high droppable
tombstone ratio for a repository.

Please check for a high droppable tombstone ratio for your repo.

On Mon, Sep 5, 2016 at 8:11 PM, Romain Hardouin <romainh...@yahoo.fr> wrote:

> Yes dclocal_read_repair_chance will reduce the cross-DC traffic and
> latency, so you can swap the values ( https://issues.apache.org/
> jira/browse/CASSANDRA-7320 ). I guess the sstable_size_in_mb was set to
> 50 because back in the day (C* 1.0) the default size was way too small: 5
> MB. So maybe someone in your company tried "10 * the default" i.e. 50 MB.
> Now the default is 160 MB. I don't say to change the value but just keep in
> mind that you're using a small value here, it could help you someday.
>
> Regarding the cells, the histograms shows an *estimation* of the min, p50,
> ..., p99, max of cells based on SSTables metadata. On your screenshot, the
> Max is 4768. So you have a partition key with ~ 4768 cells. The p99 is
> 1109, so 99% of your partition keys have less than (or equal to) 1109
> cells.
> You can see these data of a given sstable with the tool sstablemetadata.
>
> Best,
>
> Romain
>
>
>
> Le Lundi 5 septembre 2016 15h17, Joseph Tech <jaalex.t...@gmail.com> a
> écrit :
>
>
> Thanks, Romain . We will try to enable the DEBUG logging (assuming it
> won't clog the logs much) . Regarding the table configs, read_repair_chance
> must be carried over from older versions - mostly defaults. I think 
> sstable_size_in_mb
> was set to limit the max SSTable size, though i am not sure on the reason
> for the 50 MB value.
>
> Does setting dclocal_read_repair_chance help in reducing cross-DC traffic
> (haven't looked into this parameter, just going by the name).
>
> By the cell count definition : is it incremented based on the number of
> writes for a given name(key?) and value. This table is heavy on reads and
> writes. If so, the value should be much higher?
>
> On Mon, Sep 5, 2016 at 7:35 AM, Romain Hardouin <romainh...@yahoo.fr>
> wrote:
>
> Hi,
>
> Try to put org.apache.cassandra.db. ConsistencyLevel at DEBUG level, it
> could help to find a regular pattern. By the way, I see that you have set a
> global read repair chance:
> read_repair_chance = 0.1
> And not the local read repair:
> dclocal_read_repair_chance = 0.0
> Is there any reason to do that or is it just the old (pre 2.0.9) default
> configuration?
>
> The cell count is the number of triplets: (name, value, timestamp)
>
> Also, I see that you have set sstable_size_in_mb at 50 MB. What is the
> rational behind this? (Yes I'm curious :-) ). Anyway your "SSTables per
> read" are good.
>
> Best,
>
> Romain
>
> Le Lundi 5 septembre 2016 13h32, Joseph Tech <jaalex.t...@gmail.com> a
> écrit :
>
>
> Hi Ryan,
>
> Attached are the cfhistograms run within few mins of each other. On the
> surface, don't see anything which indicates too much skewing (assuming
> skewing ==keys spread across many SSTables) . Please confirm. Related to
> this, what does the "cell count" metric indicate ; didn't find a clear
> explanation in the documents.
>
> Thanks,
> Joseph
>
>
> On Thu, Sep 1, 2016 at 6:30 PM, Ryan Svihla <r...@foundev.pro> wrote:
>
> Have you looked at cfhistograms/tablehistograms your data maybe just
> skewed (most likely explanation is probably the correct one here)
>
> Regard,
>
> Ryan Svihla
>
> _
> From: Joseph Tech <jaalex.t...@gmail.com>
> Sent: Wednesday, August 31, 2016 11:16 PM
> Subject: Re: Read timeouts on primary key queries
> To: <user@cassandra.apache.org>
>
>
>
> Patrick,
>
> The desc table is below (only col names changed) :
>
> CREATE TABLE db.tbl (
> id1 text,
> id2 text,
> id3 text,
> id4 text,
> f1 text,
> f2 map<text, text>,
> f3 map<text, text>,
> created timestamp,
> updated timestamp,
> PRIMARY KEY (id1, id2, id3, id4)
> ) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC)
> AND bloom_filter_fp_chance = 0.01
> AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
> AND comment = ''
> AND compaction = {'sstable_size_in_mb': '50', 'class':
> 'org.apache.cassandra.db. compaction. LeveledCompactionStrategy'}
> AND compression = {'sstable_compression': 'org.apache.cassandra.io.
> compress.LZ4Compressor'}
> AND dclocal_read_repair_chance = 0.0
> AND default_time_to_live = 0
> AND gc_grace_seconds = 864000
> AND max_index_interval = 2048
> AND memtable_flush_period_in_ms = 0
&

Re: Read timeouts on primary key queries

2016-09-05 Thread Romain Hardouin
Yes dclocal_read_repair_chance will reduce the cross-DC traffic and latency, so 
you can swap the values ( https://issues.apache.org/jira/browse/CASSANDRA-7320 
). I guess the sstable_size_in_mb was set to 50 because back in the day (C* 
1.0) the default size was way too small: 5 MB. So maybe someone in your company 
tried "10 * the default" i.e. 50 MB. Now the default is 160 MB. I don't say to 
change the value but just keep in mind that you're using a small value here, it 
could help you someday.
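
A quick sketch of the swap as CQL (table name taken from the earlier desc table
output, exact values up to you):

    ALTER TABLE db.tbl
      WITH read_repair_chance = 0.0
      AND dclocal_read_repair_chance = 0.1;
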
Regarding the cells, the histograms show an *estimation* of the min, p50, ...,
p99, max of cells based on SSTables metadata. On your screenshot, the Max is 
4768. So you have a partition key with ~ 4768 cells. The p99 is 1109, so 99% of 
your partition keys have less than (or equal to) 1109 cells. You can see these 
data of a given sstable with the tool sstablemetadata.
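
For example (sstable path assumed, it depends on your data directory layout):

    sstablemetadata /var/lib/cassandra/data/<keyspace>/<table>-*/*-Data.db
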
Best,
Romain
 

Le Lundi 5 septembre 2016 15h17, Joseph Tech <jaalex.t...@gmail.com> a 
écrit :
 

 Thanks, Romain . We will try to enable the DEBUG logging (assuming it won't 
clog the logs much) . Regarding the table configs, read_repair_chance must be 
carried over from older versions - mostly defaults. I think sstable_size_in_mb 
was set to limit the max SSTable size, though i am not sure on the reason for 
the 50 MB value.
Does setting dclocal_read_repair_chance help in reducing cross-DC traffic 
(haven't looked into this parameter, just going by the name).

By the cell count definition : is it incremented based on the number of writes 
for a given name(key?) and value. This table is heavy on reads and writes. If 
so, the value should be much higher?
On Mon, Sep 5, 2016 at 7:35 AM, Romain Hardouin <romainh...@yahoo.fr> wrote:

Hi,
Try to put org.apache.cassandra.db. ConsistencyLevel at DEBUG level, it could 
help to find a regular pattern. By the way, I see that you have set a global 
read repair chance:    read_repair_chance = 0.1And not the local read repair:   
 dclocal_read_repair_chance = 0.0 Is there any reason to do that or is it just 
the old (pre 2.0.9) default configuration? 
The cell count is the number of triplets: (name, value, timestamp)
Also, I see that you have set sstable_size_in_mb at 50 MB. What is the rational 
behind this? (Yes I'm curious :-) ). Anyway your "SSTables per read" are good.
Best,
Romain
Le Lundi 5 septembre 2016 13h32, Joseph Tech <jaalex.t...@gmail.com> a 
écrit :
 

 Hi Ryan,
Attached are the cfhistograms run within few mins of each other. On the 
surface, don't see anything which indicates too much skewing (assuming skewing 
==keys spread across many SSTables) . Please confirm. Related to this, what 
does the "cell count" metric indicate ; didn't find a clear explanation in the 
documents.
Thanks,Joseph

On Thu, Sep 1, 2016 at 6:30 PM, Ryan Svihla <r...@foundev.pro> wrote:

 Have you looked at cfhistograms/tablehistograms your data maybe just skewed 
(most likely explanation is probably the correct one here)

Regard,
Ryan Svihla
 _
From: Joseph Tech <jaalex.t...@gmail.com>
Sent: Wednesday, August 31, 2016 11:16 PM
Subject: Re: Read timeouts on primary key queries
To: <user@cassandra.apache.org>


Patrick,
The desc table is below (only col names changed) : 
CREATE TABLE db.tbl (    id1 text,    id2 text,    id3 text,    id4 text,    f1 
text,    f2 map<text, text>,    f3 map<text, text>,    created timestamp,    
updated timestamp,    PRIMARY KEY (id1, id2, id3, id4)) WITH CLUSTERING ORDER 
BY (id2 ASC, id3 ASC, id4 ASC)    AND bloom_filter_fp_chance = 0.01    AND 
caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'    AND comment = ''    
AND compaction = {'sstable_size_in_mb': '50', 'class': 
'org.apache.cassandra.db. compaction. LeveledCompactionStrategy'}    AND 
compression = {'sstable_compression': 'org.apache.cassandra.io. 
compress.LZ4Compressor'}    AND dclocal_read_repair_chance = 0.0    AND 
default_time_to_live = 0    AND gc_grace_seconds = 864000    AND 
max_index_interval = 2048    AND memtable_flush_period_in_ms = 0    AND 
min_index_interval = 128    AND read_repair_chance = 0.1    AND 
speculative_retry = '99.0PERCENTILE';
and the query is select * from tbl where id1=? and id2=? and id3=? and id4=?
The timeouts happen within ~2s to ~5s, while the successful calls have avg of 
8ms and p99 of 15s. These times are seen from app side, the actual query times 
would be slightly lower. 
Is there a way to capture traces only when queries take longer than a specified 
duration? . We can't enable tracing in production given the volume of traffic. 
We see that the same query which timed out works fine later, so not sure if the 
trace of a successful run would help.
Thanks,Joseph

On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin <pmcfa...@gmail.com> wrote:

If you are getting a timeout on one table, then a mismatch of RF and node co

Re: Read timeouts on primary key queries

2016-09-05 Thread Joseph Tech
Thanks, Romain . We will try to enable the DEBUG logging (assuming it won't
clog the logs much) . Regarding the table configs, read_repair_chance must
be carried over from older versions - mostly defaults. I think
sstable_size_in_mb
was set to limit the max SSTable size, though i am not sure on the reason
for the 50 MB value.

Does setting dclocal_read_repair_chance help in reducing cross-DC traffic
(haven't looked into this parameter, just going by the name).

By the cell count definition : is it incremented based on the number of
writes for a given name(key?) and value. This table is heavy on reads and
writes. If so, the value should be much higher?

On Mon, Sep 5, 2016 at 7:35 AM, Romain Hardouin <romainh...@yahoo.fr> wrote:

> Hi,
>
> Try to put org.apache.cassandra.db.ConsistencyLevel at DEBUG level, it
> could help to find a regular pattern. By the way, I see that you have set a
> global read repair chance:
> read_repair_chance = 0.1
> And not the local read repair:
> dclocal_read_repair_chance = 0.0
> Is there any reason to do that or is it just the old (pre 2.0.9) default
> configuration?
>
> The cell count is the number of triplets: (name, value, timestamp)
>
> Also, I see that you have set sstable_size_in_mb at 50 MB. What is the
> rational behind this? (Yes I'm curious :-) ). Anyway your "SSTables per
> read" are good.
>
> Best,
>
> Romain
>
> Le Lundi 5 septembre 2016 13h32, Joseph Tech <jaalex.t...@gmail.com> a
> écrit :
>
>
> Hi Ryan,
>
> Attached are the cfhistograms run within few mins of each other. On the
> surface, don't see anything which indicates too much skewing (assuming
> skewing ==keys spread across many SSTables) . Please confirm. Related to
> this, what does the "cell count" metric indicate ; didn't find a clear
> explanation in the documents.
>
> Thanks,
> Joseph
>
>
> On Thu, Sep 1, 2016 at 6:30 PM, Ryan Svihla <r...@foundev.pro> wrote:
>
> Have you looked at cfhistograms/tablehistograms your data maybe just
> skewed (most likely explanation is probably the correct one here)
>
> Regard,
>
> Ryan Svihla
>
> _
> From: Joseph Tech <jaalex.t...@gmail.com>
> Sent: Wednesday, August 31, 2016 11:16 PM
> Subject: Re: Read timeouts on primary key queries
> To: <user@cassandra.apache.org>
>
>
>
> Patrick,
>
> The desc table is below (only col names changed) :
>
> CREATE TABLE db.tbl (
> id1 text,
> id2 text,
> id3 text,
> id4 text,
> f1 text,
> f2 map<text, text>,
> f3 map<text, text>,
> created timestamp,
> updated timestamp,
> PRIMARY KEY (id1, id2, id3, id4)
> ) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC)
> AND bloom_filter_fp_chance = 0.01
> AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
> AND comment = ''
> AND compaction = {'sstable_size_in_mb': '50', 'class':
> 'org.apache.cassandra.db. compaction. LeveledCompactionStrategy'}
> AND compression = {'sstable_compression': 'org.apache.cassandra.io.
> compress.LZ4Compressor'}
> AND dclocal_read_repair_chance = 0.0
> AND default_time_to_live = 0
> AND gc_grace_seconds = 864000
> AND max_index_interval = 2048
> AND memtable_flush_period_in_ms = 0
> AND min_index_interval = 128
> AND read_repair_chance = 0.1
> AND speculative_retry = '99.0PERCENTILE';
>
> and the query is select * from tbl where id1=? and id2=? and id3=? and
> id4=?
>
> The timeouts happen within ~2s to ~5s, while the successful calls have avg
> of 8ms and p99 of 15s. These times are seen from app side, the actual query
> times would be slightly lower.
>
> Is there a way to capture traces only when queries take longer than a
> specified duration? . We can't enable tracing in production given the
> volume of traffic. We see that the same query which timed out works fine
> later, so not sure if the trace of a successful run would help.
>
> Thanks,
> Joseph
>
>
> On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin <pmcfa...@gmail.com>
> wrote:
>
> If you are getting a timeout on one table, then a mismatch of RF and node
> count doesn't seem as likely.
>
> Time to look at your query. You said it was a 'select * from table where
> key=?' type query. I would next use the trace facility in cqlsh to
> investigate further. That's a good way to find hard to find issues. You
> should be looking for clear ledge where you go from single digit ms to 4 or
> 5 digit ms times.
>
> The other place to look is your data model for that table if you want to
> post the output from a desc table.
>

Re: Read timeouts on primary key queries

2016-09-05 Thread Romain Hardouin
Hi,
Try to put org.apache.cassandra.db.ConsistencyLevel at DEBUG level, it could
help to find a regular pattern. By the way, I see that you have set a global
read repair chance:
    read_repair_chance = 0.1
And not the local read repair:
    dclocal_read_repair_chance = 0.0
Is there any reason to do that or is it just the old (pre 2.0.9) default
configuration?
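
One way to flip that at runtime without touching logback.xml (run it on the
affected node and revert to INFO once you have enough data):

    nodetool setlogginglevel org.apache.cassandra.db.ConsistencyLevel DEBUG
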
The cell count is the number of triplets: (name, value, timestamp)
Also, I see that you have set sstable_size_in_mb at 50 MB. What is the rationale
behind this? (Yes I'm curious :-) ). Anyway your "SSTables per read" are good.
Best,
Romain
Le Lundi 5 septembre 2016 13h32, Joseph Tech <jaalex.t...@gmail.com> a 
écrit :
 

 Hi Ryan,
Attached are the cfhistograms run within few mins of each other. On the 
surface, don't see anything which indicates too much skewing (assuming skewing 
==keys spread across many SSTables) . Please confirm. Related to this, what 
does the "cell count" metric indicate ; didn't find a clear explanation in the 
documents.
Thanks,Joseph

On Thu, Sep 1, 2016 at 6:30 PM, Ryan Svihla <r...@foundev.pro> wrote:

 Have you looked at cfhistograms/tablehistograms your data maybe just skewed 
(most likely explanation is probably the correct one here)

Regard,
Ryan Svihla
 _
From: Joseph Tech <jaalex.t...@gmail.com>
Sent: Wednesday, August 31, 2016 11:16 PM
Subject: Re: Read timeouts on primary key queries
To: <user@cassandra.apache.org>


Patrick,
The desc table is below (only col names changed) : 
CREATE TABLE db.tbl (    id1 text,    id2 text,    id3 text,    id4 text,    f1 
text,    f2 map<text, text>,    f3 map<text, text>,    created timestamp,    
updated timestamp,    PRIMARY KEY (id1, id2, id3, id4)) WITH CLUSTERING ORDER 
BY (id2 ASC, id3 ASC, id4 ASC)    AND bloom_filter_fp_chance = 0.01    AND 
caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'    AND comment = ''    
AND compaction = {'sstable_size_in_mb': '50', 'class': 
'org.apache.cassandra.db. compaction. LeveledCompactionStrategy'}    AND 
compression = {'sstable_compression': 'org.apache.cassandra.io. 
compress.LZ4Compressor'}    AND dclocal_read_repair_chance = 0.0    AND 
default_time_to_live = 0    AND gc_grace_seconds = 864000    AND 
max_index_interval = 2048    AND memtable_flush_period_in_ms = 0    AND 
min_index_interval = 128    AND read_repair_chance = 0.1    AND 
speculative_retry = '99.0PERCENTILE';
and the query is select * from tbl where id1=? and id2=? and id3=? and id4=?
The timeouts happen within ~2s to ~5s, while the successful calls have avg of 
8ms and p99 of 15s. These times are seen from app side, the actual query times 
would be slightly lower. 
Is there a way to capture traces only when queries take longer than a specified 
duration? . We can't enable tracing in production given the volume of traffic. 
We see that the same query which timed out works fine later, so not sure if the 
trace of a successful run would help.
Thanks,Joseph

On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin <pmcfa...@gmail.com> wrote:

If you are getting a timeout on one table, then a mismatch of RF and node count 
doesn't seem as likely. 
Time to look at your query. You said it was a 'select * from table where key=?' 
type query. I would next use the trace facility in cqlsh to investigate 
further. That's a good way to find hard to find issues. You should be looking 
for clear ledge where you go from single digit ms to 4 or 5 digit ms times. 
The other place to look is your data model for that table if you want to post 
the output from a desc table.
Patrick


On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech <jaalex.t...@gmail.com> wrote:

On further analysis, this issue happens only on 1 table in the KS which has the 
max reads. 
@Atul, I will look at system health, but didnt see anything standing out from 
GC logs. (using JDK 1.8_92 with G1GC). 
@Patrick , could you please elaborate the "mismatch on node count + RF" part.
On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha <atul.sar...@snapdeal.com> wrote:

There could be many reasons for this if it is intermittent. CPU usage + I/O 
wait status. As read are I/O intensive, your IOPS requirement should be met 
that time load. Heap issue if CPU is busy for GC only. Network health could be 
the reason. So better to look system health during that time when it comes.

-- -- 
-- ---
Atul Saroha
Lead Software Engineer
M: +91 8447784271 T: +91 124-415-6069 EXT: 12369
Plot # 362, ASF Centre - Tower A, Udyog Vihar,
 Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA
On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech <jaalex.t...@gmail.com> wrote:

Hi Patrick,
The nodetool status shows all nodes up and normal now. From OpsCenter "Event 
Log" , there are so

Re: Read timeouts on primary key queries

2016-09-05 Thread Joseph Tech
Hi Ryan,

Attached are the cfhistograms run within few mins of each other. On the
surface, don't see anything which indicates too much skewing (assuming
skewing ==keys spread across many SSTables) . Please confirm. Related to
this, what does the "cell count" metric indicate ; didn't find a clear
explanation in the documents.

Thanks,
Joseph


On Thu, Sep 1, 2016 at 6:30 PM, Ryan Svihla <r...@foundev.pro> wrote:

> Have you looked at cfhistograms/tablehistograms your data maybe just
> skewed (most likely explanation is probably the correct one here)
>
> Regard,
>
> Ryan Svihla
>
> _
> From: Joseph Tech <jaalex.t...@gmail.com>
> Sent: Wednesday, August 31, 2016 11:16 PM
> Subject: Re: Read timeouts on primary key queries
> To: <user@cassandra.apache.org>
>
>
>
> Patrick,
>
> The desc table is below (only col names changed) :
>
> CREATE TABLE db.tbl (
> id1 text,
> id2 text,
> id3 text,
> id4 text,
> f1 text,
> f2 map<text, text>,
> f3 map<text, text>,
> created timestamp,
> updated timestamp,
> PRIMARY KEY (id1, id2, id3, id4)
> ) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC)
> AND bloom_filter_fp_chance = 0.01
> AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
> AND comment = ''
> AND compaction = {'sstable_size_in_mb': '50', 'class':
> 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
> AND compression = {'sstable_compression': 'org.apache.cassandra.io.
> compress.LZ4Compressor'}
> AND dclocal_read_repair_chance = 0.0
> AND default_time_to_live = 0
> AND gc_grace_seconds = 864000
> AND max_index_interval = 2048
> AND memtable_flush_period_in_ms = 0
> AND min_index_interval = 128
> AND read_repair_chance = 0.1
> AND speculative_retry = '99.0PERCENTILE';
>
> and the query is select * from tbl where id1=? and id2=? and id3=? and
> id4=?
>
> The timeouts happen within ~2s to ~5s, while the successful calls have avg
> of 8ms and p99 of 15s. These times are seen from app side, the actual query
> times would be slightly lower.
>
> Is there a way to capture traces only when queries take longer than a
> specified duration? . We can't enable tracing in production given the
> volume of traffic. We see that the same query which timed out works fine
> later, so not sure if the trace of a successful run would help.
>
> Thanks,
> Joseph
>
>
> On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin <pmcfa...@gmail.com>
> wrote:
>
>> If you are getting a timeout on one table, then a mismatch of RF and node
>> count doesn't seem as likely.
>>
>> Time to look at your query. You said it was a 'select * from table where
>> key=?' type query. I would next use the trace facility in cqlsh to
>> investigate further. That's a good way to find hard to find issues. You
>> should be looking for clear ledge where you go from single digit ms to 4 or
>> 5 digit ms times.
>>
>> The other place to look is your data model for that table if you want to
>> post the output from a desc table.
>>
>> Patrick
>>
>>
>>
>> On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech <jaalex.t...@gmail.com>
>> wrote:
>>
>>> On further analysis, this issue happens only on 1 table in the KS which
>>> has the max reads.
>>>
>>> @Atul, I will look at system health, but didnt see anything standing out
>>> from GC logs. (using JDK 1.8_92 with G1GC).
>>>
>>> @Patrick , could you please elaborate the "mismatch on node count + RF"
>>> part.
>>>
>>> On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha <atul.sar...@snapdeal.com>
>>> wrote:
>>>
>>>> There could be many reasons for this if it is intermittent. CPU usage +
>>>> I/O wait status. As read are I/O intensive, your IOPS requirement should be
>>>> met that time load. Heap issue if CPU is busy for GC only. Network health
>>>> could be the reason. So better to look system health during that time when
>>>> it comes.
>>>>
>>>> 
>>>> -
>>>> Atul Saroha
>>>> *Lead Software Engineer*
>>>> *M*: +91 8447784271 *T*: +91 124-415-6069 *EXT*: 12369
>>>> Plot # 362, ASF Centre - Tower A, Udyog Vihar,
>>>>  Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA
>>>>
>>>> On Tue, Aug 30,

Re: Read timeouts on primary key queries

2016-09-01 Thread Ryan Svihla
Have you looked at cfhistograms/tablehistograms your data maybe just skewed 
(most likely explanation is probably the correct one here)

Regard,
Ryan Svihla

_
From: Joseph Tech <jaalex.t...@gmail.com>
Sent: Wednesday, August 31, 2016 11:16 PM
Subject: Re: Read timeouts on primary key queries
To:  <user@cassandra.apache.org>


Patrick,
The desc table is below (only col names changed) : 
CREATE TABLE db.tbl (
    id1 text,
    id2 text,
    id3 text,
    id4 text,
    f1 text,
    f2 map<text, text>,
    f3 map<text, text>,
    created timestamp,
    updated timestamp,
    PRIMARY KEY (id1, id2, id3, id4)
) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'sstable_size_in_mb': '50', 'class':
'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
    AND compression = {'sstable_compression':
'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.1
    AND speculative_retry = '99.0PERCENTILE';

and the query is select * from tbl where id1=? and id2=? and id3=? and id4=?

The timeouts happen within ~2s to ~5s, while the successful calls have avg of
8ms and p99 of 15s. These times are seen from app side, the actual query times
would be slightly lower.

Is there a way to capture traces only when queries take longer than a specified
duration? We can't enable tracing in production given the volume of traffic.
We see that the same query which timed out works fine later, so not sure if the
trace of a successful run would help.

Thanks,
Joseph

On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin <pmcfa...@gmail.com> wrote:
If you are getting a timeout on one table, then a mismatch of RF and node count 
doesn't seem as likely. 
Time to look at your query. You said it was a 'select * from table where key=?' 
type query. I would next use the trace facility in cqlsh to investigate 
further. That's a good way to find hard to find issues. You should be looking 
for clear ledge where you go from single digit ms to 4 or 5 digit ms times. 
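
(In cqlsh that is roughly:

    TRACING ON;
    -- run the suspect query here
    TRACING OFF;

and the per-step elapsed times are printed after each result.)
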
The other place to look is your data model for that table if you want to post 
the output from a desc table.
Patrick


On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech <jaalex.t...@gmail.com> wrote:
On further analysis, this issue happens only on 1 table in the KS which has the 
max reads. 
@Atul, I will look at system health, but didnt see anything standing out from 
GC logs. (using JDK 1.8_92 with G1GC). 
@Patrick , could you please elaborate the "mismatch on node count + RF" part.
On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha <atul.sar...@snapdeal.com> wrote:
There could be many reasons for this if it is intermittent. CPU usage + I/O 
wait status. As read are I/O intensive, your IOPS requirement should be met 
that time load. Heap issue if CPU is busy for GC only. Network health could be 
the reason. So better to look system health during that time when it comes.

-
Atul Saroha
Lead Software Engineer
M: +91 8447784271 T: +91 124-415-6069 EXT: 12369
Plot # 362, ASF Centre - Tower A, Udyog Vihar,
 Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA
On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech <jaalex.t...@gmail.com> wrote:
Hi Patrick,
The nodetool status shows all nodes up and normal now. From OpsCenter "Event 
Log" , there are some nodes reported as being down/up etc. during the timeframe 
of timeout, but these are Search workload nodes from the remote (non-local) DC. 
The RF is 3 and there are 9 nodes per DC.
Thanks,Joseph
On Mon, Aug 29, 2016 at 11:07 PM, Patrick McFadin <pmcfa...@gmail.com> wrote:
You aren't achieving quorum on your reads as the error is explains. That means 
you either have some nodes down or your topology is not matching up. The fact 
you are using LOCAL_QUORUM might point to a datacenter mis-match on node count 
+ RF. 
What does your nodetool status look like?
Patrick
On Mon, Aug 29, 2016 at 10:14 AM, Joseph Tech <jaalex.t...@gmail.com> wrote:
Hi,
We recently started getting intermittent timeouts on primary key queries 
(select * from table where key=)
The error is : com.datastax.driver.core.exceptions.ReadTimeoutException: 
Cassandra timeout during read query at consistency LOCAL_QUORUM (2 responses 
were required but only 1 replica
a responded)
The same query would work fine when tried directly from cqlsh. There are no 
indications in system.log for the table in question, though there were 
compactions in prog

Re: Read timeouts on primary key queries

2016-08-31 Thread Joseph Tech
Patrick,

The desc table is below (only col names changed) :

CREATE TABLE db.tbl (
id1 text,
id2 text,
id3 text,
id4 text,
f1 text,
f2 map<text, text>,
f3 map<text, text>,
created timestamp,
updated timestamp,
PRIMARY KEY (id1, id2, id3, id4)
) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'sstable_size_in_mb': '50', 'class':
'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
AND compression = {'sstable_compression':
'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.0
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.1
AND speculative_retry = '99.0PERCENTILE';

and the query is select * from tbl where id1=? and id2=? and id3=? and id4=?

The timeouts happen within ~2s to ~5s, while the successful calls have an avg
of 8ms and p99 of 15s. These times are seen from the app side; the actual query
times would be slightly lower.

Is there a way to capture traces only when queries take longer than a
specified duration? We can't enable tracing in production given the
volume of traffic. We see that the same query which timed out works fine
later, so we are not sure whether the trace of a successful run would help.

Thanks,
Joseph
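
As far as I know there is no built-in "trace only the slow queries" switch in this
version. Two workarounds are commonly used: server-side, "nodetool settraceprobability"
traces a small fraction of all requests; client-side, you can enable tracing on a random
sample of statements and only fetch the trace when the call turns out to be slow. Below
is a minimal sketch of the client-side approach against the DataStax Java driver 2.x/3.x;
the class name, sample rate and threshold are invented for illustration.

import com.datastax.driver.core.*;
import java.util.concurrent.ThreadLocalRandom;

public class SampledSlowQueryTracer {
    // Illustrative knobs, not tuned values.
    private static final double TRACE_SAMPLE_RATE = 0.001; // trace ~0.1% of calls
    private static final long SLOW_THRESHOLD_MS = 500;     // only report traces slower than this

    public static ResultSet execute(Session session, BoundStatement stmt) {
        boolean sampled = ThreadLocalRandom.current().nextDouble() < TRACE_SAMPLE_RATE;
        if (sampled) {
            stmt.enableTracing(); // per-statement tracing, not cluster-wide
        }
        long start = System.nanoTime();
        ResultSet rs = session.execute(stmt);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        if (sampled && elapsedMs > SLOW_THRESHOLD_MS) {
            // Fetching the trace issues extra reads against system_traces, so only
            // do it for calls that were both sampled and slow.
            QueryTrace trace = rs.getExecutionInfo().getQueryTrace();
            if (trace != null) {
                System.out.printf("slow query [%s] took %d ms, coordinator=%s, traceId=%s%n",
                        stmt.preparedStatement().getQueryString(),
                        elapsedMs, trace.getCoordinator(), trace.getTraceId());
            }
        }
        return rs;
    }
}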


On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin  wrote:

> If you are getting a timeout on one table, then a mismatch of RF and node
> count doesn't seem as likely.
>
> Time to look at your query. You said it was a 'select * from table where
> key=?' type query. I would next use the trace facility in cqlsh to
> investigate further. That's a good way to find hard to find issues. You
> should be looking for clear ledge where you go from single digit ms to 4 or
> 5 digit ms times.
>
> The other place to look is your data model for that table if you want to
> post the output from a desc table.
>
> Patrick
>
>
>
> On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech 
> wrote:
>
>> On further analysis, this issue happens only on 1 table in the KS which
>> has the max reads.
>>
>> @Atul, I will look at system health, but didnt see anything standing out
>> from GC logs. (using JDK 1.8_92 with G1GC).
>>
>> @Patrick , could you please elaborate the "mismatch on node count + RF"
>> part.
>>
>> On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha 
>> wrote:
>>
>>> There could be many reasons for this if it is intermittent. CPU usage +
>>> I/O wait status. As read are I/O intensive, your IOPS requirement should be
>>> met that time load. Heap issue if CPU is busy for GC only. Network health
>>> could be the reason. So better to look system health during that time when
>>> it comes.
>>>
>>> 
>>> -
>>> Atul Saroha
>>> *Lead Software Engineer*
>>> *M*: +91 8447784271 *T*: +91 124-415-6069 *EXT*: 12369
>>> Plot # 362, ASF Centre - Tower A, Udyog Vihar,
>>>  Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA
>>>
>>> On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech 
>>> wrote:
>>>
 Hi Patrick,

 The nodetool status shows all nodes up and normal now. From OpsCenter
 "Event Log" , there are some nodes reported as being down/up etc. during
 the timeframe of timeout, but these are Search workload nodes from the
 remote (non-local) DC. The RF is 3 and there are 9 nodes per DC.

 Thanks,
 Joseph

 On Mon, Aug 29, 2016 at 11:07 PM, Patrick McFadin 
 wrote:

> You aren't achieving quorum on your reads as the error is explains.
> That means you either have some nodes down or your topology is not 
> matching
> up. The fact you are using LOCAL_QUORUM might point to a datacenter
> mis-match on node count + RF.
>
> What does your nodetool status look like?
>
> Patrick
>
> On Mon, Aug 29, 2016 at 10:14 AM, Joseph Tech 
> wrote:
>
>> Hi,
>>
>> We recently started getting intermittent timeouts on primary key
>> queries (select * from table where key=)
>>
>> The error is : com.datastax.driver.core.exceptions.ReadTimeoutException:
>> Cassandra timeout during read query at consistency LOCAL_QUORUM (2
>> responses were required but only 1 replica
>> a responded)
>>
>> The same query would work fine when tried directly from cqlsh. There
>> are no indications in system.log for the table in question, though there
>> were compactions in progress for tables in another keyspace which is more
>> frequently accessed.
>>
>> My understanding is that the 

Re: Read timeouts on primary key queries

2016-08-31 Thread Patrick McFadin
If you are getting a timeout on one table, then a mismatch of RF and node
count doesn't seem as likely.

Time to look at your query. You said it was a 'select * from table where
key=?' type query. I would next use the trace facility in cqlsh to
investigate further. That's a good way to find hard-to-find issues. You
should be looking for a clear ledge where you go from single-digit ms to 4- or
5-digit ms times.

The other place to look is your data model for that table if you want to
post the output from a desc table.

Patrick
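
In cqlsh the trace facility is just "TRACING ON;" before the SELECT. The same information
is reachable per statement from the Java driver, which can help when cqlsh alone does not
reproduce the slow case. A rough sketch follows, assuming the trace accessors of the
2.x/3.x DataStax Java driver (getDurationMicros, getEvents, getSourceElapsedMicros);
treat the exact method names as an assumption and check them against your driver version.

import com.datastax.driver.core.*;

public class TraceInspector {
    // Prints the server-side trace events for one statement so you can spot the
    // "ledge" where elapsed time jumps from single-digit ms to hundreds or thousands.
    public static void traceOnce(Session session, Statement stmt) {
        stmt.enableTracing();
        ResultSet rs = session.execute(stmt);
        QueryTrace trace = rs.getExecutionInfo().getQueryTrace();
        if (trace == null) {
            return; // tracing data not available for this request
        }
        System.out.printf("total duration: %d us%n", trace.getDurationMicros());
        for (QueryTrace.Event e : trace.getEvents()) {
            // Source-elapsed is the per-node elapsed time in microseconds.
            System.out.printf("%8d us  %-16s %s%n",
                    e.getSourceElapsedMicros(), e.getSource(), e.getDescription());
        }
    }
}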



On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech  wrote:

> On further analysis, this issue happens only on 1 table in the KS which
> has the max reads.
>
> @Atul, I will look at system health, but didnt see anything standing out
> from GC logs. (using JDK 1.8_92 with G1GC).
>
> @Patrick , could you please elaborate the "mismatch on node count + RF"
> part.
>
> On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha 
> wrote:
>
>> There could be many reasons for this if it is intermittent. CPU usage +
>> I/O wait status. As read are I/O intensive, your IOPS requirement should be
>> met that time load. Heap issue if CPU is busy for GC only. Network health
>> could be the reason. So better to look system health during that time when
>> it comes.
>>
>> 
>> -
>> Atul Saroha
>> *Lead Software Engineer*
>> *M*: +91 8447784271 *T*: +91 124-415-6069 *EXT*: 12369
>> Plot # 362, ASF Centre - Tower A, Udyog Vihar,
>>  Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA
>>
>> On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech 
>> wrote:
>>
>>> Hi Patrick,
>>>
>>> The nodetool status shows all nodes up and normal now. From OpsCenter
>>> "Event Log" , there are some nodes reported as being down/up etc. during
>>> the timeframe of timeout, but these are Search workload nodes from the
>>> remote (non-local) DC. The RF is 3 and there are 9 nodes per DC.
>>>
>>> Thanks,
>>> Joseph
>>>
>>> On Mon, Aug 29, 2016 at 11:07 PM, Patrick McFadin 
>>> wrote:
>>>
 You aren't achieving quorum on your reads as the error is explains.
 That means you either have some nodes down or your topology is not matching
 up. The fact you are using LOCAL_QUORUM might point to a datacenter
 mis-match on node count + RF.

 What does your nodetool status look like?

 Patrick

 On Mon, Aug 29, 2016 at 10:14 AM, Joseph Tech 
 wrote:

> Hi,
>
> We recently started getting intermittent timeouts on primary key
> queries (select * from table where key=)
>
> The error is : com.datastax.driver.core.exceptions.ReadTimeoutException:
> Cassandra timeout during read query at consistency LOCAL_QUORUM (2
> responses were required but only 1 replica
> a responded)
>
> The same query would work fine when tried directly from cqlsh. There
> are no indications in system.log for the table in question, though there
> were compactions in progress for tables in another keyspace which is more
> frequently accessed.
>
> My understanding is that the chances of primary key queries timing out
> is very minimal. Please share the possible reasons / ways to debug this
> issue.
>
> We are using Cassandra 2.1 (DSE 4.8.7).
>
> Thanks,
> Joseph
>
>
>
>

>>>
>>
>


Re: Read timeouts on primary key queries

2016-08-30 Thread Joseph Tech
On further analysis, this issue happens only on one table in the KS, which has
the max reads.

@Atul, I will look at system health, but didn't see anything standing out
from the GC logs (using JDK 1.8_92 with G1GC).

@Patrick, could you please elaborate on the "mismatch on node count + RF"
part.

On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha 
wrote:

> There could be many reasons for this if it is intermittent. CPU usage +
> I/O wait status. As read are I/O intensive, your IOPS requirement should be
> met that time load. Heap issue if CPU is busy for GC only. Network health
> could be the reason. So better to look system health during that time when
> it comes.
>
> 
> -
> Atul Saroha
> *Lead Software Engineer*
> *M*: +91 8447784271 *T*: +91 124-415-6069 *EXT*: 12369
> Plot # 362, ASF Centre - Tower A, Udyog Vihar,
>  Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA
>
> On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech 
> wrote:
>
>> Hi Patrick,
>>
>> The nodetool status shows all nodes up and normal now. From OpsCenter
>> "Event Log" , there are some nodes reported as being down/up etc. during
>> the timeframe of timeout, but these are Search workload nodes from the
>> remote (non-local) DC. The RF is 3 and there are 9 nodes per DC.
>>
>> Thanks,
>> Joseph
>>
>> On Mon, Aug 29, 2016 at 11:07 PM, Patrick McFadin 
>> wrote:
>>
>>> You aren't achieving quorum on your reads as the error is explains. That
>>> means you either have some nodes down or your topology is not matching up.
>>> The fact you are using LOCAL_QUORUM might point to a datacenter mis-match
>>> on node count + RF.
>>>
>>> What does your nodetool status look like?
>>>
>>> Patrick
>>>
>>> On Mon, Aug 29, 2016 at 10:14 AM, Joseph Tech 
>>> wrote:
>>>
 Hi,

 We recently started getting intermittent timeouts on primary key
 queries (select * from table where key=)

 The error is : com.datastax.driver.core.exceptions.ReadTimeoutException:
 Cassandra timeout during read query at consistency LOCAL_QUORUM (2
 responses were required but only 1 replica
 a responded)

 The same query would work fine when tried directly from cqlsh. There
 are no indications in system.log for the table in question, though there
 were compactions in progress for tables in another keyspace which is more
 frequently accessed.

 My understanding is that the chances of primary key queries timing out
 is very minimal. Please share the possible reasons / ways to debug this
 issue.

 We are using Cassandra 2.1 (DSE 4.8.7).

 Thanks,
 Joseph




>>>
>>
>


Re: Read timeouts on primary key queries

2016-08-30 Thread Atul Saroha
There could be many reasons for this if it is intermittent: CPU usage and I/O
wait status, for example. As reads are I/O intensive, your IOPS requirement
should be met under that load. It could be a heap issue if the CPU is busy with
GC only. Network health could also be the reason. So it is better to look at
system health during the time when it happens.

-
Atul Saroha
*Lead Software Engineer*
*M*: +91 8447784271 *T*: +91 124-415-6069 *EXT*: 12369
Plot # 362, ASF Centre - Tower A, Udyog Vihar,
 Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA

On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech  wrote:

> Hi Patrick,
>
> The nodetool status shows all nodes up and normal now. From OpsCenter
> "Event Log" , there are some nodes reported as being down/up etc. during
> the timeframe of timeout, but these are Search workload nodes from the
> remote (non-local) DC. The RF is 3 and there are 9 nodes per DC.
>
> Thanks,
> Joseph
>
> On Mon, Aug 29, 2016 at 11:07 PM, Patrick McFadin 
> wrote:
>
>> You aren't achieving quorum on your reads as the error is explains. That
>> means you either have some nodes down or your topology is not matching up.
>> The fact you are using LOCAL_QUORUM might point to a datacenter mis-match
>> on node count + RF.
>>
>> What does your nodetool status look like?
>>
>> Patrick
>>
>> On Mon, Aug 29, 2016 at 10:14 AM, Joseph Tech 
>> wrote:
>>
>>> Hi,
>>>
>>> We recently started getting intermittent timeouts on primary key queries
>>> (select * from table where key=)
>>>
>>> The error is : com.datastax.driver.core.exceptions.ReadTimeoutException:
>>> Cassandra timeout during read query at consistency LOCAL_QUORUM (2
>>> responses were required but only 1 replica
>>> a responded)
>>>
>>> The same query would work fine when tried directly from cqlsh. There are
>>> no indications in system.log for the table in question, though there were
>>> compactions in progress for tables in another keyspace which is more
>>> frequently accessed.
>>>
>>> My understanding is that the chances of primary key queries timing out
>>> is very minimal. Please share the possible reasons / ways to debug this
>>> issue.
>>>
>>> We are using Cassandra 2.1 (DSE 4.8.7).
>>>
>>> Thanks,
>>> Joseph
>>>
>>>
>>>
>>>
>>
>


Re: Read timeouts on primary key queries

2016-08-30 Thread Joseph Tech
Hi Patrick,

The nodetool status shows all nodes up and normal now. From the OpsCenter
"Event Log", there are some nodes reported as being down/up etc. during
the timeframe of the timeouts, but these are Search workload nodes from the
remote (non-local) DC. The RF is 3 and there are 9 nodes per DC.

Thanks,
Joseph

On Mon, Aug 29, 2016 at 11:07 PM, Patrick McFadin 
wrote:

> You aren't achieving quorum on your reads as the error is explains. That
> means you either have some nodes down or your topology is not matching up.
> The fact you are using LOCAL_QUORUM might point to a datacenter mis-match
> on node count + RF.
>
> What does your nodetool status look like?
>
> Patrick
>
> On Mon, Aug 29, 2016 at 10:14 AM, Joseph Tech 
> wrote:
>
>> Hi,
>>
>> We recently started getting intermittent timeouts on primary key queries
>> (select * from table where key=)
>>
>> The error is : com.datastax.driver.core.exceptions.ReadTimeoutException:
>> Cassandra timeout during read query at consistency LOCAL_QUORUM (2
>> responses were required but only 1 replica
>> a responded)
>>
>> The same query would work fine when tried directly from cqlsh. There are
>> no indications in system.log for the table in question, though there were
>> compactions in progress for tables in another keyspace which is more
>> frequently accessed.
>>
>> My understanding is that the chances of primary key queries timing out is
>> very minimal. Please share the possible reasons / ways to debug this issue.
>>
>> We are using Cassandra 2.1 (DSE 4.8.7).
>>
>> Thanks,
>> Joseph
>>
>>
>>
>>
>


Re: Read timeouts on primary key queries

2016-08-29 Thread Patrick McFadin
You aren't achieving quorum on your reads, as the error explains. That
means you either have some nodes down or your topology is not matching up.
The fact that you are using LOCAL_QUORUM might point to a datacenter mismatch
on node count + RF.

What does your nodetool status look like?

Patrick
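
As a side note: with RF 3, LOCAL_QUORUM needs 2 replicas in the local DC, and the
ReadTimeoutException itself carries how many acknowledgements arrived, which helps
separate a topology problem from a transient one. A minimal sketch, assuming the
exception accessors of the 2.x/3.x DataStax Java driver; the class and method names
below are illustrative and not from this thread.

import com.datastax.driver.core.*;
import com.datastax.driver.core.exceptions.ReadTimeoutException;

public class QuorumReadExample {
    public static Row readAtLocalQuorum(Session session, PreparedStatement ps, Object... keys) {
        Statement stmt = ps.bind(keys)
                .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM); // RF=3 -> 2 local replicas needed
        try {
            return session.execute(stmt).one();
        } catch (ReadTimeoutException e) {
            // Same numbers as the "2 responses were required but only 1 replica
            // responded" message, so they can be logged or alerted on.
            System.err.printf("read timeout at %s: %d/%d replicas answered, data retrieved=%s%n",
                    e.getConsistencyLevel(),
                    e.getReceivedAcknowledgements(),
                    e.getRequiredAcknowledgements(),
                    e.wasDataRetrieved());
            throw e;
        }
    }
}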

On Mon, Aug 29, 2016 at 10:14 AM, Joseph Tech  wrote:

> Hi,
>
> We recently started getting intermittent timeouts on primary key queries
> (select * from table where key=)
>
> The error is : com.datastax.driver.core.exceptions.ReadTimeoutException:
> Cassandra timeout during read query at consistency LOCAL_QUORUM (2
> responses were required but only 1 replica
> a responded)
>
> The same query would work fine when tried directly from cqlsh. There are
> no indications in system.log for the table in question, though there were
> compactions in progress for tables in another keyspace which is more
> frequently accessed.
>
> My understanding is that the chances of primary key queries timing out is
> very minimal. Please share the possible reasons / ways to debug this issue.
>
> We are using Cassandra 2.1 (DSE 4.8.7).
>
> Thanks,
> Joseph
>
>
>
>


Read timeouts on primary key queries

2016-08-29 Thread Joseph Tech
Hi,

We recently started getting intermittent timeouts on primary key queries
(select * from table where key=)

The error is: com.datastax.driver.core.exceptions.ReadTimeoutException:
Cassandra timeout during read query at consistency LOCAL_QUORUM (2
responses were required but only 1 replica responded)

The same query would work fine when tried directly from cqlsh. There are no
indications in system.log for the table in question, though there were
compactions in progress for tables in another keyspace which is more
frequently accessed.

My understanding is that the chances of primary key queries timing out are
very minimal. Please share the possible reasons / ways to debug this issue.

We are using Cassandra 2.1 (DSE 4.8.7).

Thanks,
Joseph


Re: Consistent read timeouts for bursts of reads

2016-03-04 Thread Mike Heffner
Emils,

We believe we've tracked it down to the following issue:
https://issues.apache.org/jira/browse/CASSANDRA-11302, introduced in 2.1.5.

We are running a build of 2.2.5 with that patch and so far have not seen
any more timeouts.

Mike

On Fri, Mar 4, 2016 at 3:14 AM, Emīls Šolmanis 
wrote:

> Mike,
>
> Is that where you've bisected it to having been introduced?
>
> I'll see what I can do, but doubt it, since we've long since upgraded prod
> to 2.2.4 (and stage before that) and the tests I'm running were for a new
> feature.
>
>
> On Fri, 4 Mar 2016 03:54 Mike Heffner,  wrote:
>
>> Emils,
>>
>> I realize this may be a big downgrade, but are you timeouts reproducible
>> under Cassandra 2.1.4?
>>
>> Mike
>>
>> On Thu, Feb 25, 2016 at 10:34 AM, Emīls Šolmanis <
>> emils.solma...@gmail.com> wrote:
>>
>>> Having had a read through the archives, I missed this at first, but this
>>> seems to be *exactly* like what we're experiencing.
>>>
>>> http://www.mail-archive.com/user@cassandra.apache.org/msg46064.html
>>>
>>> Only difference is we're getting this for reads and using CQL, but the
>>> behaviour is identical.
>>>
>>> On Thu, 25 Feb 2016 at 14:55 Emīls Šolmanis 
>>> wrote:
>>>
 Hello,

 We're having a problem with concurrent requests. It seems that whenever
 we try resolving more
 than ~ 15 queries at the same time, one or two get a read timeout and
 then succeed on a retry.

 We're running Cassandra 2.2.4 accessed via the 2.1.9 Datastax driver on
 AWS.

 What we've found while investigating:

  * this is not db-wide. Trying the same pattern against another table
 everything works fine.
  * it fails 1 or 2 requests regardless of how many are executed in
 parallel, i.e., it's still 1 or 2 when we ramp it up to ~ 120 concurrent
 requests and doesn't seem to scale up.
  * the problem is consistently reproducible. It happens both under
 heavier load and when just firing off a single batch of requests for
 testing.
  * tracing the faulty requests says everything is great. An example
 trace: https://gist.github.com/emilssolmanis/41e1e2ecdfd9a0569b1a
  * the only peculiar thing in the logs is there's no acknowledgement of
 the request being accepted by the server, as seen in
 https://gist.github.com/emilssolmanis/242d9d02a6d8fb91da8a
  * there's nothing funny in the timed out Cassandra node's logs around
 that time as far as I can tell, not even in the debug logs.

 Any ideas about what might be causing this, pointers to server config
 options, or how else we might debug this would be much appreciated.

 Kind regards,
 Emils


>>
>>
>> --
>>
>>   Mike Heffner 
>>   Librato, Inc.
>>
>>


-- 

  Mike Heffner 
  Librato, Inc.


Re: Consistent read timeouts for bursts of reads

2016-03-04 Thread Emīls Šolmanis
Mike,

Is that where you've bisected it to having been introduced?

I'll see what I can do, but doubt it, since we've long since upgraded prod
to 2.2.4 (and stage before that) and the tests I'm running were for a new
feature.

On Fri, 4 Mar 2016 03:54 Mike Heffner,  wrote:

> Emils,
>
> I realize this may be a big downgrade, but are you timeouts reproducible
> under Cassandra 2.1.4?
>
> Mike
>
> On Thu, Feb 25, 2016 at 10:34 AM, Emīls Šolmanis  > wrote:
>
>> Having had a read through the archives, I missed this at first, but this
>> seems to be *exactly* like what we're experiencing.
>>
>> http://www.mail-archive.com/user@cassandra.apache.org/msg46064.html
>>
>> Only difference is we're getting this for reads and using CQL, but the
>> behaviour is identical.
>>
>> On Thu, 25 Feb 2016 at 14:55 Emīls Šolmanis 
>> wrote:
>>
>>> Hello,
>>>
>>> We're having a problem with concurrent requests. It seems that whenever
>>> we try resolving more
>>> than ~ 15 queries at the same time, one or two get a read timeout and
>>> then succeed on a retry.
>>>
>>> We're running Cassandra 2.2.4 accessed via the 2.1.9 Datastax driver on
>>> AWS.
>>>
>>> What we've found while investigating:
>>>
>>>  * this is not db-wide. Trying the same pattern against another table
>>> everything works fine.
>>>  * it fails 1 or 2 requests regardless of how many are executed in
>>> parallel, i.e., it's still 1 or 2 when we ramp it up to ~ 120 concurrent
>>> requests and doesn't seem to scale up.
>>>  * the problem is consistently reproducible. It happens both under
>>> heavier load and when just firing off a single batch of requests for
>>> testing.
>>>  * tracing the faulty requests says everything is great. An example
>>> trace: https://gist.github.com/emilssolmanis/41e1e2ecdfd9a0569b1a
>>>  * the only peculiar thing in the logs is there's no acknowledgement of
>>> the request being accepted by the server, as seen in
>>> https://gist.github.com/emilssolmanis/242d9d02a6d8fb91da8a
>>>  * there's nothing funny in the timed out Cassandra node's logs around
>>> that time as far as I can tell, not even in the debug logs.
>>>
>>> Any ideas about what might be causing this, pointers to server config
>>> options, or how else we might debug this would be much appreciated.
>>>
>>> Kind regards,
>>> Emils
>>>
>>>
>
>
> --
>
>   Mike Heffner 
>   Librato, Inc.
>
>


Re: Consistent read timeouts for bursts of reads

2016-03-03 Thread Mike Heffner
Emils,

I realize this may be a big downgrade, but are your timeouts reproducible
under Cassandra 2.1.4?

Mike

On Thu, Feb 25, 2016 at 10:34 AM, Emīls Šolmanis 
wrote:

> Having had a read through the archives, I missed this at first, but this
> seems to be *exactly* like what we're experiencing.
>
> http://www.mail-archive.com/user@cassandra.apache.org/msg46064.html
>
> Only difference is we're getting this for reads and using CQL, but the
> behaviour is identical.
>
> On Thu, 25 Feb 2016 at 14:55 Emīls Šolmanis 
> wrote:
>
>> Hello,
>>
>> We're having a problem with concurrent requests. It seems that whenever
>> we try resolving more
>> than ~ 15 queries at the same time, one or two get a read timeout and
>> then succeed on a retry.
>>
>> We're running Cassandra 2.2.4 accessed via the 2.1.9 Datastax driver on
>> AWS.
>>
>> What we've found while investigating:
>>
>>  * this is not db-wide. Trying the same pattern against another table
>> everything works fine.
>>  * it fails 1 or 2 requests regardless of how many are executed in
>> parallel, i.e., it's still 1 or 2 when we ramp it up to ~ 120 concurrent
>> requests and doesn't seem to scale up.
>>  * the problem is consistently reproducible. It happens both under
>> heavier load and when just firing off a single batch of requests for
>> testing.
>>  * tracing the faulty requests says everything is great. An example
>> trace: https://gist.github.com/emilssolmanis/41e1e2ecdfd9a0569b1a
>>  * the only peculiar thing in the logs is there's no acknowledgement of
>> the request being accepted by the server, as seen in
>> https://gist.github.com/emilssolmanis/242d9d02a6d8fb91da8a
>>  * there's nothing funny in the timed out Cassandra node's logs around
>> that time as far as I can tell, not even in the debug logs.
>>
>> Any ideas about what might be causing this, pointers to server config
>> options, or how else we might debug this would be much appreciated.
>>
>> Kind regards,
>> Emils
>>
>>


-- 

  Mike Heffner 
  Librato, Inc.


Re: Consistent read timeouts for bursts of reads

2016-03-01 Thread Carlos Alonso
We have had similar issues sometimes.

Usually the problem was that the failing queries were reading the same
partition as another query that was still running, and that partition was too big.

The fact that it is reading the same partition is why your query works upon
retry. The fact that the partition (or the retrieved range) is too big is
why the nodes get overloaded and end up dropping the read requests.

If you see GC pressure, that would point towards my hypothesis too.

Hope this helps.

Carlos Alonso | Software Engineer | @calonso 

On 25 February 2016 at 16:34, Emīls Šolmanis 
wrote:

> Having had a read through the archives, I missed this at first, but this
> seems to be *exactly* like what we're experiencing.
>
> http://www.mail-archive.com/user@cassandra.apache.org/msg46064.html
>
> Only difference is we're getting this for reads and using CQL, but the
> behaviour is identical.
>
> On Thu, 25 Feb 2016 at 14:55 Emīls Šolmanis 
> wrote:
>
>> Hello,
>>
>> We're having a problem with concurrent requests. It seems that whenever
>> we try resolving more
>> than ~ 15 queries at the same time, one or two get a read timeout and
>> then succeed on a retry.
>>
>> We're running Cassandra 2.2.4 accessed via the 2.1.9 Datastax driver on
>> AWS.
>>
>> What we've found while investigating:
>>
>>  * this is not db-wide. Trying the same pattern against another table
>> everything works fine.
>>  * it fails 1 or 2 requests regardless of how many are executed in
>> parallel, i.e., it's still 1 or 2 when we ramp it up to ~ 120 concurrent
>> requests and doesn't seem to scale up.
>>  * the problem is consistently reproducible. It happens both under
>> heavier load and when just firing off a single batch of requests for
>> testing.
>>  * tracing the faulty requests says everything is great. An example
>> trace: https://gist.github.com/emilssolmanis/41e1e2ecdfd9a0569b1a
>>  * the only peculiar thing in the logs is there's no acknowledgement of
>> the request being accepted by the server, as seen in
>> https://gist.github.com/emilssolmanis/242d9d02a6d8fb91da8a
>>  * there's nothing funny in the timed out Cassandra node's logs around
>> that time as far as I can tell, not even in the debug logs.
>>
>> Any ideas about what might be causing this, pointers to server config
>> options, or how else we might debug this would be much appreciated.
>>
>> Kind regards,
>> Emils
>>
>>


Re: Consistent read timeouts for bursts of reads

2016-02-25 Thread Emīls Šolmanis
Having had a read through the archives, I missed this at first, but this
seems to be *exactly* like what we're experiencing.

http://www.mail-archive.com/user@cassandra.apache.org/msg46064.html

Only difference is we're getting this for reads and using CQL, but the
behaviour is identical.

On Thu, 25 Feb 2016 at 14:55 Emīls Šolmanis 
wrote:

> Hello,
>
> We're having a problem with concurrent requests. It seems that whenever we
> try resolving more
> than ~ 15 queries at the same time, one or two get a read timeout and then
> succeed on a retry.
>
> We're running Cassandra 2.2.4 accessed via the 2.1.9 Datastax driver on
> AWS.
>
> What we've found while investigating:
>
>  * this is not db-wide. Trying the same pattern against another table
> everything works fine.
>  * it fails 1 or 2 requests regardless of how many are executed in
> parallel, i.e., it's still 1 or 2 when we ramp it up to ~ 120 concurrent
> requests and doesn't seem to scale up.
>  * the problem is consistently reproducible. It happens both under heavier
> load and when just firing off a single batch of requests for testing.
>  * tracing the faulty requests says everything is great. An example trace:
> https://gist.github.com/emilssolmanis/41e1e2ecdfd9a0569b1a
>  * the only peculiar thing in the logs is there's no acknowledgement of
> the request being accepted by the server, as seen in
> https://gist.github.com/emilssolmanis/242d9d02a6d8fb91da8a
>  * there's nothing funny in the timed out Cassandra node's logs around
> that time as far as I can tell, not even in the debug logs.
>
> Any ideas about what might be causing this, pointers to server config
> options, or how else we might debug this would be much appreciated.
>
> Kind regards,
> Emils
>
>


Consistent read timeouts for bursts of reads

2016-02-25 Thread Emīls Šolmanis
Hello,

We're having a problem with concurrent requests. It seems that whenever we
try resolving more
than ~ 15 queries at the same time, one or two get a read timeout and then
succeed on a retry.

We're running Cassandra 2.2.4 accessed via the 2.1.9 Datastax driver on AWS.

What we've found while investigating:

 * this is not db-wide. Trying the same pattern against another table
everything works fine.
 * it fails 1 or 2 requests regardless of how many are executed in
parallel, i.e., it's still 1 or 2 when we ramp it up to ~ 120 concurrent
requests and doesn't seem to scale up.
 * the problem is consistently reproducible. It happens both under heavier
load and when just firing off a single batch of requests for testing.
 * tracing the faulty requests says everything is great. An example trace:
https://gist.github.com/emilssolmanis/41e1e2ecdfd9a0569b1a
 * the only peculiar thing in the logs is there's no acknowledgement of the
request being accepted by the server, as seen in
https://gist.github.com/emilssolmanis/242d9d02a6d8fb91da8a
 * there's nothing funny in the timed out Cassandra node's logs around that
time as far as I can tell, not even in the debug logs.

Any ideas about what might be causing this, pointers to server config
options, or how else we might debug this would be much appreciated.

Kind regards,
Emils
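
For what it's worth, the burst pattern described above is easy to reproduce with a small
harness that fires N reads through the async API and counts the timeouts. A minimal
sketch, assuming the DataStax Java driver already in use here; the class name and the
idea of passing in pre-chosen keys are placeholders.

import com.datastax.driver.core.*;
import com.datastax.driver.core.exceptions.ReadTimeoutException;
import java.util.ArrayList;
import java.util.List;

public class BurstReadRepro {
    // Fires one read per key all at once and reports how many of them time out,
    // mirroring the "1 or 2 out of ~15 fail, then succeed on retry" pattern.
    public static void runBurst(Session session, PreparedStatement ps, List<String> keys) {
        List<ResultSetFuture> futures = new ArrayList<>();
        for (String key : keys) {
            futures.add(session.executeAsync(ps.bind(key)));
        }
        int timeouts = 0;
        for (ResultSetFuture f : futures) {
            try {
                f.getUninterruptibly(); // block until this read completes
            } catch (ReadTimeoutException e) {
                timeouts++;
            }
        }
        System.out.printf("burst of %d reads -> %d timeouts%n", keys.size(), timeouts);
    }
}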


Read timeouts with ALLOW FILTERING turned on

2014-08-05 Thread Clint Kelly
Hi all,

Allow me to rephrase a question I asked last week.  I am performing some
queries with ALLOW FILTERING and getting consistent read timeouts like the
following:



com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra
timeout during read query at consistency ONE (1 responses were
required but only 0 replica responded)


These errors occur only during multi-row scans, and only during integration
tests on our build server.

I tried to see if I could replicate this error by reducing
read_request_timeout_in_ms when I run Cassandra on my local machine
(where I have not seen this error), but that is not working.  Are there any
other parameters that I need to adjust?  I'd feel better if I could at
least replicate this failure by reducing the read_request_timeout_in_ms
(since doing so would mean I actually understand what is going wrong...).

Best regards,
Clint


Re: Read timeouts with ALLOW FILTERING turned on

2014-08-05 Thread Robert Coli
On Tue, Aug 5, 2014 at 10:01 AM, Clint Kelly clint.ke...@gmail.com wrote:

 Allow me to rephrase a question I asked last week.  I am performing some
 queries with ALLOW FILTERING and getting consistent read timeouts like the
 following:


ALLOW FILTERING should be renamed PROBABLY TIMEOUT in order to properly
describe its typical performance.

As a general statement, if you have to ALLOW FILTERING, you are probably
Doing It Wrong in terms of schema design.

A correctly operated cluster is unlikely to need to increase the default
timeouts. If you find yourself needing to do so, you are, again, probably
Doing It Wrong.

=Rob


Re: Read timeouts with ALLOW FILTERING turned on

2014-08-05 Thread Sávio S . Teles de Oliveira
How much did you reduce *read_request_timeout_in_ms* on your local machine?
The Cassandra read timeout in a cluster is higher than on a single machine
because the Cassandra server must run the read operation across more servers
(so you have network traffic).


2014-08-05 14:54 GMT-03:00 Robert Coli rc...@eventbrite.com:

 On Tue, Aug 5, 2014 at 10:01 AM, Clint Kelly clint.ke...@gmail.com
 wrote:

 Allow me to rephrase a question I asked last week.  I am performing some
 queries with ALLOW FILTERING and getting consistent read timeouts like the
 following:


 ALLOW FILTERING should be renamed PROBABLY TIMEOUT in order to properly
 describe its typical performance.

 As a general statement, if you have to ALLOW FILTERING, you are probably
 Doing It Wrong in terms of schema design.

 A correctly operated cluster is unlikely to need to increase the default
 timeouts. If you find yourself needing to do so, you are, again, probably
 Doing It Wrong.

 =Rob




-- 
Atenciosamente,
Sávio S. Teles de Oliveira
voice: +55 62 9136 6996
http://br.linkedin.com/in/savioteles
Mestrando em Ciências da Computação - UFG
Arquiteto de Software
CUIA Internet Brasil


Re: Read timeouts with ALLOW FILTERING turned on

2014-08-05 Thread Clint Kelly
Hi Rob,

Thanks for your feedback.  I understand that use of ALLOW FILTERING is
not a best practice.  In this case, however, I am building a tool on
top of Cassandra that allows users to sometimes do things that are
less than optimal.  When they try to do expensive queries like this,
I'd rather provide a higher limit before timing out, but I can't seem
to change the behavior of Cassandra by tweaking any of the parameters
in the cassandra.yaml file or in the DataStax Java driver's Cluster
object.

FWIW these queries are also in batch jobs where we can tolerate the
extra latency.

Thanks for your help!

Best regards,
Clint
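
One knob that is easy to miss on the Cluster object is the driver-side socket read
timeout, 12 seconds by default, which is separate from the server-side
read_request_timeout_in_ms / range_request_timeout_in_ms in cassandra.yaml: the server
settings cap the coordinator, while the client setting only controls how long the driver
waits for it. A minimal sketch, assuming a 2.x/3.x DataStax Java driver; the contact
point and the 60s value are placeholders for a batch-oriented tool.

import com.datastax.driver.core.*;

public class BatchClusterFactory {
    // A Cluster tuned for batch/scan workloads: the driver waits longer per request
    // before giving up on the coordinator.
    public static Cluster build(String contactPoint) {
        SocketOptions socketOptions = new SocketOptions()
                .setReadTimeoutMillis(60_000); // placeholder value for slow batch scans
        return Cluster.builder()
                .addContactPoint(contactPoint)
                .withSocketOptions(socketOptions)
                .build();
    }

    public static void main(String[] args) {
        try (Cluster cluster = build("127.0.0.1");
             Session session = cluster.connect()) {
            Row row = session.execute("SELECT release_version FROM system.local").one();
            System.out.println("connected to Cassandra " + row.getString("release_version"));
        }
    }
}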


On Tue, Aug 5, 2014 at 10:54 AM, Robert Coli rc...@eventbrite.com wrote:
 On Tue, Aug 5, 2014 at 10:01 AM, Clint Kelly clint.ke...@gmail.com wrote:

 Allow me to rephrase a question I asked last week.  I am performing some
 queries with ALLOW FILTERING and getting consistent read timeouts like the
 following:


 ALLOW FILTERING should be renamed PROBABLY TIMEOUT in order to properly
 describe its typical performance.

 As a general statement, if you have to ALLOW FILTERING, you are probably
 Doing It Wrong in terms of schema design.

 A correctly operated cluster is unlikely to need to increase the default
 timeouts. If you find yourself needing to do so, you are, again, probably
 Doing It Wrong.

 =Rob


Re: Read timeouts with ALLOW FILTERING turned on

2014-08-05 Thread Clint Kelly
Ah FWIW I was able to reproduce the problem by reducing
range_request_timeout_in_ms.  This is great since I want to increase
the timeout for batch jobs where we scan a large set of rows, but
leave the timeout for single-row queries alone.

Best regards,
Clint


On Tue, Aug 5, 2014 at 11:42 AM, Clint Kelly clint.ke...@gmail.com wrote:
 Hi Rob,

 Thanks for your feedback.  I understand that use of ALLOW FILTERING is
 not a best practice.  In this case, however, I am building a tool on
 top of Cassandra that allows users to sometimes do things that are
 less than optimal.  When they try to do expensive queries like this,
 I'd rather provide a higher limit before timing out, but I can't seem
 to change the behavior of Cassandra by tweaking any of the parameters
 in the cassandra.yaml file or in the DataStax Java driver's Cluster
 object.

 FWIW these queries are also in batch jobs where we can tolerate the
 extra latency.

 Thanks for your help!

 Best regards,
 Clint


 On Tue, Aug 5, 2014 at 10:54 AM, Robert Coli rc...@eventbrite.com wrote:
 On Tue, Aug 5, 2014 at 10:01 AM, Clint Kelly clint.ke...@gmail.com wrote:

 Allow me to rephrase a question I asked last week.  I am performing some
 queries with ALLOW FILTERING and getting consistent read timeouts like the
 following:


 ALLOW FILTERING should be renamed PROBABLY TIMEOUT in order to properly
 describe its typical performance.

 As a general statement, if you have to ALLOW FILTERING, you are probably
 Doing It Wrong in terms of schema design.

 A correctly operated cluster is unlikely to need to increase the default
 timeouts. If you find yourself needing to do so, you are, again, probably
 Doing It Wrong.

 =Rob


Re: Read timeouts with ALLOW FILTERING turned on

2014-08-05 Thread Robert Coli
On Tue, Aug 5, 2014 at 11:53 AM, Clint Kelly clint.ke...@gmail.com wrote:

 Ah FWIW I was able to reproduce the problem by reducing
 range_request_timeout_in_ms.  This is great since I want to increase
 the timeout for batch jobs where we scan a large set of rows, but
 leave the timeout for single-row queries alone.


You have just explicated (a subset of) the reason the timeouts were broken
out.

https://issues.apache.org/jira/browse/CASSANDRA-2819

=Rob


Re: Occasional read timeouts seen during row scans

2014-08-04 Thread Clint Kelly
Hi all,

1. I saw this issue in an integration test with a single CassandraDaemon
running, so I don't think it was a time synchronization issue.

2. I did not look in the log for garbage collection issues, but I was able
to reproduce this 100% deterministically, so I think it was an issue having
to do with the organization and size of my data.  I have been unable to fix
this by retrying failed reads (because this behavior, when it occurs, is
deterministic).

I was looking for some kind of guidance on how to tune Cassandra to
increase or decrease this timeout threshold such that I can tolerate a
higher timeout in the cluster and so that I can reproduce this in some unit
or integration tests.

Also if anyone has any ideas on how my particular table layout might lead
to these kinds of problems, that would be great.  Thanks!

Best regards,
Clint
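
One way to get a deterministic fixture for an integration test is to write a single,
deliberately wide partition and then run the filtered scan against it, combined with
lowering range_request_timeout_in_ms rather than read_request_timeout_in_ms, which is
what ended up reproducing the ALLOW FILTERING timeouts in the thread further up. A rough
sketch, assuming the t_foo schema quoted later in this message and the DataStax Java
driver; the row count and the 64KB padding are arbitrary test values.

import com.datastax.driver.core.*;
import java.nio.ByteBuffer;

public class WidePartitionFixture {
    // Writes many clustering rows under one partition key so that a filtered scan
    // over that partition has a lot of data to skip, which is the suspected cause.
    public static void writeWidePartition(Session session, String eid, int rows) {
        PreparedStatement ps = session.prepare(
                "INSERT INTO kiji_it0.t_foo (eid_component, lg, family, qualifier, version, value) " +
                "VALUES (?, ?, ?, ?, ?, ?)");
        ByteBuffer padding = ByteBuffer.wrap(new byte[64 * 1024]); // 64KB per cell, arbitrary
        for (int i = 0; i < rows; i++) {
            session.execute(ps.bind(eid, "default",
                    ByteBuffer.wrap(("fam" + i).getBytes()),
                    ByteBuffer.wrap("qual".getBytes()),
                    (long) i,
                    padding.duplicate()));
        }
    }
}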




On Sat, Aug 2, 2014 at 4:40 AM, Jack Krupansky j...@basetechnology.com
wrote:

 Are you seeing garbage collections in the log at around the same time as
 these occasional timeouts?

 Can you identify which requests are timing out? And then can you try some
 of them again and see if they succeed at least sometimes and how long they
 take then?

 Do you have a test case that you believe does the worst case for
 filtering? How long does it take?

 Can you monitor if the timed-out node is compute bound or I/O bound at the
 times of failure? Do you see spikes for compute or I/O?

 Can your app simply retry the timed-out request? Does even a retry
 typically fail, or does retry get you to 100% success? I would note that
 even the best distributed systems do not guarantee zero failures for
 environmental issues, so apps need to tolerate occasional failures.

 -- Jack Krupansky

 -Original Message- From: Duncan Sands
 Sent: Saturday, August 2, 2014 7:04 AM
 To: user@cassandra.apache.org
 Subject: Re: Occasional read timeouts seen during row scans


 Hi Clint, is time correctly synchronized between your nodes?

 Ciao, Duncan.

 On 02/08/14 02:12, Clint Kelly wrote:

 BTW a few other details, sorry for omitting these:

   * We are using version 2.0.4 of the Java driver
   * We are running against Cassandra 2.0.9
   * I tried messing around with the page size (even reducing it down to a
 single
 record) and that didn't seem to help (in the cases where I was
 observing the
 timeout)

 Best regards,
 Clint



 On Fri, Aug 1, 2014 at 5:02 PM, Clint Kelly clint.ke...@gmail.com
 mailto:clint.ke...@gmail.com wrote:

 Hi everyone,

 I am seeing occasional read timeouts during multi-row queries, but I'm
 having difficulty reproducing them or understanding what the problem
 is.

 First, some background:

 Our team wrote a custom MapReduce InputFormat that looks pretty
 similar to the DataStax InputFormat except that it allows queries that
 touch multiple CQL tables with the same PRIMARY KEY format (it then
 assembles together results from multiple tables for the same primary
 key before sending them back to the user in the RecordReader).

 During a large batch job in a cluster and during some integration
 tests, we see errors like the following:

 com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra
 timeout during read query at consistency ONE (1 responses were
 required but only 0 replica responded)

 Our queries look like this:

 SELECT token(eid_component), eid_component, lg, family, qualifier,
 version, value FROM kiji_it0.t_foo WHERE lg=? AND family=? AND
 qualifier=?  AND token(eid_component) = ? AND token(eid_component) =
 ?ALLOW FILTERING;

 Our tables look like the following:

 CREATE TABLE kiji_it0.t_foo (
   eid_component varchar,
   lg varchar,
   family blob,
   qualifier blob,
   version bigint,
   value blob,
   PRIMARY KEY ((eid_component), lg, family, qualifier, version))
 WITH CLUSTERING ORDER BY (lg ASC, family ASC, qualifier ASC, version
 DESC);

 with an additional index on the lg column (the lg column is
 *extremely* low cardinality).

 (FWIW I realize that having ALLOW FILTERING is potentially a Very
 Bad Idea, but we are building a framework on top of Cassandra and
 MapReduce that allows our users to occasionally make queries like
 this.  We don't really mind taking a performance hit since these are
 batch jobs.  We are considering eventually supporting some automatic
 denormalization, but have not done so yet.)

 If I change the query above to remove the WHERE clauses, the errors
 go away.

 I think I understand the problem here---there are some rows that have
 huge amounts of data that we have to scan over, and occasionally those
 scans take so long that there is a timeout.

 I have a couple of questions:

 1. What parameters in my code or in the Cassandra cluster do I need to
 adjust to get rid of these timeouts?  Our table layout is designed

Re: Occasional read timeouts seen during row scans

2014-08-02 Thread Duncan Sands

Hi Clint, is time correctly synchronized between your nodes?

Ciao, Duncan.

On 02/08/14 02:12, Clint Kelly wrote:

BTW a few other details, sorry for omitting these:

  * We are using version 2.0.4 of the Java driver
  * We are running against Cassandra 2.0.9
  * I tried messing around with the page size (even reducing it down to a single
record) and that didn't seem to help (in the cases where I was observing the
timeout)

Best regards,
Clint



On Fri, Aug 1, 2014 at 5:02 PM, Clint Kelly clint.ke...@gmail.com
mailto:clint.ke...@gmail.com wrote:

Hi everyone,

I am seeing occasional read timeouts during multi-row queries, but I'm
having difficulty reproducing them or understanding what the problem
is.

First, some background:

Our team wrote a custom MapReduce InputFormat that looks pretty
similar to the DataStax InputFormat except that it allows queries that
touch multiple CQL tables with the same PRIMARY KEY format (it then
assembles together results from multiple tables for the same primary
key before sending them back to the user in the RecordReader).

During a large batch job in a cluster and during some integration
tests, we see errors like the following:

com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra
timeout during read query at consistency ONE (1 responses were
required but only 0 replica responded)

Our queries look like this:

SELECT token(eid_component), eid_component, lg, family, qualifier,
version, value FROM kiji_it0.t_foo WHERE lg=? AND family=? AND
qualifier=?  AND token(eid_component) = ? AND token(eid_component) =
?ALLOW FILTERING;

Our tables look like the following:

CREATE TABLE kiji_it0.t_foo (
  eid_component varchar,
  lg varchar,
  family blob,
  qualifier blob,
  version bigint,
  value blob,
  PRIMARY KEY ((eid_component), lg, family, qualifier, version))
WITH CLUSTERING ORDER BY (lg ASC, family ASC, qualifier ASC, version DESC);

with an additional index on the lg column (the lg column is
*extremely* low cardinality).

(FWIW I realize that having ALLOW FILTERING is potentially a Very
Bad Idea, but we are building a framework on top of Cassandra and
MapReduce that allows our users to occasionally make queries like
this.  We don't really mind taking a performance hit since these are
batch jobs.  We are considering eventually supporting some automatic
denormalization, but have not done so yet.)

If I change the query above to remove the WHERE clauses, the errors go away.

I think I understand the problem here---there are some rows that have
huge amounts of data that we have to scan over, and occasionally those
scans take so long that there is a timeout.

I have a couple of questions:

1. What parameters in my code or in the Cassandra cluster do I need to
adjust to get rid of these timeouts?  Our table layout is designed
such that its real-time performance should be pretty good, so I don't
mind if the batch queries are a little bit slow.  Do I need to change
the read_request_timeout_in_ms parameter?  Or something else?

2. I have tried to create a test to reproduce this problem, but I have
been unable to do so.  Any suggestions on how to do this?  I tried
creating a table similar to that described above and filling in a huge
amount of data for some rows to try to increase the amount of space
that we'd need to skip over.  I also tried reducing
read_request_timeout_in_ms from 5000 ms to 50 ms and still no dice.

Let me know if anyone has any thoughts or suggestions.  At a minimum
I'd like to be able to reproduce these read timeout errors in some
integration tests.

Thanks!

Best regards,
Clint






Re: Occasional read timeouts seen during row scans

2014-08-02 Thread Jack Krupansky
Are you seeing garbage collections in the log at around the same time as 
these occasional timeouts?


Can you identify which requests are timing out? And then can you try some of 
them again and see if they succeed at least sometimes and how long they take 
then?


Do you have a test case that you believe does the worst case for filtering? 
How long does it take?


Can you monitor if the timed-out node is compute bound or I/O bound at the 
times of failure? Do you see spikes for compute or I/O?


Can your app simply retry the timed-out request? Does even a retry typically 
fail, or does retry get you to 100% success? I would note that even the best 
distributed systems do not guarantee zero failures for environmental issues, 
so apps need to tolerate occasional failures.


-- Jack Krupansky

-Original Message- 
From: Duncan Sands

Sent: Saturday, August 2, 2014 7:04 AM
To: user@cassandra.apache.org
Subject: Re: Occasional read timeouts seen during row scans

Hi Clint, is time correctly synchronized between your nodes?

Ciao, Duncan.

On 02/08/14 02:12, Clint Kelly wrote:

BTW a few other details, sorry for omitting these:

  * We are using version 2.0.4 of the Java driver
  * We are running against Cassandra 2.0.9
  * I tried messing around with the page size (even reducing it down to a 
single
record) and that didn't seem to help (in the cases where I was 
observing the

timeout)

Best regards,
Clint



On Fri, Aug 1, 2014 at 5:02 PM, Clint Kelly clint.ke...@gmail.com
mailto:clint.ke...@gmail.com wrote:

Hi everyone,

I am seeing occasional read timeouts during multi-row queries, but I'm
having difficulty reproducing them or understanding what the problem
is.

First, some background:

Our team wrote a custom MapReduce InputFormat that looks pretty
similar to the DataStax InputFormat except that it allows queries that
touch multiple CQL tables with the same PRIMARY KEY format (it then
assembles together results from multiple tables for the same primary
key before sending them back to the user in the RecordReader).

During a large batch job in a cluster and during some integration
tests, we see errors like the following:

com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra
timeout during read query at consistency ONE (1 responses were
required but only 0 replica responded)

Our queries look like this:

SELECT token(eid_component), eid_component, lg, family, qualifier,
version, value FROM kiji_it0.t_foo WHERE lg=? AND family=? AND
qualifier=?  AND token(eid_component) = ? AND token(eid_component) =
?ALLOW FILTERING;

Our tables look like the following:

CREATE TABLE kiji_it0.t_foo (
  eid_component varchar,
  lg varchar,
  family blob,
  qualifier blob,
  version bigint,
  value blob,
  PRIMARY KEY ((eid_component), lg, family, qualifier, version))
WITH CLUSTERING ORDER BY (lg ASC, family ASC, qualifier ASC, version 
DESC);


with an additional index on the lg column (the lg column is
*extremely* low cardinality).

(FWIW I realize that having ALLOW FILTERING is potentially a Very
Bad Idea, but we are building a framework on top of Cassandra and
MapReduce that allows our users to occasionally make queries like
this.  We don't really mind taking a performance hit since these are
batch jobs.  We are considering eventually supporting some automatic
denormalization, but have not done so yet.)

If I change the query above to remove the WHERE clauses, the errors go 
away.


I think I understand the problem here---there are some rows that have
huge amounts of data that we have to scan over, and occasionally those
scans take so long that there is a timeout.

I have a couple of questions:

1. What parameters in my code or in the Cassandra cluster do I need to
adjust to get rid of these timeouts?  Our table layout is designed
such that its real-time performance should be pretty good, so I don't
mind if the batch queries are a little bit slow.  Do I need to change
the read_request_timeout_in_ms parameter?  Or something else?

2. I have tried to create a test to reproduce this problem, but I have
been unable to do so.  Any suggestions on how to do this?  I tried
creating a table similar to that described above and filling in a huge
amount of data for some rows to try to increase the amount of space
that we'd need to skip over.  I also tried reducing
read_request_timeout_in_ms from 5000 ms to 50 ms and still no dice.

Let me know if anyone has any thoughts or suggestions.  At a minimum
I'd like to be able to reproduce these read timeout errors in some
integration tests.

Thanks!

Best regards,
Clint






Occasional read timeouts seen during row scans

2014-08-01 Thread Clint Kelly
Hi everyone,

I am seeing occasional read timeouts during multi-row queries, but I'm
having difficulty reproducing them or understanding what the problem
is.

First, some background:

Our team wrote a custom MapReduce InputFormat that looks pretty
similar to the DataStax InputFormat except that it allows queries that
touch multiple CQL tables with the same PRIMARY KEY format (it then
assembles together results from multiple tables for the same primary
key before sending them back to the user in the RecordReader).

During a large batch job in a cluster and during some integration
tests, we see errors like the following:

com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra
timeout during read query at consistency ONE (1 responses were
required but only 0 replica responded)

Our queries look like this:

SELECT token(eid_component), eid_component, lg, family, qualifier,
version, value FROM kiji_it0.t_foo WHERE lg=? AND family=? AND
qualifier=?  AND token(eid_component) >= ? AND token(eid_component) <=
? ALLOW FILTERING;

Our tables look like the following:

CREATE TABLE kiji_it0.t_foo (
 eid_component varchar,
 lg varchar,
 family blob,
 qualifier blob,
 version bigint,
 value blob,
 PRIMARY KEY ((eid_component), lg, family, qualifier, version))
WITH CLUSTERING ORDER BY (lg ASC, family ASC, qualifier ASC, version DESC);

with an additional index on the lg column (the lg column is
*extremely* low cardinality).

(FWIW I realize that having ALLOW FILTERING is potentially a Very
Bad Idea, but we are building a framework on top of Cassandra and
MapReduce that allows our users to occasionally make queries like
this.  We don't really mind taking a performance hit since these are
batch jobs.  We are considering eventually supporting some automatic
denormalization, but have not done so yet.)

If I change the query above to remove the WHERE clauses, the errors go away.

I think I understand the problem here---there are some rows that have
huge amounts of data that we have to scan over, and occasionally those
scans take so long that there is a timeout.

I have a couple of questions:

1. What parameters in my code or in the Cassandra cluster do I need to
adjust to get rid of these timeouts?  Our table layout is designed
such that its real-time performance should be pretty good, so I don't
mind if the batch queries are a little bit slow.  Do I need to change
the read_request_timeout_in_ms parameter?  Or something else?

2. I have tried to create a test to reproduce this problem, but I have
been unable to do so.  Any suggestions on how to do this?  I tried
creating a table similar to that described above and filling in a huge
amount of data for some rows to try to increase the amount of space
that we'd need to skip over.  I also tried reducing
read_request_timeout_in_ms from 5000 ms to 50 ms and still no dice.

Let me know if anyone has any thoughts or suggestions.  At a minimum
I'd like to be able to reproduce these read timeout errors in some
integration tests.

Thanks!

Best regards,
Clint
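
For what it's worth, here is a rough sketch of how a token-range slice like the one above
can be driven from the Java driver with a small fetch size, so each round trip to the
coordinator stays well under the read timeout. The >=/<= bounds are the usual shape of a
token-range scan, and the lg/family/qualifier values are placeholders; none of the
literals come from the original job.

import com.datastax.driver.core.*;
import java.nio.ByteBuffer;

public class TokenRangeScan {
    // Scans one token slice of kiji_it0.t_foo with a small page size so that each
    // page request does a bounded amount of work on the server.
    public static void scanSlice(Session session, long startToken, long endToken) {
        PreparedStatement ps = session.prepare(
                "SELECT token(eid_component), eid_component, lg, family, qualifier, version, value " +
                "FROM kiji_it0.t_foo " +
                "WHERE lg = ? AND family = ? AND qualifier = ? " +
                "AND token(eid_component) >= ? AND token(eid_component) <= ? " +
                "ALLOW FILTERING");
        Statement stmt = ps.bind("default",              // lg (placeholder)
                        ByteBuffer.wrap(new byte[] {1}), // family blob (placeholder)
                        ByteBuffer.wrap(new byte[] {2}), // qualifier blob (placeholder)
                        startToken, endToken)
                .setFetchSize(100);                      // small pages
        for (Row row : session.execute(stmt)) {
            System.out.println(row.getString("eid_component"));
        }
    }
}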


Re: Occasional read timeouts seen during row scans

2014-08-01 Thread Clint Kelly
BTW a few other details, sorry for omitting these:


   - We are using version 2.0.4 of the Java driver
   - We are running against Cassandra 2.0.9
   - I tried messing around with the page size (even reducing it down to a
   single record) and that didn't seem to help (in the cases where I was
   observing the timeout)

Best regards,
Clint


On Fri, Aug 1, 2014 at 5:02 PM, Clint Kelly clint.ke...@gmail.com wrote:

 Hi everyone,

 I am seeing occasional read timeouts during multi-row queries, but I'm
 having difficulty reproducing them or understanding what the problem
 is.

 First, some background:

 Our team wrote a custom MapReduce InputFormat that looks pretty
 similar to the DataStax InputFormat except that it allows queries that
 touch multiple CQL tables with the same PRIMARY KEY format (it then
 assembles together results from multiple tables for the same primary
 key before sending them back to the user in the RecordReader).

 During a large batch job in a cluster and during some integration
 tests, we see errors like the following:

 com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra
 timeout during read query at consistency ONE (1 responses were
 required but only 0 replica responded)

 Our queries look like this:

 SELECT token(eid_component), eid_component, lg, family, qualifier,
 version, value FROM kiji_it0.t_foo WHERE lg=? AND family=? AND
 qualifier=?  AND token(eid_component) = ? AND token(eid_component) =
 ?ALLOW FILTERING;

 Our tables look like the following:

 CREATE TABLE kiji_it0.t_foo (
  eid_component varchar,
  lg varchar,
  family blob,
  qualifier blob,
  version bigint,
  value blob,
  PRIMARY KEY ((eid_component), lg, family, qualifier, version))
 WITH CLUSTERING ORDER BY (lg ASC, family ASC, qualifier ASC, version DESC);

 with an additional index on the lg column (the lg column is
 *extremely* low cardinality).

 (FWIW I realize that having ALLOW FILTERING is potentially a Very
 Bad Idea, but we are building a framework on top of Cassandra and
 MapReduce that allows our users to occasionally make queries like
 this.  We don't really mind taking a performance hit since these are
 batch jobs.  We are considering eventually supporting some automatic
 denormalization, but have not done so yet.)

 If I change the query above to remove the WHERE clauses, the errors go
 away.

 I think I understand the problem here---there are some rows that have
 huge amounts of data that we have to scan over, and occasionally those
 scans take so long that there is a timeout.

 I have a couple of questions:

 1. What parameters in my code or in the Cassandra cluster do I need to
 adjust to get rid of these timeouts?  Our table layout is designed
 such that its real-time performance should be pretty good, so I don't
 mind if the batch queries are a little bit slow.  Do I need to change
 the read_request_timeout_in_ms parameter?  Or something else?

 2. I have tried to create a test to reproduce this problem, but I have
 been unable to do so.  Any suggestions on how to do this?  I tried
 creating a table similar to that described above and filling in a huge
 amount of data for some rows to try to increase the amount of space
 that we'd need to skip over.  I also tried reducing
 read_request_timeout_in_ms from 5000 ms to 50 ms and still no dice.

 Let me know if anyone has any thoughts or suggestions.  At a minimum
 I'd like to be able to reproduce these read timeout errors in some
 integration tests.

 Thanks!

 Best regards,
 Clint



Re: Cassandra 2.0.2 - Frequent Read timeouts and delays in replication on 3-node cluster in AWS VPC

2013-11-21 Thread Steven A Robenalt
Looks like the read timeouts were a result of a bug that will be fixed in
2.0.3.

I found this question on the Datastax Java Driver mailing list:
https://groups.google.com/a/lists.datastax.com/forum/#!topic/java-driver-user/ao1ohSLpjRM

which led me to:
https://issues.apache.org/jira/browse/CASSANDRA-6299

I built and deployed a 2.0.3 snapshot this morning, which includes this
fix, and my cluster is now behaving normally (no read timeouts so far).



On Tue, Nov 19, 2013 at 4:55 PM, Steven A Robenalt srobe...@stanford.eduwrote:

 It seems that with NTP properly configured, the replication is now working
 as expected, but there are still a lot of read timeouts. The
 troubleshooting continues...


 On Tue, Nov 19, 2013 at 8:53 AM, Steven A Robenalt 
 srobe...@stanford.eduwrote:

 Thanks Michael, I will try that out.


 On Tue, Nov 19, 2013 at 5:28 AM, Laing, Michael 
 michael.la...@nytimes.com wrote:

 We had a similar problem when our nodes could not sync using ntp due to
 VPC ACL settings. -ml


 On Mon, Nov 18, 2013 at 8:49 PM, Steven A Robenalt 
 srobe...@stanford.edu wrote:

 Hi all,

 I am attempting to bring up our new app on a 3-node cluster and am
 having problems with frequent read timeouts and slow inter-node
 replication. Initially, these errors were mostly occurring in our app
 server, affecting 0.02%-1.0% of our queries in an otherwise unloaded
 cluster. No exceptions were logged on the servers in this case, and reads
 in a single node environment with the same code and client driver virtually
 never see exceptions like this, so I suspect problems with the
 inter-cluster communication between nodes.

 The 3 nodes are deployed in a single AWS VPC, and are all in a common
 subnet. The Cassandra version is 2.0.2 following an upgrade this past
 weekend due to NPEs in a secondary index that were affecting certain
 queries under 2.0.1. The servers are m1.large instances running AWS Linux
 and Oracle JDK7u40. The first 2 nodes in the cluster are the seed nodes.
 All database contents are CQL tables with replication factor of 3, and the
 application is Java-based, using the latest Datastax 2.0.0-rc1 Java Driver.

 In testing with the application, I noticed this afternoon that the
 contents of the 3 nodes differed in their respective copies of the same
 table for newly written data, for time periods exceeding several minutes,
 as reported by cqlsh on each node. Specifying different hosts from the same
 server using cqlsh also exhibited timeouts on multiple attempts to connect,
 and on executing some queries, though they eventually succeeded in all
 cases, and eventually the data in all nodes was fully replicated.

 The AWS servers have a security group with only ports 22, 7000, 9042,
 and 9160 open.

 At this time, it seems that either I am still missing something in my
 cluster configuration, or maybe there are other ports that are needed for
 inter-node communication.

 Any advice/suggestions would be appreciated.



 --
 Steve Robenalt
 Software Architect
 HighWire | Stanford University
 425 Broadway St, Redwood City, CA 94063

 srobe...@stanford.edu
 http://highwire.stanford.edu









 --
 Steve Robenalt
 Software Architect
 HighWire | Stanford University
 425 Broadway St, Redwood City, CA 94063

 srobe...@stanford.edu
 http://highwire.stanford.edu








 --
 Steve Robenalt
 Software Architect
 HighWire | Stanford University
 425 Broadway St, Redwood City, CA 94063

 srobe...@stanford.edu
 http://highwire.stanford.edu








-- 
Steve Robenalt
Software Architect
HighWire | Stanford University
425 Broadway St, Redwood City, CA 94063

srobe...@stanford.edu
http://highwire.stanford.edu


Re: Cassandra 2.0.2 - Frequent Read timeouts and delays in replication on 3-node cluster in AWS VPC

2013-11-19 Thread Steven A Robenalt
Thanks Michael, I will try that out.


On Tue, Nov 19, 2013 at 5:28 AM, Laing, Michael
michael.la...@nytimes.comwrote:

 We had a similar problem when our nodes could not sync using ntp due to
 VPC ACL settings. -ml


 On Mon, Nov 18, 2013 at 8:49 PM, Steven A Robenalt 
 srobe...@stanford.eduwrote:

 Hi all,

 I am attempting to bring up our new app on a 3-node cluster and am having
 problems with frequent read timeouts and slow inter-node replication.
 Initially, these errors were mostly occurring in our app server, affecting
 0.02%-1.0% of our queries in an otherwise unloaded cluster. No exceptions
 were logged on the servers in this case, and reads in a single node
 environment with the same code and client driver virtually never see
 exceptions like this, so I suspect problems with the inter-cluster
 communication between nodes.

 The 3 nodes are deployed in a single AWS VPC, and are all in a common
 subnet. The Cassandra version is 2.0.2 following an upgrade this past
 weekend due to NPEs in a secondary index that were affecting certain
 queries under 2.0.1. The servers are m1.large instances running AWS Linux
 and Oracle JDK7u40. The first 2 nodes in the cluster are the seed nodes.
 All database contents are CQL tables with replication factor of 3, and the
 application is Java-based, using the latest Datastax 2.0.0-rc1 Java Driver.

 In testing with the application, I noticed this afternoon that the
 contents of the 3 nodes differed in their respective copies of the same
 table for newly written data, for time periods exceeding several minutes,
 as reported by cqlsh on each node. Specifying different hosts from the same
 server using cqlsh also exhibited timeouts on multiple attempts to connect,
 and on executing some queries, though they eventually succeeded in all
 cases, and eventually the data in all nodes was fully replicated.

 The AWS servers have a security group with only ports 22, 7000, 9042, and
 9160 open.

 At this time, it seems that either I am still missing something in my
 cluster configuration, or maybe there are other ports that are needed for
 inter-node communication.

 Any advice/suggestions would be appreciated.



 --
 Steve Robenalt
 Software Architect
 HighWire | Stanford University
 425 Broadway St, Redwood City, CA 94063

 srobe...@stanford.edu
 http://highwire.stanford.edu









-- 
Steve Robenalt
Software Architect
HighWire | Stanford University
425 Broadway St, Redwood City, CA 94063

srobe...@stanford.edu
http://highwire.stanford.edu


Re: Cassandra 2.0.2 - Frequent Read timeouts and delays in replication on 3-node cluster in AWS VPC

2013-11-19 Thread Steven A Robenalt
It seems that with NTP properly configured, the replication is now working
as expected, but there are still a lot of read timeouts. The
troubleshooting continues...


On Tue, Nov 19, 2013 at 8:53 AM, Steven A Robenalt srobe...@stanford.eduwrote:

 Thanks Michael, I will try that out.


 On Tue, Nov 19, 2013 at 5:28 AM, Laing, Michael michael.la...@nytimes.com
  wrote:

 We had a similar problem when our nodes could not sync using ntp due to
 VPC ACL settings. -ml


 On Mon, Nov 18, 2013 at 8:49 PM, Steven A Robenalt srobe...@stanford.edu
  wrote:

 Hi all,

 I am attempting to bring up our new app on a 3-node cluster and am
 having problems with frequent read timeouts and slow inter-node
 replication. Initially, these errors were mostly occurring in our app
 server, affecting 0.02%-1.0% of our queries in an otherwise unloaded
 cluster. No exceptions were logged on the servers in this case, and reads
 in a single node environment with the same code and client driver virtually
 never see exceptions like this, so I suspect problems with the
 inter-cluster communication between nodes.

 The 3 nodes are deployed in a single AWS VPC, and are all in a common
 subnet. The Cassandra version is 2.0.2 following an upgrade this past
 weekend due to NPEs in a secondary index that were affecting certain
 queries under 2.0.1. The servers are m1.large instances running AWS Linux
 and Oracle JDK7u40. The first 2 nodes in the cluster are the seed nodes.
 All database contents are CQL tables with replication factor of 3, and the
 application is Java-based, using the latest Datastax 2.0.0-rc1 Java Driver.

 In testing with the application, I noticed this afternoon that the
 contents of the 3 nodes differed in their respective copies of the same
 table for newly written data, for time periods exceeding several minutes,
 as reported by cqlsh on each node. Specifying different hosts from the same
 server using cqlsh also exhibited timeouts on multiple attempts to connect,
 and on executing some queries, though they eventually succeeded in all
 cases, and eventually the data in all nodes was fully replicated.

 The AWS servers have a security group with only ports 22, 7000, 9042,
 and 9160 open.

 At this time, it seems that either I am still missing something in my
 cluster configuration, or maybe there are other ports that are needed for
 inter-node communication.

 Any advice/suggestions would be appreciated.



 --
 Steve Robenalt
 Software Architect
 HighWire | Stanford University
 425 Broadway St, Redwood City, CA 94063

 srobe...@stanford.edu
 http://highwire.stanford.edu









 --
 Steve Robenalt
 Software Architect
 HighWire | Stanford University
 425 Broadway St, Redwood City, CA 94063

 srobe...@stanford.edu
 http://highwire.stanford.edu








-- 
Steve Robenalt
Software Architect
HighWire | Stanford University
425 Broadway St, Redwood City, CA 94063

srobe...@stanford.edu
http://highwire.stanford.edu


Cassandra 2.0.2 - Frequent Read timeouts and delays in replication on 3-node cluster in AWS VPC

2013-11-18 Thread Steven A Robenalt
Hi all,

I am attempting to bring up our new app on a 3-node cluster and am having
problems with frequent read timeouts and slow inter-node replication.
Initially, these errors were mostly occurring in our app server, affecting
0.02%-1.0% of our queries in an otherwise unloaded cluster. No exceptions
were logged on the servers in this case, and reads in a single node
environment with the same code and client driver virtually never see
exceptions like this, so I suspect problems with the inter-cluster
communication between nodes.

The 3 nodes are deployed in a single AWS VPC, and are all in a common
subnet. The Cassandra version is 2.0.2 following an upgrade this past
weekend due to NPEs in a secondary index that were affecting certain
queries under 2.0.1. The servers are m1.large instances running AWS Linux
and Oracle JDK7u40. The first 2 nodes in the cluster are the seed nodes.
All database contents are CQL tables with replication factor of 3, and the
application is Java-based, using the latest Datastax 2.0.0-rc1 Java Driver.

In testing with the application, I noticed this afternoon that the contents
of the 3 nodes differed in their respective copies of the same table for
newly written data, for time periods exceeding several minutes, as reported
by cqlsh on each node. Specifying different hosts from the same server
using cqlsh also exhibited timeouts on multiple attempts to connect, and on
executing some queries, though they eventually succeeded in all cases, and
eventually the data in all nodes was fully replicated.

The AWS servers have a security group with only ports 22, 7000, 9042, and
9160 open.

At this time, it seems that either I am still missing something in my
cluster configuration, or maybe there are other ports that are needed for
inter-node communication.

Any advice/suggestions would be appreciated.



-- 
Steve Robenalt
Software Architect
HighWire | Stanford University
425 Broadway St, Redwood City, CA 94063

srobe...@stanford.edu
http://highwire.stanford.edu


Increased read timeouts during rolling upgrade to C* 1.2

2013-10-04 Thread Paulo Motta
Hello,

I have isolated one of our data centers to simulate a rolling restart
upgrade from C* 1.1.10 to 1.2.10. We replayed our production traffic to the
C* nodes during the upgrade and observed an increased number of read
timeouts during the upgrade process.

I executed nodetool drain before upgrading each node, and during the
upgrade nodetool ring was showing that node as DOWN, as expected. After
each upgrade all nodes were showing the upgraded node as UP, so apparently
all nodes were communicating fine.

I manually tried to insert and retrieve some data into both the newly
upgraded nodes and the old nodes, and the behavior was very unstable:
sometimes it worked, sometimes it didn't (TimedOutException), so I don't
think it was a network problem.

The number of read timeouts diminished as the number of upgraded nodes
increased, until it reached stability. The logs were showing the following
messages periodically:

 INFO [HANDSHAKE-/10.176.249.XX] 2013-10-03 17:36:16,948
OutboundTcpConnection.java (line 399) Handshaking version with
/10.176.249.XX
 INFO [HANDSHAKE-/10.176.182.YY] 2013-10-03 17:36:17,280
OutboundTcpConnection.java (line 408) Cannot handshake version with
/10.176.182.YY
 INFO [HANDSHAKE-/10.176.182.YY] 2013-10-03 17:36:17,280
OutboundTcpConnection.java (line 399) Handshaking version with
/10.176.182.YY
 INFO [HANDSHAKE-/10.188.13.ZZ] 2013-10-03 17:36:17,510
OutboundTcpConnection.java (line 408) Cannot handshake version with
/10.188.13.ZZ
 INFO [HANDSHAKE-/10.188.13.ZZ] 2013-10-03 17:36:17,511
OutboundTcpConnection.java (line 399) Handshaking version with /10.188.13.ZZ
DEBUG [WRITE-/54.215.70.YY] 2013-10-03 18:01:50,237
OutboundTcpConnection.java (line 338) Target max version is -2147483648; no
version information yet, will retry
TRACE [HANDSHAKE-/10.177.14.XX] 2013-10-03 18:01:50,237
OutboundTcpConnection.java (line 406) Cannot handshake version with
/10.177.14.XX
java.nio.channels.AsynchronousCloseException
at
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:185)
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:272)
 at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:176)
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
 at java.io.InputStream.read(InputStream.java:82)
 at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:64)
at java.io.DataInputStream.readInt(DataInputStream.java:370)
at
org.apache.cassandra.net.OutboundTcpConnection$1.run(OutboundTcpConnection.java:400)

Another fact is that the number of completed compaction tasks decreased as
the number of upgraded nodes increased. I don't know if that's related to
the increased number of read timeouts or just a coincidence. The timeout
configuration is the default (10,000 ms).

Two similar issues were reported, but without satisfactory responses:

-
http://stackoverflow.com/questions/15355115/rolling-upgrade-for-cassandra-1-0-9-cluster-to-1-2-1
- https://issues.apache.org/jira/browse/CASSANDRA-5740

Is that an expected behavior or is there something that might be going
wrong during the upgrade? Has anyone faced similar issues?

Any help would be very much appreciated.

Thanks,

Paulo


Re: Increased read timeouts during rolling upgrade to C* 1.2

2013-10-04 Thread Paulo Motta
One more piece of information to help troubleshooting the issue:

During the nodetool drain operation just before the upgrade, instead of
just stopping accepting new writes, the node actually shuts itself down.
This bug was also reported in this other thread:
http://mail-archives.apache.org/mod_mbox/cassandra-user/201303.mbox/%3CCAFDWQMTrYm7hBxXKoW8+eVKfNE6zvjW2h8_BSVGmOL7=grd...@mail.gmail.com%3E

Since I started Cassandra 1.2 only a few seconds before cassandra 1.1 died
(after the nodetool drain), I'm afraid there wasn't sufficient time for the
remaining nodes to update the metadata about the downed node. So when the
upgraded node was restarted, the metadata in the other nodes was still
referring to the previous version of the same node, so this may have caused
the handshake problem, and consequently the read timeout. Does that theory
make sense?


2013/10/4 Robert Coli rc...@eventbrite.com

 On Fri, Oct 4, 2013 at 9:09 AM, Paulo Motta pauloricard...@gmail.comwrote:

 I manually tried to insert and retrieve some data into both the newly
 upgraded nodes and the old nodes, and the behavior was very unstable:
 sometimes it worked, sometimes it didn't (TimedOutException), so I don't
 think it was a network problem.

 The number of read timeouts diminished as the number of upgraded nodes
 increased, until it reached stability. The logs were showing the following
 messages periodically:

 ...

 Two similar issues were reported, but without satisfactory responses:

 -
 http://stackoverflow.com/questions/15355115/rolling-upgrade-for-cassandra-1-0-9-cluster-to-1-2-1
 - https://issues.apache.org/jira/browse/CASSANDRA-5740


 Both of these issues relate to upgrading from 1.0.x to 1.2.x, which is
 not supported.

 Were I you, I would summarize the above experience in a JIRA ticket, as
 1.1.x to 1.2.x should be a supported operation and should not unexpectedly
 result in decreased availability during the upgrade.

 =Rob




-- 
Paulo Ricardo

-- 
European Master in Distributed Computing
Royal Institute of Technology - KTH
Instituto Superior Técnico - IST
http://paulormg.com


Re: kswapd0 causing read timeouts

2012-06-18 Thread Holger Hoffstaette
On Mon, 18 Jun 2012 11:57:17 -0700, Gurpreet Singh wrote:

 Thanks for all the information Holger.
 
 Will do the jvm updates, kernel updates will be slow to come by. I see
 that with disk access mode standard, the performance is stable and better
 than in mmap mode, so i will probably stick to that.

Please let us know how things work out.

 Are you suggesting i try out mongodb?

Uhm, no. :) I meant that it also uses mmap exclusively (!), and
consequently can also have pretty bad/irregular performance when the
(active) data set grows much larger than RAM. To be fair, that is a
pretty hard problem in general.

-h




Re: kswapd0 causing read timeouts

2012-06-14 Thread Gurpreet Singh
JNA is installed, swappiness was 0, and vfs_cache_pressure was 100. Two questions
on this:
1. Is there a way to find out whether mlockall really worked, other than the
"mlockall successful" log message?
2. Does Cassandra only mlock the JVM heap, or also the mmapped memory?

I disabled mmap completely, and things look so much better.
Latency is surprisingly half of what I see when I have mmap enabled.
It's funny that I keep reading tall claims about mmap, but in practice a lot
of people have problems with it, especially when it uses up all the memory. We
have tried mmap for different purposes in our company before, and had
finally ended up disabling it, because it just doesn't handle things right
when memory is low. Maybe /proc/sys/vm needs to be configured right, but
that's not the easiest of configurations to get right.

Right now, I am handling only 80 gigs of data. The kernel version is 2.6.26 and
the Java version is 1.6.21.
/G

On Wed, Jun 13, 2012 at 8:42 PM, Al Tobey a...@ooyala.com wrote:

 I would check /etc/sysctl.conf and get the values of
 /proc/sys/vm/swappiness and /proc/sys/vm/vfs_cache_pressure.

 If you don't have JNA enabled (which Cassandra uses to fadvise) and
 swappiness is at its default of 60, the Linux kernel will happily swap out
 your heap for cache space.  Set swappiness to 1 or 'swapoff -a' and kswapd
 shouldn't be doing much unless you have a too-large heap or some other app
 using up memory on the system.


 On Wed, Jun 13, 2012 at 11:30 AM, ruslan usifov 
 ruslan.usi...@gmail.comwrote:

 Hm, it's very strange what amount of you data? You linux kernel
 version? Java version?

 PS: i can suggest switch diskaccessmode to standart in you case
 PS:PS also upgrade you linux to latest, and javahotspot to 1.6.32
 (from oracle site)

 2012/6/13 Gurpreet Singh gurpreet.si...@gmail.com:
  Alright, here it goes again...
  Even with mmap_index_only, once the RES memory hit 15 gigs, the read
 latency
  went berserk. This happens in 12 hours if diskaccessmode is mmap, abt
 48 hrs
  if its mmap_index_only.
 
  only reads happening at 50 reads/second
  row cache size: 730 mb, row cache hit ratio: 0.75
  key cache size: 400 mb, key cache hit ratio: 0.4
  heap size (max 8 gigs): used 6.1-6.9 gigs
 
  No messages about reducing cache sizes in the logs
 
  stats:
  vmstat 1 : no swapping here, however high sys cpu utilization
  iostat (looks great) - avg-qu-sz = 8, avg await = 7 ms, svc time = 0.6,
 util
  = 15-30%
  top - VIRT - 19.8g, SHR - 6.1g, RES - 15g, high cpu, buffers - 2mb
  cfstats - 70-100 ms. This number used to be 20-30 ms.
 
  The value of the SHR keeps increasing (owing to mmap i guess), while at
 the
  same time buffers keeps decreasing. buffers starts as high as 50 mb, and
  goes down to 2 mb.
 
 
  This is very easily reproducible for me. Every time the RES memory hits
 abt
  15 gigs, the client starts getting timeouts from cassandra, the sys cpu
  jumps a lot. All this, even though my row cache hit ratio is almost
 0.75.
 
  Other than just turning off mmap completely, is there any other
 solution or
  setting to avoid a cassandra restart every cpl of days. Something to
 keep
  the RES memory to hit such a high number. I have been constantly
 monitoring
  the RES, was not seeing issues when RES was at 14 gigs.
  /G
 
  On Fri, Jun 8, 2012 at 10:02 PM, Gurpreet Singh 
 gurpreet.si...@gmail.com
  wrote:
 
  Aaron, Ruslan,
  I changed the disk access mode to mmap_index_only, and it has been
 stable
  ever since, well at least for the past 20 hours. Previously, in abt
 10-12
  hours, as soon as the resident memory was full, the client would start
  timing out on all its reads. It looks fine for now, i am going to let
 it
  continue to see how long it lasts and if the problem comes again.
 
  Aaron,
  yes, i had turned swap off.
 
  The total cpu utilization was at 700% roughly.. It looked like kswapd0
 was
  using just 1 cpu, but cassandra (jsvc) cpu utilization increased quite
 a
  bit. top was reporting high system cpu, and low user cpu.
  vmstat was not showing swapping. java heap size max is 8 gigs. while
 only
  4 gigs was in use, so java heap was doing great. no gc in the logs.
 iostat
  was doing ok from what i remember, i will have to reproduce the issue
 for
  the exact numbers.
 
  cfstats latency had gone very high, but that is partly due to high cpu
  usage.
 
  One thing was clear, that the SHR was inching higher (due to the mmap)
  while buffer cache which started at abt 20-25mb reduced to 2 MB by the
 end,
  which probably means that pagecache was being evicted by the kswapd0.
 Is
  there a way to fix the size of the buffer cache and not let system
 evict it
  in favour of mmap?
 
  Also, mmapping data files would basically cause not only the data
 (asked
  for) to be read into main memory, but also a bunch of extra pages
  (readahead), which would not be very useful, right? The same thing for
 index
  would actually be more useful, as there would be more index entries in
 the
  readahead 

Re: kswapd0 causing read timeouts

2012-06-14 Thread ruslan usifov
Upgrade Java (version 1.6.21 has memory leaks) to the latest 1.6.32. It is
abnormal that you have 15 gigs of index for 80 gigs of data.

vfs_cache_pressure is used for inodes and dentries.

Also, to check whether you have memory leaks, use the drop_caches sysctl.





2012/6/14 Gurpreet Singh gurpreet.si...@gmail.com:
 JNA is installed. swappiness was 0. vfs_cache_pressure was 100. 2 questions
 on this..
 1. Is there a way to find out if mlockall really worked other than just the
 mlockall successful log message?
 2. Does cassandra only mlock the jvm heap or also the mmaped memory?

 I disabled mmap completely, and things look so much better.
 latency is surprisingly half of what i see when i have mmap enabled.
 Its funny that i keep reading tall claims abt mmap, but in practise a lot of
 ppl have problems with it, especially when it uses up all the memory. We
 have tried mmap for different purposes in our company before,and had finally
 ended up disabling it, because it just doesnt handle things right when
 memory is low. Maybe the proc/sys/vm needs to be configured right, but thats
 not the easiest of configurations to get right.

 Right now, i am handling only 80 gigs of data. kernel version is 2.6.26.
 java version is 1.6.21
 /G


 On Wed, Jun 13, 2012 at 8:42 PM, Al Tobey a...@ooyala.com wrote:

 I would check /etc/sysctl.conf and get the values of
 /proc/sys/vm/swappiness and /proc/sys/vm/vfs_cache_pressure.

 If you don't have JNA enabled (which Cassandra uses to fadvise) and
 swappiness is at its default of 60, the Linux kernel will happily swap out
 your heap for cache space.  Set swappiness to 1 or 'swapoff -a' and kswapd
 shouldn't be doing much unless you have a too-large heap or some other app
 using up memory on the system.


 On Wed, Jun 13, 2012 at 11:30 AM, ruslan usifov ruslan.usi...@gmail.com
 wrote:

 Hm, it's very strange what amount of you data? You linux kernel
 version? Java version?

 PS: i can suggest switch diskaccessmode to standart in you case
 PS:PS also upgrade you linux to latest, and javahotspot to 1.6.32
 (from oracle site)

 2012/6/13 Gurpreet Singh gurpreet.si...@gmail.com:
  Alright, here it goes again...
  Even with mmap_index_only, once the RES memory hit 15 gigs, the read
  latency
  went berserk. This happens in 12 hours if diskaccessmode is mmap, abt
  48 hrs
  if its mmap_index_only.
 
  only reads happening at 50 reads/second
  row cache size: 730 mb, row cache hit ratio: 0.75
  key cache size: 400 mb, key cache hit ratio: 0.4
  heap size (max 8 gigs): used 6.1-6.9 gigs
 
  No messages about reducing cache sizes in the logs
 
  stats:
  vmstat 1 : no swapping here, however high sys cpu utilization
  iostat (looks great) - avg-qu-sz = 8, avg await = 7 ms, svc time = 0.6,
  util
  = 15-30%
  top - VIRT - 19.8g, SHR - 6.1g, RES - 15g, high cpu, buffers - 2mb
  cfstats - 70-100 ms. This number used to be 20-30 ms.
 
  The value of the SHR keeps increasing (owing to mmap i guess), while at
  the
  same time buffers keeps decreasing. buffers starts as high as 50 mb,
  and
  goes down to 2 mb.
 
 
  This is very easily reproducible for me. Every time the RES memory hits
  abt
  15 gigs, the client starts getting timeouts from cassandra, the sys cpu
  jumps a lot. All this, even though my row cache hit ratio is almost
  0.75.
 
  Other than just turning off mmap completely, is there any other
  solution or
  setting to avoid a cassandra restart every cpl of days. Something to
  keep
  the RES memory to hit such a high number. I have been constantly
  monitoring
  the RES, was not seeing issues when RES was at 14 gigs.
  /G
 
  On Fri, Jun 8, 2012 at 10:02 PM, Gurpreet Singh
  gurpreet.si...@gmail.com
  wrote:
 
  Aaron, Ruslan,
  I changed the disk access mode to mmap_index_only, and it has been
  stable
  ever since, well at least for the past 20 hours. Previously, in abt
  10-12
  hours, as soon as the resident memory was full, the client would start
  timing out on all its reads. It looks fine for now, i am going to let
  it
  continue to see how long it lasts and if the problem comes again.
 
  Aaron,
  yes, i had turned swap off.
 
  The total cpu utilization was at 700% roughly.. It looked like kswapd0
  was
  using just 1 cpu, but cassandra (jsvc) cpu utilization increased quite
  a
  bit. top was reporting high system cpu, and low user cpu.
  vmstat was not showing swapping. java heap size max is 8 gigs. while
  only
  4 gigs was in use, so java heap was doing great. no gc in the logs.
  iostat
  was doing ok from what i remember, i will have to reproduce the issue
  for
  the exact numbers.
 
  cfstats latency had gone very high, but that is partly due to high cpu
  usage.
 
  One thing was clear, that the SHR was inching higher (due to the mmap)
  while buffer cache which started at abt 20-25mb reduced to 2 MB by the
  end,
  which probably means that pagecache was being evicted by the kswapd0.
  Is
  there a way to fix the size of the buffer cache and not let system
  evict it
  in favour of 

Re: kswapd0 causing read timeouts

2012-06-14 Thread ruslan usifov
2012/6/14 Gurpreet Singh gurpreet.si...@gmail.com:
 JNA is installed. swappiness was 0. vfs_cache_pressure was 100. 2 questions
 on this..
 1. Is there a way to find out if mlockall really worked other than just the
 mlockall successful log message?
Yes, you should see something like this (from our test server):

 INFO [main] 2012-06-14 02:03:14,745 DatabaseDescriptor.java (line
233) Global memtable threshold is enabled at 512MB


 2. Does cassandra only mlock the jvm heap or also the mmaped memory?

Cassandra obviously only mlocks the heap; it does not mlock the mmapped sstables.



 I disabled mmap completely, and things look so much better.
 latency is surprisingly half of what i see when i have mmap enabled.
 Its funny that i keep reading tall claims abt mmap, but in practise a lot of
 ppl have problems with it, especially when it uses up all the memory. We
 have tried mmap for different purposes in our company before,and had finally
 ended up disabling it, because it just doesnt handle things right when
 memory is low. Maybe the proc/sys/vm needs to be configured right, but thats
 not the easiest of configurations to get right.

 Right now, i am handling only 80 gigs of data. kernel version is 2.6.26.
 java version is 1.6.21
 /G


 On Wed, Jun 13, 2012 at 8:42 PM, Al Tobey a...@ooyala.com wrote:

 I would check /etc/sysctl.conf and get the values of
 /proc/sys/vm/swappiness and /proc/sys/vm/vfs_cache_pressure.

 If you don't have JNA enabled (which Cassandra uses to fadvise) and
 swappiness is at its default of 60, the Linux kernel will happily swap out
 your heap for cache space.  Set swappiness to 1 or 'swapoff -a' and kswapd
 shouldn't be doing much unless you have a too-large heap or some other app
 using up memory on the system.


 On Wed, Jun 13, 2012 at 11:30 AM, ruslan usifov ruslan.usi...@gmail.com
 wrote:

 Hm, it's very strange what amount of you data? You linux kernel
 version? Java version?

 PS: i can suggest switch diskaccessmode to standart in you case
 PS:PS also upgrade you linux to latest, and javahotspot to 1.6.32
 (from oracle site)

 2012/6/13 Gurpreet Singh gurpreet.si...@gmail.com:
  Alright, here it goes again...
  Even with mmap_index_only, once the RES memory hit 15 gigs, the read
  latency
  went berserk. This happens in 12 hours if diskaccessmode is mmap, abt
  48 hrs
  if its mmap_index_only.
 
  only reads happening at 50 reads/second
  row cache size: 730 mb, row cache hit ratio: 0.75
  key cache size: 400 mb, key cache hit ratio: 0.4
  heap size (max 8 gigs): used 6.1-6.9 gigs
 
  No messages about reducing cache sizes in the logs
 
  stats:
  vmstat 1 : no swapping here, however high sys cpu utilization
  iostat (looks great) - avg-qu-sz = 8, avg await = 7 ms, svc time = 0.6,
  util
  = 15-30%
  top - VIRT - 19.8g, SHR - 6.1g, RES - 15g, high cpu, buffers - 2mb
  cfstats - 70-100 ms. This number used to be 20-30 ms.
 
  The value of the SHR keeps increasing (owing to mmap i guess), while at
  the
  same time buffers keeps decreasing. buffers starts as high as 50 mb,
  and
  goes down to 2 mb.
 
 
  This is very easily reproducible for me. Every time the RES memory hits
  abt
  15 gigs, the client starts getting timeouts from cassandra, the sys cpu
  jumps a lot. All this, even though my row cache hit ratio is almost
  0.75.
 
  Other than just turning off mmap completely, is there any other
  solution or
  setting to avoid a cassandra restart every cpl of days. Something to
  keep
  the RES memory to hit such a high number. I have been constantly
  monitoring
  the RES, was not seeing issues when RES was at 14 gigs.
  /G
 
  On Fri, Jun 8, 2012 at 10:02 PM, Gurpreet Singh
  gurpreet.si...@gmail.com
  wrote:
 
  Aaron, Ruslan,
  I changed the disk access mode to mmap_index_only, and it has been
  stable
  ever since, well at least for the past 20 hours. Previously, in abt
  10-12
  hours, as soon as the resident memory was full, the client would start
  timing out on all its reads. It looks fine for now, i am going to let
  it
  continue to see how long it lasts and if the problem comes again.
 
  Aaron,
  yes, i had turned swap off.
 
  The total cpu utilization was at 700% roughly.. It looked like kswapd0
  was
  using just 1 cpu, but cassandra (jsvc) cpu utilization increased quite
  a
  bit. top was reporting high system cpu, and low user cpu.
  vmstat was not showing swapping. java heap size max is 8 gigs. while
  only
  4 gigs was in use, so java heap was doing great. no gc in the logs.
  iostat
  was doing ok from what i remember, i will have to reproduce the issue
  for
  the exact numbers.
 
  cfstats latency had gone very high, but that is partly due to high cpu
  usage.
 
  One thing was clear, that the SHR was inching higher (due to the mmap)
  while buffer cache which started at abt 20-25mb reduced to 2 MB by the
  end,
  which probably means that pagecache was being evicted by the kswapd0.
  Is
  there a way to fix the size of the buffer cache and not let system
  evict it
 

Re: kswapd0 causing read timeouts

2012-06-14 Thread ruslan usifov
Sorry, I was mistaken; here is the right string:

 INFO [main] 2012-06-14 02:03:14,520 CLibrary.java (line 109) JNA
mlockall successful




2012/6/15 ruslan usifov ruslan.usi...@gmail.com:
 2012/6/14 Gurpreet Singh gurpreet.si...@gmail.com:
 JNA is installed. swappiness was 0. vfs_cache_pressure was 100. 2 questions
 on this..
 1. Is there a way to find out if mlockall really worked other than just the
 mlockall successful log message?
 yes you must see something like this (from our test server):

  INFO [main] 2012-06-14 02:03:14,745 DatabaseDescriptor.java (line
 233) Global memtable threshold is enabled at 512MB


 2. Does cassandra only mlock the jvm heap or also the mmaped memory?

 Cassandra obviously mlock only heap, and doesn't mmaped sstables



 I disabled mmap completely, and things look so much better.
 latency is surprisingly half of what i see when i have mmap enabled.
 Its funny that i keep reading tall claims abt mmap, but in practise a lot of
 ppl have problems with it, especially when it uses up all the memory. We
 have tried mmap for different purposes in our company before,and had finally
 ended up disabling it, because it just doesnt handle things right when
 memory is low. Maybe the proc/sys/vm needs to be configured right, but thats
 not the easiest of configurations to get right.

 Right now, i am handling only 80 gigs of data. kernel version is 2.6.26.
 java version is 1.6.21
 /G


 On Wed, Jun 13, 2012 at 8:42 PM, Al Tobey a...@ooyala.com wrote:

 I would check /etc/sysctl.conf and get the values of
 /proc/sys/vm/swappiness and /proc/sys/vm/vfs_cache_pressure.

 If you don't have JNA enabled (which Cassandra uses to fadvise) and
 swappiness is at its default of 60, the Linux kernel will happily swap out
 your heap for cache space.  Set swappiness to 1 or 'swapoff -a' and kswapd
 shouldn't be doing much unless you have a too-large heap or some other app
 using up memory on the system.


 On Wed, Jun 13, 2012 at 11:30 AM, ruslan usifov ruslan.usi...@gmail.com
 wrote:

 Hm, it's very strange what amount of you data? You linux kernel
 version? Java version?

 PS: i can suggest switch diskaccessmode to standart in you case
 PS:PS also upgrade you linux to latest, and javahotspot to 1.6.32
 (from oracle site)

 2012/6/13 Gurpreet Singh gurpreet.si...@gmail.com:
  Alright, here it goes again...
  Even with mmap_index_only, once the RES memory hit 15 gigs, the read
  latency
  went berserk. This happens in 12 hours if diskaccessmode is mmap, abt
  48 hrs
  if its mmap_index_only.
 
  only reads happening at 50 reads/second
  row cache size: 730 mb, row cache hit ratio: 0.75
  key cache size: 400 mb, key cache hit ratio: 0.4
  heap size (max 8 gigs): used 6.1-6.9 gigs
 
  No messages about reducing cache sizes in the logs
 
  stats:
  vmstat 1 : no swapping here, however high sys cpu utilization
  iostat (looks great) - avg-qu-sz = 8, avg await = 7 ms, svc time = 0.6,
  util
  = 15-30%
  top - VIRT - 19.8g, SHR - 6.1g, RES - 15g, high cpu, buffers - 2mb
  cfstats - 70-100 ms. This number used to be 20-30 ms.
 
  The value of the SHR keeps increasing (owing to mmap i guess), while at
  the
  same time buffers keeps decreasing. buffers starts as high as 50 mb,
  and
  goes down to 2 mb.
 
 
  This is very easily reproducible for me. Every time the RES memory hits
  abt
  15 gigs, the client starts getting timeouts from cassandra, the sys cpu
  jumps a lot. All this, even though my row cache hit ratio is almost
  0.75.
 
  Other than just turning off mmap completely, is there any other
  solution or
  setting to avoid a cassandra restart every cpl of days. Something to
  keep
  the RES memory to hit such a high number. I have been constantly
  monitoring
  the RES, was not seeing issues when RES was at 14 gigs.
  /G
 
  On Fri, Jun 8, 2012 at 10:02 PM, Gurpreet Singh
  gurpreet.si...@gmail.com
  wrote:
 
  Aaron, Ruslan,
  I changed the disk access mode to mmap_index_only, and it has been
  stable
  ever since, well at least for the past 20 hours. Previously, in abt
  10-12
  hours, as soon as the resident memory was full, the client would start
  timing out on all its reads. It looks fine for now, i am going to let
  it
  continue to see how long it lasts and if the problem comes again.
 
  Aaron,
  yes, i had turned swap off.
 
  The total cpu utilization was at 700% roughly.. It looked like kswapd0
  was
  using just 1 cpu, but cassandra (jsvc) cpu utilization increased quite
  a
  bit. top was reporting high system cpu, and low user cpu.
  vmstat was not showing swapping. java heap size max is 8 gigs. while
  only
  4 gigs was in use, so java heap was doing great. no gc in the logs.
  iostat
  was doing ok from what i remember, i will have to reproduce the issue
  for
  the exact numbers.
 
  cfstats latency had gone very high, but that is partly due to high cpu
  usage.
 
  One thing was clear, that the SHR was inching higher (due to the mmap)
  while buffer cache which started at abt 20-25mb 

Re: kswapd0 causing read timeouts

2012-06-13 Thread Gurpreet Singh
Alright, here it goes again...
Even with mmap_index_only, once the RES memory hit 15 gigs, the read
latency went berserk. This happens in 12 hours if disk_access_mode is mmap,
and in about 48 hours if it is mmap_index_only.

only reads happening at 50 reads/second
row cache size: 730 mb, row cache hit ratio: 0.75
key cache size: 400 mb, key cache hit ratio: 0.4
heap size (max 8 gigs): used 6.1-6.9 gigs

No messages about reducing cache sizes in the logs

stats:
vmstat 1 : no swapping here, however high sys cpu utilization
iostat (looks great) - avg-qu-sz = 8, avg await = 7 ms, svc time = 0.6,
util = 15-30%
top - VIRT - 19.8g, SHR - 6.1g, RES - 15g, high cpu, buffers - 2mb
cfstats - 70-100 ms. This number used to be 20-30 ms.

The value of SHR keeps increasing (owing to mmap, I guess), while at the
same time buffers keep decreasing: buffers start as high as 50 mb and go
down to 2 mb.


This is very easily reproducible for me. Every time the RES memory hits about
15 gigs, the client starts getting timeouts from Cassandra and the sys CPU
jumps a lot. All this, even though my row cache hit ratio is almost 0.75.

Other than just turning off mmap completely, is there any other solution or
setting to avoid a Cassandra restart every couple of days, something to keep
the RES memory from hitting such a high number? I have been constantly monitoring
the RES and was not seeing issues when RES was at 14 gigs.
/G

On Fri, Jun 8, 2012 at 10:02 PM, Gurpreet Singh gurpreet.si...@gmail.comwrote:

 Aaron, Ruslan,
 I changed the disk access mode to mmap_index_only, and it has been stable
 ever since, well at least for the past 20 hours. Previously, in abt 10-12
 hours, as soon as the resident memory was full, the client would start
 timing out on all its reads. It looks fine for now, i am going to let it
 continue to see how long it lasts and if the problem comes again.

 Aaron,
 yes, i had turned swap off.

 The total cpu utilization was at 700% roughly.. It looked like kswapd0 was
 using just 1 cpu, but cassandra (jsvc) cpu utilization increased quite a
 bit. top was reporting high system cpu, and low user cpu.
 vmstat was not showing swapping. java heap size max is 8 gigs. while only
 4 gigs was in use, so java heap was doing great. no gc in the logs. iostat
 was doing ok from what i remember, i will have to reproduce the issue for
 the exact numbers.

 cfstats latency had gone very high, but that is partly due to high cpu
 usage.

 One thing was clear, that the SHR was inching higher (due to the mmap)
 while buffer cache which started at abt 20-25mb reduced to 2 MB by the end,
 which probably means that pagecache was being evicted by the kswapd0. Is
 there a way to fix the size of the buffer cache and not let system evict it
 in favour of mmap?

 Also, mmapping data files would basically cause not only the data (asked
 for) to be read into main memory, but also a bunch of extra pages
 (readahead), which would not be very useful, right? The same thing for
 index would actually be more useful, as there would be more index entries
 in the readahead part.. and the index files being small wouldnt cause
 memory pressure that page cache would be evicted. mmapping the data files
 would make sense if the data size is smaller than the RAM or the hot data
 set is smaller than the RAM, otherwise just the index would probably be a
 better thing to mmap, no?. In my case data size is 85 gigs, while available
 RAM is 16 gigs (only 8 gigs after heap).

 /G


 On Fri, Jun 8, 2012 at 11:44 AM, aaron morton aa...@thelastpickle.comwrote:

 Ruslan,
 Why did you suggest changing the disk_access_mode ?

 Gurpreet,
 I would leave the disk_access_mode with the default until you have a
 reason to change it.

   8 core, 16 gb ram, 6 data disks raid0, no swap configured

 is swap disabled ?

  Gradually,
  the system cpu becomes high almost 70%, and the client starts getting
  continuous timeouts

 70% of one core or 70% of all cores ?
 Check the server logs, is there GC activity ?
 check nodetool cfstats to see the read latency for the cf.

 Take a look at vmstat to see if you are swapping, and look at iostats to
 see if io is the problem
 http://spyced.blogspot.co.nz/2010/01/linux-performance-basics.html

 Cheers

   -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 8/06/2012, at 9:00 PM, Gurpreet Singh wrote:

 Thanks Ruslan.
 I will try the mmap_index_only.
 Is there any guideline as to when to leave it to auto and when to use
 mmap_index_only?

 /G

 On Fri, Jun 8, 2012 at 1:21 AM, ruslan usifov ruslan.usi...@gmail.comwrote:

 disk_access_mode: mmap??

 set to disk_access_mode: mmap_index_only in cassandra yaml

 2012/6/8 Gurpreet Singh gurpreet.si...@gmail.com:
  Hi,
  I am testing cassandra 1.1 on a 1 node cluster.
  8 core, 16 gb ram, 6 data disks raid0, no swap configured
 
  cassandra 1.1.1
  heap size: 8 gigs
  key cache size in mb: 800 (used only 200mb till now)
  memtable_total_space_in_mb : 2048
 
  I am 

Re: kswapd0 causing read timeouts

2012-06-13 Thread ruslan usifov
Hm, it's very strange. What is the amount of your data? Your Linux kernel
version? Java version?

PS: I can suggest switching disk_access_mode to standard in your case.
PPS: Also upgrade your Linux to the latest and Java HotSpot to 1.6.32
(from the Oracle site).

2012/6/13 Gurpreet Singh gurpreet.si...@gmail.com:
 Alright, here it goes again...
 Even with mmap_index_only, once the RES memory hit 15 gigs, the read latency
 went berserk. This happens in 12 hours if diskaccessmode is mmap, abt 48 hrs
 if its mmap_index_only.

 only reads happening at 50 reads/second
 row cache size: 730 mb, row cache hit ratio: 0.75
 key cache size: 400 mb, key cache hit ratio: 0.4
 heap size (max 8 gigs): used 6.1-6.9 gigs

 No messages about reducing cache sizes in the logs

 stats:
 vmstat 1 : no swapping here, however high sys cpu utilization
 iostat (looks great) - avg-qu-sz = 8, avg await = 7 ms, svc time = 0.6, util
 = 15-30%
 top - VIRT - 19.8g, SHR - 6.1g, RES - 15g, high cpu, buffers - 2mb
 cfstats - 70-100 ms. This number used to be 20-30 ms.

 The value of the SHR keeps increasing (owing to mmap i guess), while at the
 same time buffers keeps decreasing. buffers starts as high as 50 mb, and
 goes down to 2 mb.


 This is very easily reproducible for me. Every time the RES memory hits abt
 15 gigs, the client starts getting timeouts from cassandra, the sys cpu
 jumps a lot. All this, even though my row cache hit ratio is almost 0.75.

 Other than just turning off mmap completely, is there any other solution or
 setting to avoid a cassandra restart every cpl of days. Something to keep
 the RES memory to hit such a high number. I have been constantly monitoring
 the RES, was not seeing issues when RES was at 14 gigs.
 /G

 On Fri, Jun 8, 2012 at 10:02 PM, Gurpreet Singh gurpreet.si...@gmail.com
 wrote:

 Aaron, Ruslan,
 I changed the disk access mode to mmap_index_only, and it has been stable
 ever since, well at least for the past 20 hours. Previously, in abt 10-12
 hours, as soon as the resident memory was full, the client would start
 timing out on all its reads. It looks fine for now, i am going to let it
 continue to see how long it lasts and if the problem comes again.

 Aaron,
 yes, i had turned swap off.

 The total cpu utilization was at 700% roughly.. It looked like kswapd0 was
 using just 1 cpu, but cassandra (jsvc) cpu utilization increased quite a
 bit. top was reporting high system cpu, and low user cpu.
 vmstat was not showing swapping. java heap size max is 8 gigs. while only
 4 gigs was in use, so java heap was doing great. no gc in the logs. iostat
 was doing ok from what i remember, i will have to reproduce the issue for
 the exact numbers.

 cfstats latency had gone very high, but that is partly due to high cpu
 usage.

 One thing was clear, that the SHR was inching higher (due to the mmap)
 while buffer cache which started at abt 20-25mb reduced to 2 MB by the end,
 which probably means that pagecache was being evicted by the kswapd0. Is
 there a way to fix the size of the buffer cache and not let system evict it
 in favour of mmap?

 Also, mmapping data files would basically cause not only the data (asked
 for) to be read into main memory, but also a bunch of extra pages
 (readahead), which would not be very useful, right? The same thing for index
 would actually be more useful, as there would be more index entries in the
 readahead part.. and the index files being small wouldnt cause memory
 pressure that page cache would be evicted. mmapping the data files would
 make sense if the data size is smaller than the RAM or the hot data set is
 smaller than the RAM, otherwise just the index would probably be a better
 thing to mmap, no?. In my case data size is 85 gigs, while available RAM is
 16 gigs (only 8 gigs after heap).

 /G


 On Fri, Jun 8, 2012 at 11:44 AM, aaron morton aa...@thelastpickle.com
 wrote:

 Ruslan,
 Why did you suggest changing the disk_access_mode ?

 Gurpreet,
 I would leave the disk_access_mode with the default until you have a
 reason to change it.

  8 core, 16 gb ram, 6 data disks raid0, no swap configured

 is swap disabled ?

 Gradually,
  the system cpu becomes high almost 70%, and the client starts getting
  continuous timeouts

 70% of one core or 70% of all cores ?
 Check the server logs, is there GC activity ?
 check nodetool cfstats to see the read latency for the cf.

 Take a look at vmstat to see if you are swapping, and look at iostats to
 see if io is the problem
 http://spyced.blogspot.co.nz/2010/01/linux-performance-basics.html

 Cheers

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 8/06/2012, at 9:00 PM, Gurpreet Singh wrote:

 Thanks Ruslan.
 I will try the mmap_index_only.
 Is there any guideline as to when to leave it to auto and when to use
 mmap_index_only?

 /G

 On Fri, Jun 8, 2012 at 1:21 AM, ruslan usifov ruslan.usi...@gmail.com
 wrote:

 disk_access_mode: mmap??

 set to disk_access_mode: mmap_index_only in 

Re: kswapd0 causing read timeouts

2012-06-13 Thread Al Tobey
I would check /etc/sysctl.conf and get the values of
/proc/sys/vm/swappiness and /proc/sys/vm/vfs_cache_pressure.

If you don't have JNA enabled (which Cassandra uses to fadvise) and
swappiness is at its default of 60, the Linux kernel will happily swap out
your heap for cache space.  Set swappiness to 1 or 'swapoff -a' and kswapd
shouldn't be doing much unless you have a too-large heap or some other app
using up memory on the system.

On Wed, Jun 13, 2012 at 11:30 AM, ruslan usifov ruslan.usi...@gmail.comwrote:

 Hm, it's very strange what amount of you data? You linux kernel
 version? Java version?

 PS: i can suggest switch diskaccessmode to standart in you case
 PS:PS also upgrade you linux to latest, and javahotspot to 1.6.32
 (from oracle site)

 2012/6/13 Gurpreet Singh gurpreet.si...@gmail.com:
  Alright, here it goes again...
  Even with mmap_index_only, once the RES memory hit 15 gigs, the read
 latency
  went berserk. This happens in 12 hours if diskaccessmode is mmap, abt 48
 hrs
  if its mmap_index_only.
 
  only reads happening at 50 reads/second
  row cache size: 730 mb, row cache hit ratio: 0.75
  key cache size: 400 mb, key cache hit ratio: 0.4
  heap size (max 8 gigs): used 6.1-6.9 gigs
 
  No messages about reducing cache sizes in the logs
 
  stats:
  vmstat 1 : no swapping here, however high sys cpu utilization
  iostat (looks great) - avg-qu-sz = 8, avg await = 7 ms, svc time = 0.6,
 util
  = 15-30%
  top - VIRT - 19.8g, SHR - 6.1g, RES - 15g, high cpu, buffers - 2mb
  cfstats - 70-100 ms. This number used to be 20-30 ms.
 
  The value of the SHR keeps increasing (owing to mmap i guess), while at
 the
  same time buffers keeps decreasing. buffers starts as high as 50 mb, and
  goes down to 2 mb.
 
 
  This is very easily reproducible for me. Every time the RES memory hits
 abt
  15 gigs, the client starts getting timeouts from cassandra, the sys cpu
  jumps a lot. All this, even though my row cache hit ratio is almost 0.75.
 
  Other than just turning off mmap completely, is there any other solution
 or
  setting to avoid a cassandra restart every cpl of days. Something to keep
  the RES memory to hit such a high number. I have been constantly
 monitoring
  the RES, was not seeing issues when RES was at 14 gigs.
  /G
 
  On Fri, Jun 8, 2012 at 10:02 PM, Gurpreet Singh 
 gurpreet.si...@gmail.com
  wrote:
 
  Aaron, Ruslan,
  I changed the disk access mode to mmap_index_only, and it has been
 stable
  ever since, well at least for the past 20 hours. Previously, in abt
 10-12
  hours, as soon as the resident memory was full, the client would start
  timing out on all its reads. It looks fine for now, i am going to let it
  continue to see how long it lasts and if the problem comes again.
 
  Aaron,
  yes, i had turned swap off.
 
  The total cpu utilization was at 700% roughly.. It looked like kswapd0
 was
  using just 1 cpu, but cassandra (jsvc) cpu utilization increased quite a
  bit. top was reporting high system cpu, and low user cpu.
  vmstat was not showing swapping. java heap size max is 8 gigs. while
 only
  4 gigs was in use, so java heap was doing great. no gc in the logs.
 iostat
  was doing ok from what i remember, i will have to reproduce the issue
 for
  the exact numbers.
 
  cfstats latency had gone very high, but that is partly due to high cpu
  usage.
 
  One thing was clear, that the SHR was inching higher (due to the mmap)
  while buffer cache which started at abt 20-25mb reduced to 2 MB by the
 end,
  which probably means that pagecache was being evicted by the kswapd0. Is
  there a way to fix the size of the buffer cache and not let system
 evict it
  in favour of mmap?
 
  Also, mmapping data files would basically cause not only the data (asked
  for) to be read into main memory, but also a bunch of extra pages
  (readahead), which would not be very useful, right? The same thing for
 index
  would actually be more useful, as there would be more index entries in
 the
  readahead part.. and the index files being small wouldnt cause memory
  pressure that page cache would be evicted. mmapping the data files would
  make sense if the data size is smaller than the RAM or the hot data set
 is
  smaller than the RAM, otherwise just the index would probably be a
 better
  thing to mmap, no?. In my case data size is 85 gigs, while available
 RAM is
  16 gigs (only 8 gigs after heap).
 
  /G
 
 
  On Fri, Jun 8, 2012 at 11:44 AM, aaron morton aa...@thelastpickle.com
  wrote:
 
  Ruslan,
  Why did you suggest changing the disk_access_mode ?
 
  Gurpreet,
  I would leave the disk_access_mode with the default until you have a
  reason to change it.
 
   8 core, 16 gb ram, 6 data disks raid0, no swap configured
 
  is swap disabled ?
 
  Gradually,
   the system cpu becomes high almost 70%, and the client starts
 getting
   continuous timeouts
 
  70% of one core or 70% of all cores ?
  Check the server logs, is there GC activity ?
  check nodetool cfstats to see the read 

kswapd0 causing read timeouts

2012-06-08 Thread Gurpreet Singh
Hi,
I am testing cassandra 1.1 on a 1 node cluster.
8 core, 16 gb ram, 6 data disks raid0, no swap configured

cassandra 1.1.1
heap size: 8 gigs
key cache size in mb: 800 (used only 200mb till now)
memtable_total_space_in_mb : 2048

I am running a read workload of about 30 reads/second, with no writes at all.
The system runs fine for roughly 12 hours.

jconsole shows that my heap size has hardly touched 4 gigs.
top shows -
  SHR increasing slowly from 100 mb to 6.6 gigs in  these 12 hrs
  RES increases slowly from 6 gigs all the way to 15 gigs
  buffers are at a healthy 25 mb at some point and that goes down to 2 mb
in these 12 hrs
  VIRT stays at 85 gigs

I understand that SHR goes up because of mmap, RES goes up because it is
showing SHR value as well.

After around 10-12 hrs, the CPU utilization of the system starts
increasing, and I notice that the kswapd0 process starts becoming more active.
Gradually, the system CPU becomes high (almost 70%), and the client starts
getting continuous timeouts. The fact that the buffers went down from 20 mb
to 2 mb suggests that kswapd0 is probably swapping out the pagecache.

Is there a way out of this, to stop kswapd0 from kicking in even
when there is no swap configured?
This is very easily reproducible for me, and I would like a way out of this
situation. Do I need to adjust VM memory management settings like pagecache,
vfs_cache_pressure, things like that?

Just some extra information: JNA is installed and mlockall is successful.
There is no compaction running.
I would appreciate any help on this.
Thanks
Gurpreet


Re: kswapd0 causing read timeouts

2012-06-08 Thread ruslan usifov
Is disk_access_mode set to mmap?

Set disk_access_mode: mmap_index_only in cassandra.yaml.

2012/6/8 Gurpreet Singh gurpreet.si...@gmail.com:
 Hi,
 I am testing cassandra 1.1 on a 1 node cluster.
 8 core, 16 gb ram, 6 data disks raid0, no swap configured

 cassandra 1.1.1
 heap size: 8 gigs
 key cache size in mb: 800 (used only 200mb till now)
 memtable_total_space_in_mb : 2048

 I am running a read workload.. about 30 reads/second. no writes at all.
 The system runs fine for roughly 12 hours.

 jconsole shows that my heap size has hardly touched 4 gigs.
 top shows -
   SHR increasing slowly from 100 mb to 6.6 gigs in  these 12 hrs
   RES increases slowly from 6 gigs all the way to 15 gigs
   buffers are at a healthy 25 mb at some point and that goes down to 2 mb in
 these 12 hrs
   VIRT stays at 85 gigs

 I understand that SHR goes up because of mmap, RES goes up because it is
 showing SHR value as well.

 After around 10-12 hrs, the cpu utilization of the system starts increasing,
 and i notice that kswapd0 process starts becoming more active. Gradually,
 the system cpu becomes high almost 70%, and the client starts getting
 continuous timeouts. The fact that the buffers went down from 20 mb to 2 mb
 suggests that kswapd0 is probably swapping out the pagecache.

 Is there a way out of this to avoid the kswapd0 starting to do things even
 when there is no swap configured?
 This is very easily reproducible for me, and would like a way out of this
 situation. Do i need to adjust vm memory management stuff like pagecache,
 vfs_cache_pressure.. things like that?

 just some extra information, jna is installed, mlockall is successful. there
 is no compaction running.
 would appreciate any help on this.
 Thanks
 Gurpreet




Re: kswapd0 causing read timeouts

2012-06-08 Thread Gurpreet Singh
Thanks Ruslan.
I will try the mmap_index_only.
Is there any guideline as to when to leave it to auto and when to use
mmap_index_only?

/G

On Fri, Jun 8, 2012 at 1:21 AM, ruslan usifov ruslan.usi...@gmail.comwrote:

 disk_access_mode: mmap??

 set to disk_access_mode: mmap_index_only in cassandra yaml

 2012/6/8 Gurpreet Singh gurpreet.si...@gmail.com:
  Hi,
  I am testing cassandra 1.1 on a 1 node cluster.
  8 core, 16 gb ram, 6 data disks raid0, no swap configured
 
  cassandra 1.1.1
  heap size: 8 gigs
  key cache size in mb: 800 (used only 200mb till now)
  memtable_total_space_in_mb : 2048
 
  I am running a read workload.. about 30 reads/second. no writes at all.
  The system runs fine for roughly 12 hours.
 
  jconsole shows that my heap size has hardly touched 4 gigs.
  top shows -
SHR increasing slowly from 100 mb to 6.6 gigs in  these 12 hrs
RES increases slowly from 6 gigs all the way to 15 gigs
buffers are at a healthy 25 mb at some point and that goes down to 2
 mb in
  these 12 hrs
VIRT stays at 85 gigs
 
  I understand that SHR goes up because of mmap, RES goes up because it is
  showing SHR value as well.
 
  After around 10-12 hrs, the cpu utilization of the system starts
 increasing,
  and i notice that kswapd0 process starts becoming more active. Gradually,
  the system cpu becomes high almost 70%, and the client starts getting
  continuous timeouts. The fact that the buffers went down from 20 mb to 2
 mb
  suggests that kswapd0 is probably swapping out the pagecache.
 
  Is there a way out of this to avoid the kswapd0 starting to do things
 even
  when there is no swap configured?
  This is very easily reproducible for me, and would like a way out of this
  situation. Do i need to adjust vm memory management stuff like pagecache,
  vfs_cache_pressure.. things like that?
 
  just some extra information, jna is installed, mlockall is successful.
 there
  is no compaction running.
  would appreciate any help on this.
  Thanks
  Gurpreet
 
 



Re: kswapd0 causing read timeouts

2012-06-08 Thread ruslan usifov
2012/6/8 aaron morton aa...@thelastpickle.com:
 Ruslan,
 Why did you suggest changing the disk_access_mode ?

Because mmap can cause problems out of nowhere. In my case mmap caused a
similar problem and I couldn't find any way to resolve it other than
changing disk_access_mode :-(. I would also be interested to hear the
results from the author of this thread.


 Gurpreet,
 I would leave the disk_access_mode with the default until you have a reason
 to change it.

  8 core, 16 gb ram, 6 data disks raid0, no swap configured

 is swap disabled ?

 Gradually,
  the system cpu becomes high almost 70%, and the client starts getting
  continuous timeouts

 70% of one core or 70% of all cores ?
 Check the server logs, is there GC activity ?
 check nodetool cfstats to see the read latency for the cf.

 Take a look at vmstat to see if you are swapping, and look at iostats to see
 if io is the problem
 http://spyced.blogspot.co.nz/2010/01/linux-performance-basics.html

 Cheers

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 8/06/2012, at 9:00 PM, Gurpreet Singh wrote:

 Thanks Ruslan.
 I will try the mmap_index_only.
 Is there any guideline as to when to leave it to auto and when to use
 mmap_index_only?

 /G

 On Fri, Jun 8, 2012 at 1:21 AM, ruslan usifov ruslan.usi...@gmail.com
 wrote:

 disk_access_mode: mmap??

 set to disk_access_mode: mmap_index_only in cassandra yaml

 2012/6/8 Gurpreet Singh gurpreet.si...@gmail.com:
  Hi,
  I am testing cassandra 1.1 on a 1 node cluster.
  8 core, 16 gb ram, 6 data disks raid0, no swap configured
 
  cassandra 1.1.1
  heap size: 8 gigs
  key cache size in mb: 800 (used only 200mb till now)
  memtable_total_space_in_mb : 2048
 
  I am running a read workload.. about 30 reads/second. no writes at all.
  The system runs fine for roughly 12 hours.
 
  jconsole shows that my heap size has hardly touched 4 gigs.
  top shows -
    SHR increasing slowly from 100 mb to 6.6 gigs in  these 12 hrs
    RES increases slowly from 6 gigs all the way to 15 gigs
    buffers are at a healthy 25 mb at some point and that goes down to 2
  mb in
  these 12 hrs
    VIRT stays at 85 gigs
 
  I understand that SHR goes up because of mmap, RES goes up because it is
  showing SHR value as well.
 
  After around 10-12 hrs, the cpu utilization of the system starts
  increasing,
  and i notice that kswapd0 process starts becoming more active.
  Gradually,
  the system cpu becomes high almost 70%, and the client starts getting
  continuous timeouts. The fact that the buffers went down from 20 mb to 2
  mb
  suggests that kswapd0 is probably swapping out the pagecache.
 
  Is there a way out of this to avoid the kswapd0 starting to do things
  even
  when there is no swap configured?
  This is very easily reproducible for me, and would like a way out of
  this
  situation. Do i need to adjust vm memory management stuff like
  pagecache,
  vfs_cache_pressure.. things like that?
 
  just some extra information, jna is installed, mlockall is successful.
  there
  is no compaction running.
  would appreciate any help on this.
  Thanks
  Gurpreet
 
 





Re: kswapd0 causing read timeouts

2012-06-08 Thread Gurpreet Singh
Aaron, Ruslan,
I changed the disk access mode to mmap_index_only, and it has been stable
ever since, at least for the past 20 hours. Previously, after about 10-12
hours, as soon as resident memory was full, the client would start timing
out on all its reads. It looks fine for now; I am going to let it continue
to see how long it lasts and whether the problem comes back.

Aaron,
yes, I had turned swap off.

Total CPU utilization was roughly 700%. It looked like kswapd0 was using
just one CPU, but Cassandra's (jsvc) CPU utilization increased quite a bit;
top was reporting high system CPU and low user CPU.
vmstat was not showing any swapping. The maximum Java heap size is 8 GB and
only 4 GB was in use, so the heap was doing fine, with no GC in the logs.
iostat looked OK from what I remember; I will have to reproduce the issue
to get the exact numbers.
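
For the next reproduction, a minimal capture along the lines Aaron suggested
(standard Linux and Cassandra tools; the pgrep pattern is an assumption, since
under jsvc the daemon may show up differently):

  vmstat 5                                # si/so columns show swapping; also watch system CPU and the run queue
  iostat -x 5                             # per-disk utilisation and await
  nodetool cfstats                        # per-column-family read latency
  top -H -p $(pgrep -f CassandraDaemon)   # per-thread CPU inside the JVM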

The cfstats read latency had gone very high, but that is partly due to the
high CPU usage.

One thing was clear: SHR was inching higher (due to the mmap) while the
buffer cache, which started at about 20-25 MB, had shrunk to 2 MB by the
end, which probably means the page cache was being evicted by kswapd0. Is
there a way to pin the size of the buffer cache and not let the system
evict it in favour of mmap?

Also, mmapping the data files would cause not only the requested data to be
read into main memory but also a bunch of extra pages (readahead), which
would not be very useful, right? The same readahead is actually more useful
for the index, since it pulls in more index entries, and the index files
are small enough that they would not create the memory pressure that gets
the page cache evicted. Mmapping the data files would make sense if the
data size, or at least the hot data set, were smaller than RAM; otherwise
just the index is probably the better thing to mmap, no? In my case the
data size is 85 GB, while available RAM is 16 GB (only 8 GB after the
heap).
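
Checking and lowering block-device readahead is one way to limit how much cold
data each mmap page fault drags in (a sketch only; /dev/sdX is a placeholder
for each data disk in the RAID0 set, and 128 sectors = 64 KB is an
illustrative value, not a tested recommendation):

  blockdev --getra /dev/sdX      # current readahead, in 512-byte sectors
  blockdev --setra 128 /dev/sdX  # drop readahead to 64 KB so each fault pulls in less cold data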

/G


On Fri, Jun 8, 2012 at 11:44 AM, aaron morton aa...@thelastpickle.comwrote:

 Ruslan,
 Why did you suggest changing the disk_access_mode ?

 Gurpreet,
 I would leave the disk_access_mode with the default until you have a
 reason to change it.

  8 core, 16 gb ram, 6 data disks raid0, no swap configured

 is swap disabled ?

 Gradually,
  the system cpu becomes high almost 70%, and the client starts getting
  continuous timeouts

 70% of one core or 70% of all cores ?
 Check the server logs, is there GC activity ?
 check nodetool cfstats to see the read latency for the cf.

 Take a look at vmstat to see if you are swapping, and look at iostats to
 see if io is the problem
 http://spyced.blogspot.co.nz/2010/01/linux-performance-basics.html

 Cheers

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 8/06/2012, at 9:00 PM, Gurpreet Singh wrote:

 Thanks Ruslan.
 I will try the mmap_index_only.
 Is there any guideline as to when to leave it to auto and when to use
 mmap_index_only?

 /G

 On Fri, Jun 8, 2012 at 1:21 AM, ruslan usifov ruslan.usi...@gmail.comwrote:

 disk_access_mode: mmap??

 set to disk_access_mode: mmap_index_only in cassandra yaml

 2012/6/8 Gurpreet Singh gurpreet.si...@gmail.com:
  Hi,
  I am testing cassandra 1.1 on a 1 node cluster.
  8 core, 16 gb ram, 6 data disks raid0, no swap configured
 
  cassandra 1.1.1
  heap size: 8 gigs
  key cache size in mb: 800 (used only 200mb till now)
  memtable_total_space_in_mb : 2048
 
  I am running a read workload.. about 30 reads/second. no writes at all.
  The system runs fine for roughly 12 hours.
 
  jconsole shows that my heap size has hardly touched 4 gigs.
  top shows -
SHR increasing slowly from 100 mb to 6.6 gigs in  these 12 hrs
RES increases slowly from 6 gigs all the way to 15 gigs
buffers are at a healthy 25 mb at some point and that goes down to 2
 mb in
  these 12 hrs
VIRT stays at 85 gigs
 
  I understand that SHR goes up because of mmap, RES goes up because it is
  showing SHR value as well.
 
  After around 10-12 hrs, the cpu utilization of the system starts
 increasing,
  and i notice that kswapd0 process starts becoming more active.
 Gradually,
  the system cpu becomes high almost 70%, and the client starts getting
  continuous timeouts. The fact that the buffers went down from 20 mb to
 2 mb
  suggests that kswapd0 is probably swapping out the pagecache.
 
  Is there a way out of this to avoid the kswapd0 starting to do things
 even
  when there is no swap configured?
  This is very easily reproducible for me, and would like a way out of
 this
  situation. Do i need to adjust vm memory management stuff like
 pagecache,
  vfs_cache_pressure.. things like that?
 
  just some extra information, jna is installed, mlockall is successful.
 there
  is no compaction running.
  would appreciate any help on this.
  Thanks
  Gurpreet
 
 






Fwd: read timeouts in cassandra 0.6.5

2010-11-05 Thread Adam Crain
Hi,

I have a simple keyspace:

<Keyspace Name="reef-test">
  <ColumnFamily Name="Meas" CompareWith="LongType" />
  <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
  <ReplicationFactor>1</ReplicationFactor>
  <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
</Keyspace>

We're using it as a data historian: many rows of measurements, with each
measurement's history stored in columns keyed by milliseconds since the
UNIX epoch. The single node never has a problem writing, but even at low
volume it will frequently time out while reading:

Read timed out
org.apache.thrift.transport.TTransportException:
java.net.SocketTimeoutException: Read timed out
at 
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:128)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at 
org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:314)
at 
org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:262)
at 
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:192)
at 
org.apache.cassandra.thrift.Cassandra$Client.recv_multiget_slice(Cassandra.java:477)
at 
org.apache.cassandra.thrift.Cassandra$Client.multiget_slice(Cassandra.java:458)


What settings should I be tuning to avoid this? Usually the latency is
quite low, but once in every 10 queries or so it's completely off the
chart.
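
A hedged sketch of the 0.6-era knobs that are usually involved here (element
and attribute names recalled from storage-conf.xml, so treat them as
assumptions and verify against your own config): raise the server-side RPC
timeout, give the hot column family a key cache, and make sure the Thrift
client's socket timeout is at least as large as the server-side value.

  <!-- storage-conf.xml (sketch) -->
  <RpcTimeoutInMillis>20000</RpcTimeoutInMillis>
  <!-- ... other settings unchanged ... -->
  <ColumnFamily Name="Meas" CompareWith="LongType"
                KeysCached="200000" RowsCached="0"/>

On the client side, splitting a large multiget_slice into smaller batches of
keys may also help, so no single request has to materialize too many wide
rows at once.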


thanks,

Adam