[jira] [Resolved] (CASSANDRA-7903) tombstone created upon insert of new row

2014-09-09 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict resolved CASSANDRA-7903.
-
Resolution: Not a Problem

Inserting NULL has exactly the same semantics as a delete, and inserts a tombstone.

 tombstone created upon insert of new row
 

 Key: CASSANDRA-7903
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7903
 Project: Cassandra
  Issue Type: Bug
Reporter: Thanh

 A tombstone is created upon insert of a new row, depending on how the row is 
 inserted.
 A simple way to observe this behavior:
 Using cqlsh:
 CREATE TABLE users1 (
   userid text PRIMARY KEY,
   first_name text,
   last_name text);
 insert into users1 (userid, first_name) values ('a','a');
 tracing on;
 select * from users1;
 Trace results show 1 live cell and 0 tombstone cells created as a result:
  userid | first_name | last_name
 --------+------------+-----------
       a |          a |      null
 (1 rows)
 …
 Read 1 live and 0 tombstoned cells | 00:31:31,487 | 10.240.203.201 | 1275
 Scanned 1 rows and matched 1 | 00:31:31,487 | 10.240.203.201 | 1328
 …
 Now,
 insert into users1 (userid, first_name,last_name) values ('b','b',null);
 select * from users1;
 Trace results show 1 live cell and 1 tombstone cell created as a result:
  userid | first_name | last_name
 --------+------------+-----------
       a |          a |      null
       b |          b |      null
 (2 rows)
 …
 Read 1 live and 0 tombstoned cells | 00:35:09,357 | 10.240.203.201 | 1243
 Read 1 live and 1 tombstoned cells | 00:35:09,357 | 10.240.203.201 | 1383
 Scanned 2 rows and matched 2 | 00:35:09,357 | 10.240.203.201 | 1438
 …
 A tombstone is not expected to be created in either case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-7907) Determine how many network threads we need for native transport

2014-09-09 Thread Benedict (JIRA)
Benedict created CASSANDRA-7907:
---

 Summary: Determine how many network threads we need for native 
transport
 Key: CASSANDRA-7907
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7907
 Project: Cassandra
  Issue Type: Improvement
Reporter: Benedict
Priority: Minor


With the introduction of CASSANDRA-4718, it is highly likely we can cope with 
just _one_ network IO thread. We could even try pinning it to a single 
(optionally configurable) core, and (also optionally) pin all other threads to 
a different core, so that we can guarantee extremely prompt execution (and if 
pinned to the correct core the OS uses for managing the network, improve 
throughput further).

Testing this out will be challenging, as we need to simulate clients from lots 
of IPs. However, it is quite likely this would reduce the percentage of time 
spent in kernel networking calls, and the amount of context switching.
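
As a rough illustration (not the actual o.a.c.transport code - class and thread names here are only assumptions), restricting the native transport to a single network IO thread with Netty might look something like:

{code}
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.util.concurrent.DefaultThreadFactory;

public class SingleIoThreadServer
{
    public static ServerBootstrap bootstrap()
    {
        // One acceptor thread and a single IO thread shared by all client connections.
        // Pinning that thread to a core would be handled separately (e.g. natively),
        // using the distinctive thread name to identify it.
        EventLoopGroup acceptor = new NioEventLoopGroup(1, new DefaultThreadFactory("native-transport-accept"));
        EventLoopGroup io = new NioEventLoopGroup(1, new DefaultThreadFactory("native-transport-io"));
        return new ServerBootstrap()
               .group(acceptor, io)
               .channel(NioServerSocketChannel.class);
    }
}
{code}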



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7719) Add PreparedStatements related metrics

2014-09-09 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127935#comment-14127935
 ] 

Benedict commented on CASSANDRA-7719:
-

bq. Don't use String.format in logger.trace, use parameterized messages. Like

More important than this is to guard the logger.trace() call with 
logger.isTraceEnabled(), so that it is optimised away when not enabled. If 
String.format buys you more useful formatting it can be justified, but here it 
looks like the logger can cope with your params just as well.
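
A minimal sketch of the guarded, parameterized form being suggested (names here are hypothetical, not from the patch):

{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PreparedStatementMetrics
{
    private static final Logger logger = LoggerFactory.getLogger(PreparedStatementMetrics.class);

    void onPrepare(String queryId, String keyspace, int cacheSize)
    {
        // With three or more arguments SLF4J uses the Object... overload, so the
        // varargs array is allocated and cacheSize is boxed at the call site even
        // when TRACE is disabled; the guard lets the whole call be skipped.
        if (logger.isTraceEnabled())
            logger.trace("Prepared {} for keyspace {}; cache now holds {} statements",
                         queryId, keyspace, cacheSize);
    }
}
{code}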

 Add PreparedStatements related metrics
 --

 Key: CASSANDRA-7719
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7719
 Project: Cassandra
  Issue Type: New Feature
Reporter: Michaël Figuière
Assignee: T Jake Luciani
Priority: Minor
 Fix For: 2.1.1

 Attachments: 7719.txt


 Cassandra newcomers often don't understand that they're expected to use 
 PreparedStatements for almost all of their repetitive queries executed in 
 production.
 It doesn't look like Cassandra currently exposes any PreparedStatements-related 
 metrics. It would be interesting, and I believe fairly simple, to add several 
 of them to make it possible, in development / management / monitoring tools, 
 to show warnings or alerts related to this bad practice.
 Thus I would suggest adding the following metrics:
 * Executed prepared statements count
 * Executed unprepared statements count
 * Number of PreparedStatements that have been registered on the node



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7719) Add PreparedStatements related metrics

2014-09-09 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127999#comment-14127999
 ] 

Benedict commented on CASSANDRA-7719:
-

Only after constructing an object array and boxing the parameters. It cannot 
optimise that away since it happens prior to the invocation.

 Add PreparedStatements related metrics
 --

 Key: CASSANDRA-7719
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7719
 Project: Cassandra
  Issue Type: New Feature
Reporter: Michaël Figuière
Assignee: T Jake Luciani
Priority: Minor
 Fix For: 2.1.1

 Attachments: 7719.txt


 Cassandra newcomers often don't understand that they're expected to use 
 PreparedStatements for almost all of their repetitive queries executed in 
 production.
 It doesn't look like Cassandra currently exposes any PreparedStatements-related 
 metrics. It would be interesting, and I believe fairly simple, to add several 
 of them to make it possible, in development / management / monitoring tools, 
 to show warnings or alerts related to this bad practice.
 Thus I would suggest adding the following metrics:
 * Executed prepared statements count
 * Executed unprepared statements count
 * Number of PreparedStatements that have been registered on the node



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7907) Determine how many network threads we need for native transport

2014-09-10 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128651#comment-14128651
 ] 

Benedict commented on CASSANDRA-7907:
-

The _pinning_ is a secondary concern that I definitely want to leave optional 
(i.e. implement it, but leave it configurable, and default to off until we 
collect extensive data on good widely applicable defaults). I don't expect the 
user to taskset, however; we'd do this for the user, but let them specify the 
cpu id in the yaml if they think they can do a better job of it.

bq. do we have reason to believe that we're bottle-necking on this

It's difficult to benchmark networking overheads accurately, but it's a 
significant portion (perhaps majority) of our cpu time for in-memory workloads. 
Anything we can do to reduce this we should explore.

 Determine how many network threads we need for native transport
 ---

 Key: CASSANDRA-7907
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7907
 Project: Cassandra
  Issue Type: Improvement
Reporter: Benedict
Priority: Minor

 With the introduction of CASSANDRA-4718, it is highly likely we can cope with 
 just _one_ network IO thread. We could even try pinning it to a single 
 (optionally configurable) core, and (also optionally) pin all other threads 
 to a different core, so that we can guarantee extremely prompt execution (and 
 if pinned to the correct core the OS uses for managing the network, improve 
 throughput further).
 Testing this out will be challenging, as we need to simulate clients from 
 lots of IPs. However, it is quite likely this would reduce the percentage of 
 time spent in kernel networking calls, and the amount of context switching.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7907) Determine how many network threads we need for native transport

2014-09-10 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129314#comment-14129314
 ] 

Benedict commented on CASSANDRA-7907:
-

bq. I'd want some evidence that pinning to cores is going to give us a 
measurable benefit before adding it to the code-base

bq. _We could even *try* pinning_

Yes, we need to demonstrate an effect. But that is standard practice for 
performance enhancements :-)

We have prior evidence that an effect will be seen, however. Not only from the 
general experience of having done this before in other contexts (yourself 
included), but [~jasobrown] has done this on Cassandra, I believe as part of 
his investigations for CASSANDRA-4718, and seen an effect.

 Determine how many network threads we need for native transport
 ---

 Key: CASSANDRA-7907
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7907
 Project: Cassandra
  Issue Type: Improvement
Reporter: Benedict
Priority: Minor

 With the introduction of CASSANDRA-4718, it is highly likely we can cope with 
 just _one_ network IO thread. We could even try pinning it to a single 
 (optionally configurable) core, and (also optionally) pin all other threads 
 to a different core, so that we can guarantee extremely prompt execution (and 
 if pinned to the correct core the OS uses for managing the network, improve 
 throughput further).
 Testing this out will be challenging, as we need to simulate clients from 
 lots of IPs. However, it is quite likely this would reduce the percentage of 
 time spent in kernel networking calls, and the amount of context switching.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-7468) Add time-based execution to cassandra-stress

2014-09-10 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-7468:

Assignee: Benedict  (was: Matt Kennedy)

 Add time-based execution to cassandra-stress
 

 Key: CASSANDRA-7468
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7468
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Matt Kennedy
Assignee: Benedict
Priority: Minor
 Fix For: 2.1.1

 Attachments: 7468v2.txt, trunk-7468-rebase.patch, trunk-7468.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7918) Provide graphing tool along with cassandra-stress

2014-11-26 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226766#comment-14226766
 ] 

Benedict commented on CASSANDRA-7918:
-

My plane journey was spent manically trying various graphing options to give 
everything you need to assess a branch in one view, and clearly. I'd hate that 
to go to waste. The new patch as it stands only produces the graphs we've 
always got - I'd like to see cstar and our bundled tool produce _better 
graphs_. Each one of the graphs in the gnuplot output is designed to let you 
see more information; it's all normalised, coloured and scattered so you can 
distinguish the results at each moment in time and overall. Too often with the 
web output I have to simply glance at the average to tell what's going on (or 
guess-and-peck numbers for zooming in), and have to click at each different 
stat which is laborious (and, let's be honest, we don't do it thoroughly, we 
just peck at a few... or perhaps I'm lazier than everyone else :))

To elaborate on the alternative, there are ten graphs in one view in the 
gnuplot version, scaled so you can see everything you need to know without 
clicking once. The left-most graph of each set normalises each moment of each 
run against the base run, so that variability can easily be broken down across 
the run. The middle graph plots the raw data so you can get a feel for its 
shape, and the final graph plots the median, quartiles and deciles. The 
latencies are all plotted with distinct scatters / lines to make it easy to 
distinguish which p-range we're looking at, even when they cross. GC is also 
plotted specially as a cumulative run, since this teases out differences much 
more clearly.

I have nothing against discarding the gnuplot approach, but I'd like to see 
whatever solution we produce deliver really great graphs that allow us to make 
decisions more easily and more accurately. Right now I'd prefer to put the 
gnuplot work into cstar than the other way around. Though I can tell the hatred 
for it runs deep!

 Provide graphing tool along with cassandra-stress
 -

 Key: CASSANDRA-7918
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7918
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Benedict
Assignee: Ryan McGuire
Priority: Minor

 Whilst cstar makes some pretty graphs, they're a little limited and also 
 require you to run your tests through it. It would be useful to be able to 
 graph results from any stress run easily.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8061) tmplink files are not removed

2014-11-26 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226972#comment-14226972
 ] 

Benedict commented on CASSANDRA-8061:
-

[~JoshuaMcKenzie] nice spot, that's definitely a bug. It would require the 
partitions to be circa 500K in size, but it couldn't leave a file intact and 
undeleted; it could only potentially leak a file descriptor. So it's possible 
it's related to CASSANDRA-8248, but definitely not to this one. We should 
probably reopen 8248 and file against that.


 tmplink files are not removed
 -

 Key: CASSANDRA-8061
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8061
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: Linux
Reporter: Gianluca Borello
Assignee: Joshua McKenzie
Priority: Critical
 Fix For: 2.1.3

 Attachments: 8061_v1.txt, 8248-thread_dump.txt


 After installing 2.1.0, I'm experiencing a bunch of tmplink files that are 
 filling my disk. I found https://issues.apache.org/jira/browse/CASSANDRA-7803 
 and that is very similar, and I can confirm it happens both on 2.1.0 and on 
 the latest commit on the cassandra-2.1 branch 
 (https://github.com/apache/cassandra/commit/aca80da38c3d86a40cc63d9a122f7d45258e4685).
 Even starting with a clean keyspace, after a few hours I get:
 {noformat}
 $ sudo find /raid0 | grep tmplink | xargs du -hs
 2.7G  
 /raid0/cassandra/data/draios/protobuf1-ccc6dce04beb11e4abf997b38fbf920b/draios-protobuf1-tmplink-ka-4515-Data.db
 13M   
 /raid0/cassandra/data/draios/protobuf1-ccc6dce04beb11e4abf997b38fbf920b/draios-protobuf1-tmplink-ka-4515-Index.db
 1.8G  
 /raid0/cassandra/data/draios/protobuf_by_agent1-cd071a304beb11e4abf997b38fbf920b/draios-protobuf_by_agent1-tmplink-ka-1788-Data.db
 12M   
 /raid0/cassandra/data/draios/protobuf_by_agent1-cd071a304beb11e4abf997b38fbf920b/draios-protobuf_by_agent1-tmplink-ka-1788-Index.db
 5.2M  
 /raid0/cassandra/data/draios/protobuf_by_agent1-cd071a304beb11e4abf997b38fbf920b/draios-protobuf_by_agent1-tmplink-ka-2678-Index.db
 822M  
 /raid0/cassandra/data/draios/protobuf_by_agent1-cd071a304beb11e4abf997b38fbf920b/draios-protobuf_by_agent1-tmplink-ka-2678-Data.db
 7.3M  
 /raid0/cassandra/data/draios/protobuf_by_agent1-cd071a304beb11e4abf997b38fbf920b/draios-protobuf_by_agent1-tmplink-ka-3283-Index.db
 1.2G  
 /raid0/cassandra/data/draios/protobuf_by_agent1-cd071a304beb11e4abf997b38fbf920b/draios-protobuf_by_agent1-tmplink-ka-3283-Data.db
 6.7M  
 /raid0/cassandra/data/draios/protobuf_by_agent1-cd071a304beb11e4abf997b38fbf920b/draios-protobuf_by_agent1-tmplink-ka-3951-Index.db
 1.1G  
 /raid0/cassandra/data/draios/protobuf_by_agent1-cd071a304beb11e4abf997b38fbf920b/draios-protobuf_by_agent1-tmplink-ka-3951-Data.db
 11M   
 /raid0/cassandra/data/draios/protobuf_by_agent1-cd071a304beb11e4abf997b38fbf920b/draios-protobuf_by_agent1-tmplink-ka-4799-Index.db
 1.7G  
 /raid0/cassandra/data/draios/protobuf_by_agent1-cd071a304beb11e4abf997b38fbf920b/draios-protobuf_by_agent1-tmplink-ka-4799-Data.db
 812K  
 /raid0/cassandra/data/draios/mounted_fs_by_agent1-d7bf3e304beb11e4abf997b38fbf920b/draios-mounted_fs_by_agent1-tmplink-ka-234-Index.db
 122M  
 /raid0/cassandra/data/draios/mounted_fs_by_agent1-d7bf3e304beb11e4abf997b38fbf920b/draios-mounted_fs_by_agent1-tmplink-ka-208-Data.db
 744K  
 /raid0/cassandra/data/draios/mounted_fs_by_agent1-d7bf3e304beb11e4abf997b38fbf920b/draios-mounted_fs_by_agent1-tmplink-ka-739-Index.db
 660K  
 /raid0/cassandra/data/draios/mounted_fs_by_agent1-d7bf3e304beb11e4abf997b38fbf920b/draios-mounted_fs_by_agent1-tmplink-ka-193-Index.db
 796K  
 /raid0/cassandra/data/draios/mounted_fs_by_agent1-d7bf3e304beb11e4abf997b38fbf920b/draios-mounted_fs_by_agent1-tmplink-ka-230-Index.db
 137M  
 /raid0/cassandra/data/draios/mounted_fs_by_agent1-d7bf3e304beb11e4abf997b38fbf920b/draios-mounted_fs_by_agent1-tmplink-ka-230-Data.db
 161M  
 /raid0/cassandra/data/draios/mounted_fs_by_agent1-d7bf3e304beb11e4abf997b38fbf920b/draios-mounted_fs_by_agent1-tmplink-ka-269-Data.db
 139M  
 /raid0/cassandra/data/draios/mounted_fs_by_agent1-d7bf3e304beb11e4abf997b38fbf920b/draios-mounted_fs_by_agent1-tmplink-ka-234-Data.db
 940K  
 /raid0/cassandra/data/draios/mounted_fs_by_agent1-d7bf3e304beb11e4abf997b38fbf920b/draios-mounted_fs_by_agent1-tmplink-ka-786-Index.db
 936K  
 /raid0/cassandra/data/draios/mounted_fs_by_agent1-d7bf3e304beb11e4abf997b38fbf920b/draios-mounted_fs_by_agent1-tmplink-ka-269-Index.db
 161M  
 /raid0/cassandra/data/draios/mounted_fs_by_agent1-d7bf3e304beb11e4abf997b38fbf920b/draios-mounted_fs_by_agent1-tmplink-ka-786-Data.db
 672K  
 /raid0/cassandra/data/draios/mounted_fs_by_agent1-d7bf3e304beb11e4abf997b38fbf920b/draios-mounted_fs_by_agent1-tmplink-ka-197-Index.db
 113M  
 

[jira] [Commented] (CASSANDRA-8325) Cassandra 2.1.x fails to start on FreeBSD (JVM crash)

2014-11-26 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226992#comment-14226992
 ] 

Benedict commented on CASSANDRA-8325:
-

You might be right. The javadoc does make it quite explicit that this should 
not be permitted; however, the hotspot code in library_call.cpp 
(inline_unsafe_access and classify_unsafe_addr) _seems_ to indicate it should 
be valid and behave the same, though it's hard to say for sure without getting 
the project working better to explore the code more fully.

However, given that it is documented as invalid usage, it does seem sensible to 
change it. But this means a potential performance penalty in one of the most 
heavily used codepaths.
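
For context, here is a minimal sketch (not from the ticket) of the two sun.misc.Unsafe addressing forms under discussion - a base object plus relative offset for on-heap data, and a raw absolute address for native memory. The open question above is whether the two-argument overload with a null base is guaranteed to behave like the single-argument absolute form on every platform:

{code}
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class UnsafeAddressingForms
{
    private static final Unsafe unsafe = loadUnsafe();

    private static Unsafe loadUnsafe()
    {
        try
        {
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            return (Unsafe) field.get(null);
        }
        catch (Exception e)
        {
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args)
    {
        // On-heap form: a non-null base object plus a relative offset.
        long[] onHeap = new long[] { 42L };
        System.out.println(unsafe.getLong(onHeap, (long) Unsafe.ARRAY_LONG_BASE_OFFSET));

        // Off-heap form: a raw native address via the single-argument overload.
        long address = unsafe.allocateMemory(8);
        unsafe.putLong(address, 42L);
        System.out.println(unsafe.getLong(address));

        // The contested usage: a null base with an absolute address passed as the offset.
        System.out.println(unsafe.getLong(null, address));

        unsafe.freeMemory(address);
    }
}
{code}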

 Cassandra 2.1.x fails to start on FreeBSD (JVM crash)
 -

 Key: CASSANDRA-8325
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8325
 Project: Cassandra
  Issue Type: Bug
 Environment: FreeBSD 10.0 with openjdk version 1.7.0_71, 64-Bit 
 Server VM
Reporter: Leonid Shalupov
 Attachments: hs_err_pid1856.log, system.log


 See attached error file after JVM crash
 {quote}
 FreeBSD xxx.intellij.net 10.0-RELEASE FreeBSD 10.0-RELEASE #0 r260789: Thu 
 Jan 16 22:34:59 UTC 2014 
 r...@snap.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64
 {quote}
 {quote}
  % java -version
 openjdk version 1.7.0_71
 OpenJDK Runtime Environment (build 1.7.0_71-b14)
 OpenJDK 64-Bit Server VM (build 24.71-b01, mixed mode)
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-8383) Memtable flush may expire records from the commit log that are in a later memtable

2014-11-27 Thread Benedict (JIRA)
Benedict created CASSANDRA-8383:
---

 Summary: Memtable flush may expire records from the commit log 
that are in a later memtable
 Key: CASSANDRA-8383
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8383
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Benedict
Assignee: Benedict
Priority: Critical
 Fix For: 2.1.3


This is a pretty obvious bug with any careful thought, so I'm not sure how I 
managed to introduce it. We use OpOrder to ensure all writes to a memtable have 
finished before flushing; however, we also use this OpOrder to direct writes to 
the correct memtable. This is insufficient, since the OpOrder is only a 
partial order; an operation from the future (i.e. for the next memtable) 
could still interleave with the past operations in such a way that they grab 
a CL entry in between the past operations. Since we simply take the max 
ReplayPosition of those in the past, this would mean any interleaved future 
operations would be expired even though they haven't been persisted to disk.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8383) Memtable flush may expire records from the commit log that are in a later memtable

2014-11-27 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227622#comment-14227622
 ] 

Benedict commented on CASSANDRA-8383:
-

Initial patch 
[here|https://github.com/belliottsmith/cassandra/tree/8383-bug-clexpirereorder]

We should also introduce a commit log correctness stress test, so we can 
reproduce this, be certain it is fixed, and so we can be sure to avoid this or 
similar scenarios in future. 

 Memtable flush may expire records from the commit log that are in a later 
 memtable
 --

 Key: CASSANDRA-8383
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8383
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Benedict
Assignee: Benedict
Priority: Critical
  Labels: commitlog
 Fix For: 2.1.3


 This is a pretty obvious bug with any careful thought, so I'm not sure how I 
 managed to introduce it. We use OpOrder to ensure all writes to a memtable 
 have finished before flushing; however, we also use this OpOrder to direct 
 writes to the correct memtable. This is insufficient, since the OpOrder is 
 only a partial order; an operation from the future (i.e. for the next 
 memtable) could still interleave with the past operations in such a way that 
 they grab a CL entry in between the past operations. Since we simply take the 
 max ReplayPosition of those in the past, this would mean any interleaved 
 future operations would be expired even though they haven't been persisted to 
 disk.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8192) AssertionError in Memory.java

2014-11-27 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227821#comment-14227821
 ] 

Benedict commented on CASSANDRA-8192:
-

If it's the data, it's likely a pretty simple issue of corruption and 
suboptimal error checking. A compression metadata file is probably zero bytes 
long, so when we allocate memory to store it in, we don't allocate any memory 
for it. Exactly how it ended up empty is another matter, and is possibly a bug.

Try running {code}find . -iname '*CompressionInfo.db' -size 0{code} in your data 
directory.

 AssertionError in Memory.java
 -

 Key: CASSANDRA-8192
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8192
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: Windows-7-32 bit, 3GB RAM, Java 1.7.0_67
Reporter: Andreas Schnitzerling
Assignee: Joshua McKenzie
 Fix For: 2.1.3

 Attachments: cassandra.bat, cassandra.yaml, 
 logdata-onlinedata-ka-196504-CompressionInfo.zip, printChunkOffsetErrors.txt, 
 system-compactions_in_progress-ka-47594-CompressionInfo.zip, 
 system-sstable_activity-jb-25-Filter.zip, system.log, system_AssertionTest.log


 Since updating 1 of 12 nodes from 2.1.0-rel to 2.1.1-rel, an exception occurs 
 during start-up.
 {panel:title=system.log}
 ERROR [SSTableBatchOpen:1] 2014-10-27 09:44:00,079 CassandraDaemon.java:153 - 
 Exception in thread Thread[SSTableBatchOpen:1,5,main]
 java.lang.AssertionError: null
   at org.apache.cassandra.io.util.Memory.size(Memory.java:307) 
 ~[apache-cassandra-2.1.1.jar:2.1.1]
   at 
 org.apache.cassandra.io.compress.CompressionMetadata.init(CompressionMetadata.java:135)
  ~[apache-cassandra-2.1.1.jar:2.1.1]
   at 
 org.apache.cassandra.io.compress.CompressionMetadata.create(CompressionMetadata.java:83)
  ~[apache-cassandra-2.1.1.jar:2.1.1]
   at 
 org.apache.cassandra.io.util.CompressedSegmentedFile$Builder.metadata(CompressedSegmentedFile.java:50)
  ~[apache-cassandra-2.1.1.jar:2.1.1]
   at 
 org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Builder.complete(CompressedPoolingSegmentedFile.java:48)
  ~[apache-cassandra-2.1.1.jar:2.1.1]
   at 
 org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:766) 
 ~[apache-cassandra-2.1.1.jar:2.1.1]
   at 
 org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:725) 
 ~[apache-cassandra-2.1.1.jar:2.1.1]
   at 
 org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:402) 
 ~[apache-cassandra-2.1.1.jar:2.1.1]
   at 
 org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:302) 
 ~[apache-cassandra-2.1.1.jar:2.1.1]
   at 
 org.apache.cassandra.io.sstable.SSTableReader$4.run(SSTableReader.java:438) 
 ~[apache-cassandra-2.1.1.jar:2.1.1]
   at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) 
 ~[na:1.7.0_55]
   at java.util.concurrent.FutureTask.run(Unknown Source) ~[na:1.7.0_55]
   at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) 
 [na:1.7.0_55]
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) 
 [na:1.7.0_55]
   at java.lang.Thread.run(Unknown Source) [na:1.7.0_55]
 {panel}
 In the attached log you can still see as well CASSANDRA-8069 and 
 CASSANDRA-6283.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (CASSANDRA-8388) java.lang.AssertionError: null

2014-11-28 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict resolved CASSANDRA-8388.
-
Resolution: Duplicate

 java.lang.AssertionError: null 
 ---

 Key: CASSANDRA-8388
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8388
 Project: Cassandra
  Issue Type: Bug
Reporter: Ilya Komolkin

 21:00:10.156 [SSTableBatchOpen:5] ERROR o.a.c.service.CassandraDaemon - 
 Exception in thread Thread[SSTableBatchOpen:5,5,main]
 java.lang.AssertionError: null
 at org.apache.cassandra.io.util.Memory.size(Memory.java:307) 
 ~[cassandra-all-2.1.1.jar:2.1.1]
 at 
 org.apache.cassandra.io.compress.CompressionMetadata.init(CompressionMetadata.java:135)
  ~[cassandra-all-2.1.1.jar:2.1.1]
 at 
 org.apache.cassandra.io.compress.CompressionMetadata.create(CompressionMetadata.java:83)
  ~[cassandra-all-2.1.1.jar:2.1.1]
 at 
 org.apache.cassandra.io.util.CompressedSegmentedFile$Builder.metadata(CompressedSegmentedFile.java:50)
  ~[cassandra-all-2.1.1.jar:2.1.1]
 at 
 org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Builder.complete(CompressedPoolingSegmentedFile.java:48)
  ~[cassandra-all-2.1.1.jar:2.1.1]
 at 
 org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:766) 
 ~[cassandra-all-2.1.1.jar:2.1.1]
 at 
 org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:725) 
 ~[cassandra-all-2.1.1.jar:2.1.1]
 at 
 org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:402) 
 ~[cassandra-all-2.1.1.jar:2.1.1]
 at 
 org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:302) 
 ~[cassandra-all-2.1.1.jar:2.1.1]
 at 
 org.apache.cassandra.io.sstable.SSTableReader$4.run(SSTableReader.java:438) 
 ~[cassandra-all-2.1.1.jar:2.1.1]
 at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) 
 ~[na:1.7.0_71]
 at java.util.concurrent.FutureTask.run(FutureTask.java:262) 
 ~[na:1.7.0_71]
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  ~[na:1.7.0_71]
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  [na:1.7.0_71]
 at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71]
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8389) org.apache.cassandra.io.sstable.CorruptSSTableException: org.apache.cassandra.io.compress.CorruptBlockException

2014-11-28 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228271#comment-14228271
 ] 

Benedict commented on CASSANDRA-8389:
-

It's not at all clear that this is a bug. Although it is possible, it seems 
likely the data is genuinely corrupted. What makes you suspect a bug, rather 
than corruption from hardware failure?

 org.apache.cassandra.io.sstable.CorruptSSTableException: 
 org.apache.cassandra.io.compress.CorruptBlockException
 ---

 Key: CASSANDRA-8389
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8389
 Project: Cassandra
  Issue Type: Bug
Reporter: Ilya Komolkin

 21:43:50.835 [CompactionExecutor:11] ERROR o.a.c.service.CassandraDaemon - 
 Exception in thread Thread[CompactionExecutor:11,1,main]
 org.apache.cassandra.io.sstable.CorruptSSTableException: 
 org.apache.cassandra.io.compress.CorruptBlockException: 
 (E:\Upsource_12391\data\cassandra\data\kernel\content-a61f1280764611e48c8e4915424c75fe\kernel-content-ka-142-Data.db):
  corruption detected, chunk at 17288734 of length 65502.
 at 
 org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBuffer(CompressedRandomAccessReader.java:92)
  ~[cassandra-all-2.1.1.jar:2.1.1]
 at 
 org.apache.cassandra.io.compress.CompressedThrottledReader.reBuffer(CompressedThrottledReader.java:41)
  ~[cassandra-all-2.1.1.jar:2.1.1]
 at 
 org.apache.cassandra.io.util.RandomAccessReader.read(RandomAccessReader.java:326)
  ~[cassandra-all-2.1.1.jar:2.1.1]
 at 
 java.io.RandomAccessFile.readFully(RandomAccessFile.java:444) ~[na:1.7.0_71]
 at 
 java.io.RandomAccessFile.readFully(RandomAccessFile.java:424) ~[na:1.7.0_71]
 at 
 org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:351)
  ~[cassandra-all-2.1.1.jar:2.1.1]
 at 
 org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:348) 
 ~[cassandra-all-2.1.1.jar:2.1.1]
 at 
 org.apache.cassandra.utils.ByteBufferUtil.readWithLength(ByteBufferUtil.java:311)
  ~[cassandra-all-2.1.1.jar:2.1.1]
 at 
 org.apache.cassandra.db.ColumnSerializer.deserializeColumnBody(ColumnSerializer.java:132)
  ~[cassandra-all-2.1.1.jar:2.1.1]
 at 
 org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(OnDiskAtom.java:86)
  ~[cassandra-all-2.1.1.jar:2.1.1]
 at 
 org.apache.cassandra.db.AbstractCell$1.computeNext(AbstractCell.java:52) 
 ~[cassandra-all-2.1.1.jar:2.1.1]
 at 
 org.apache.cassandra.db.AbstractCell$1.computeNext(AbstractCell.java:46) 
 ~[cassandra-all-2.1.1.jar:2.1.1]
 at 
 com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
  ~[guava-16.0.jar:na]
 at 
 com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) 
 ~[guava-16.0.jar:na]
 at 
 org.apache.cassandra.io.sstable.SSTableIdentityIterator.hasNext(SSTableIdentityIterator.java:116)
  ~[cassandra-all-2.1.1.jar:2.1.1]
 at 
 org.apache.cassandra.utils.MergeIterator$OneToOne.computeNext(MergeIterator.java:202)
  ~[cassandra-all-2.1.1.jar:2.1.1]
 at 
 com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
  ~[guava-16.0.jar:na]
 at 
 com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) 
 ~[guava-16.0.jar:na]
 at 
 com.google.common.collect.Iterators$7.computeNext(Iterators.java:645) 
 ~[guava-16.0.jar:na]
 at 
 com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
  ~[guava-16.0.jar:na]
 at 
 com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) 
 ~[guava-16.0.jar:na]
 at 
 org.apache.cassandra.db.ColumnIndex$Builder.buildForCompaction(ColumnIndex.java:165)
  ~[cassandra-all-2.1.1.jar:2.1.1]
 at 
 org.apache.cassandra.db.compaction.LazilyCompactedRow.write(LazilyCompactedRow.java:110)
  ~[cassandra-all-2.1.1.jar:2.1.1]
 at 
 org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:200) 
 ~[cassandra-all-2.1.1.jar:2.1.1]
 at 
 org.apache.cassandra.io.sstable.SSTableRewriter.append(SSTableRewriter.java:110)
  ~[cassandra-all-2.1.1.jar:2.1.1]
 at 
 org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:183)
  ~[cassandra-all-2.1.1.jar:2.1.1]
 at 
 org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
  ~[cassandra-all-2.1.1.jar:2.1.1]
 at 
 

[jira] [Commented] (CASSANDRA-7039) DirectByteBuffer compatible LZ4 methods

2014-11-28 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228324#comment-14228324
 ] 

Benedict commented on CASSANDRA-7039:
-

Is there much point upgrading without making use of the new API?

 DirectByteBuffer compatible LZ4 methods
 ---

 Key: CASSANDRA-7039
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7039
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Branimir Lambov
Priority: Minor
  Labels: performance
 Fix For: 3.0

 Attachments: 7039.patch


 As we move more things off-heap, it's becoming more and more essential to be 
 able to use DirectByteBuffer (or native pointers) in various places. 
 Unfortunately LZ4 doesn't currently support this operation, despite being JNI 
 based - this means we not only have to perform unnecessary copies to 
 de/compress data from a DBB, but we can also stall GC, as any JNI method 
 operating over a Java array using GetPrimitiveArrayCritical enters a critical 
 section that prevents GC for its duration. This means STWs will be at least as 
 long as any running compression/decompression (and no GC will happen until 
 they complete, so it's additive).
 We should temporarily fork (and then resubmit upstream) jpountz-lz4 to 
 support operating over a native pointer, so that we can pass a DBB or a raw 
 pointer we have allocated ourselves. This will help improve performance when 
 flushing the new offheap memtables, as well as enable us to implement 
 CASSANDRA-6726 and finish CASSANDRA-4338.
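
As a rough sketch of the kind of round trip this would enable (assuming the ByteBuffer overloads that later lz4-java releases expose; the forked API may differ):

{code}
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;
import net.jpountz.lz4.LZ4FastDecompressor;

public class DirectLz4RoundTrip
{
    public static void main(String[] args)
    {
        byte[] payload = "compress native memory without copying onto the heap".getBytes(StandardCharsets.UTF_8);

        LZ4Factory factory = LZ4Factory.fastestInstance();
        LZ4Compressor compressor = factory.fastCompressor();
        LZ4FastDecompressor decompressor = factory.fastDecompressor();

        // Everything stays in direct (native) memory: no byte[] staging copy, and
        // hence no GetPrimitiveArrayCritical section holding up GC while we work.
        ByteBuffer src = ByteBuffer.allocateDirect(payload.length);
        src.put(payload).flip();

        ByteBuffer compressed = ByteBuffer.allocateDirect(compressor.maxCompressedLength(payload.length));
        int compressedLength = compressor.compress(src, 0, payload.length, compressed, 0, compressed.capacity());

        ByteBuffer restored = ByteBuffer.allocateDirect(payload.length);
        decompressor.decompress(compressed, 0, restored, 0, payload.length);

        byte[] check = new byte[payload.length];
        restored.get(check);
        System.out.println(new String(check, StandardCharsets.UTF_8) + " (" + compressedLength + " compressed bytes)");
    }
}
{code}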



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-11-28 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228397#comment-14228397
 ] 

Benedict commented on CASSANDRA-7438:
-

I suspect segmenting the table at a finer granularity, so that each segment is 
maintained with mutual exclusivity, would achieve better percentiles in both 
cases due to keeping the maximum resize cost down. We could settle for a 
separate LRU-q per segment, even, to keep the complexity of this code down 
significantly - it is unlikely having a global LRU-q is significantly more 
accurate at predicting reuse than ~128 of them. It would also make it much 
easier to improve the replacement strategy beyond LRU, which would likely yield 
a bigger win for performance than any potential loss from reduced concurrency. 
The critical section for reads could be kept sufficiently small that 
competition would be very unlikely with the current state of C*, by performing 
the deserialization outside of it. There's a good chance this would yield a net 
positive performance impact, by reducing the cost per access without increasing 
the cost due to contention measurably (because contention would be infrequent).
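
A rough on-heap analogue of the shape being proposed (segment chosen by hash bits, one lock and one LRU per segment, much as j.u.c.CHM segments its table); the real structure would keep entries off-heap, but the concurrency story is the same:

{code}
import java.util.LinkedHashMap;
import java.util.Map;

public class SegmentedLruCache<K, V>
{
    private static final int SEGMENTS = 128;            // ~128 segments, as suggested above
    private final LinkedHashMap<K, V>[] segments;
    private final int maxEntriesPerSegment;

    @SuppressWarnings("unchecked")
    public SegmentedLruCache(int maxEntries)
    {
        maxEntriesPerSegment = Math.max(1, maxEntries / SEGMENTS);
        segments = new LinkedHashMap[SEGMENTS];
        for (int i = 0; i < SEGMENTS; i++)
        {
            // An access-ordered LinkedHashMap gives us the per-segment LRU queue for free.
            segments[i] = new LinkedHashMap<K, V>(16, 0.75f, true)
            {
                protected boolean removeEldestEntry(Map.Entry<K, V> eldest)
                {
                    return size() > maxEntriesPerSegment;
                }
            };
        }
    }

    // Pick the segment from well-mixed high bits of the hash (murmur3 in practice),
    // so segment selection doesn't reuse the bits the per-segment table uses.
    private LinkedHashMap<K, V> segmentFor(K key)
    {
        int h = key.hashCode();
        h ^= (h >>> 16);
        return segments[(h >>> 24) & (SEGMENTS - 1)];
    }

    public V get(K key)
    {
        LinkedHashMap<K, V> segment = segmentFor(key);
        synchronized (segment) { return segment.get(key); }
    }

    public void put(K key, V value)
    {
        LinkedHashMap<K, V> segment = segmentFor(key);
        synchronized (segment) { segment.put(key, value); }
    }
}
{code}

Resizes then only ever cover one segment at a time, which keeps the worst-case pause per operation down even if the replacement policy later becomes something smarter than LRU.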

 Serializing Row cache alternative (Fully off heap)
 --

 Key: CASSANDRA-7438
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
 Environment: Linux
Reporter: Vijay
Assignee: Vijay
  Labels: performance
 Fix For: 3.0

 Attachments: 0001-CASSANDRA-7438.patch, tests.zip


 Currently SerializingCache is only partially off heap; keys are still stored 
 in the JVM heap as ByteBuffers:
 * There is a higher GC cost for a reasonably big cache.
 * Some users have used the row cache efficiently in production for better 
 results, but this requires careful tuning.
 * Overhead in memory for the cache entries is relatively high.
 So the proposal for this ticket is to move the LRU cache logic completely off 
 heap and use JNI to interact with the cache. We might want to ensure that the 
 new implementation matches the existing APIs (ICache), and the implementation 
 needs to have safe memory access, low memory overhead and as few memcpys as 
 possible.
 We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-11-28 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228397#comment-14228397
 ] 

Benedict edited comment on CASSANDRA-7438 at 11/28/14 5:06 PM:
---

I suspect segmenting the table at a coarser granularity, so that each segment 
is maintained with mutual exclusivity, would achieve better percentiles in both 
cases due to keeping the maximum resize cost down. We could settle for a 
separate LRU-q per segment, even, to keep the complexity of this code down 
significantly - it is unlikely having a global LRU-q is significantly more 
accurate at predicting reuse than ~128 of them. It would also make it much 
easier to improve the replacement strategy beyond LRU, which would likely yield 
a bigger win for performance than any potential loss from reduced concurrency. 
The critical section for reads could be kept sufficiently small that 
competition would be very unlikely with the current state of C*, by performing 
the deserialization outside of it. There's a good chance this would yield a net 
positive performance impact, by reducing the cost per access without increasing 
the cost due to contention measurably (because contention would be infrequent).

edit: coarser, not finer. i.e., a la j.u.c.CHM


was (Author: benedict):
I suspect segmenting the table at a finer granularity, so that each segment is 
maintained with mutual exclusivity, would achieve better percentiles in both 
cases due to keeping the maximum resize cost down. We could settle for a 
separate LRU-q per segment, even, to keep the complexity of this code down 
significantly - it is unlikely having a global LRU-q is significantly more 
accurate at predicting reuse than ~128 of them. It would also make it much 
easier to improve the replacement strategy beyond LRU, which would likely yield 
a bigger win for performance than any potential loss from reduced concurrency. 
The critical section for reads could be kept sufficiently small that 
competition would be very unlikely with the current state of C*, by performing 
the deserialization outside of it. There's a good chance this would yield a net 
positive performance impact, by reducing the cost per access without increasing 
the cost due to contention measurably (because contention would be infrequent).

 Serializing Row cache alternative (Fully off heap)
 --

 Key: CASSANDRA-7438
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
 Environment: Linux
Reporter: Vijay
Assignee: Vijay
  Labels: performance
 Fix For: 3.0

 Attachments: 0001-CASSANDRA-7438.patch, tests.zip


 Currently SerializingCache is only partially off heap; keys are still stored 
 in the JVM heap as ByteBuffers:
 * There is a higher GC cost for a reasonably big cache.
 * Some users have used the row cache efficiently in production for better 
 results, but this requires careful tuning.
 * Overhead in memory for the cache entries is relatively high.
 So the proposal for this ticket is to move the LRU cache logic completely off 
 heap and use JNI to interact with the cache. We might want to ensure that the 
 new implementation matches the existing APIs (ICache), and the implementation 
 needs to have safe memory access, low memory overhead and as few memcpys as 
 possible.
 We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-11-28 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228563#comment-14228563
 ] 

Benedict commented on CASSANDRA-7438:
-

[~aweisberg]: In my experience segments tend to be imperfectly distributed, so 
whilst there is bunching of resizes simply because they take so long, with real 
work going on at the same time they should be a _little_ spread out. Though 
with murmur3 the distribution may be significantly more uniform than my prior 
experiments. Either way, they're performed in parallel (without coordination) 
if they coincide, so it's still an improvement.

[~vijay2...@yahoo.com]: When I talk about complexity, I mean the difficulties 
of concurrent programming magnified without the normal tools. For instance, 
there are the following concerns:

* We have a spin-lock - admittedly one that should _generally_ be uncontended, 
but on a grow or a small map this is certainly not the case, which could result 
in really problematic behaviour. Pure spin locks should not be used outside of 
the kernel. 
* The queue is maintained by a separate thread that requires signalling if it 
isn't currently performing work - which, in a real C* instance where the cost 
of linking the queue item is a fraction of the other work done to service a 
request, means we are likely to incur a costly unpark() for the majority of 
operations.
* Reads can interleave with put/replace/remove and abort the removal of an item 
from the queue, resulting in a memory leak. 
* We perform the grow on a separate thread, but prevent all reader _or_ writer 
threads from making progress by taking the locks for all buckets immediately.
* Freeing of oldSegments is still dangerous; it's just probabilistically less 
likely to happen.
* During a grow, we can lose puts because we unlock the old segments, so with 
the right (again, unlikely) interleaving of events a writer can think the old 
table is still valid.
* When growing, we only double the size of the backing table; however, since 
grows happen in the background, the updater can get ahead, meaning we remain 
behind and multiply the constant-factor overheads, collisions and contention 
until total size tails off.

These are only the obvious problems that spring to mind from 15m perusing the 
code, I'm sure there are others. This kind of stuff is really hard, and the 
approach I'm suggesting is comparatively a doddle to get right, and is likely 
faster to boot.

I'm not sure I understand your concern with segmentation creating complexity 
with the hashing... I'm proposing the exact method used by CHM. We have an 
excellent hash algorithm to distribute the data over the segments: murmurhash3. 
Although we need to be careful to not use the bits that don't have the correct 
entropy for selecting a segment. It's really no more than a two-tier hash 
table. The user doesn't need to know anything about this.

 Serializing Row cache alternative (Fully off heap)
 --

 Key: CASSANDRA-7438
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
 Environment: Linux
Reporter: Vijay
Assignee: Vijay
  Labels: performance
 Fix For: 3.0

 Attachments: 0001-CASSANDRA-7438.patch, tests.zip


 Currently SerializingCache is only partially off heap; keys are still stored 
 in the JVM heap as ByteBuffers:
 * There is a higher GC cost for a reasonably big cache.
 * Some users have used the row cache efficiently in production for better 
 results, but this requires careful tuning.
 * Overhead in memory for the cache entries is relatively high.
 So the proposal for this ticket is to move the LRU cache logic completely off 
 heap and use JNI to interact with the cache. We might want to ensure that the 
 new implementation matches the existing APIs (ICache), and the implementation 
 needs to have safe memory access, low memory overhead and as few memcpys as 
 possible.
 We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-11-28 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228563#comment-14228563
 ] 

Benedict edited comment on CASSANDRA-7438 at 11/29/14 12:23 AM:


[~aweisberg]: In my experience segments tend to be imperfectly distributed, so 
whilst there is bunching of resizes simply because they take so long, with real 
work going on at the same time they should be a _little_ spread out. Though 
with murmur3 the distribution may be significantly more uniform than my prior 
experiments. Either way, they're performed in parallel (without coordination) 
if they coincide, and are each a fraction of the size, so it's still an 
improvement.

[~vijay2...@yahoo.com]: When I talk about complexity, I mean the difficulties 
of concurrent programming magnified without the normal tools. For instance, 
there are the following concerns:

* We have a spin-lock - admittedly one that should _generally_ be uncontended, 
but on a grow or a small map this is certainly not the case, which could result 
in really problematic behaviour. Pure spin locks should not be used outside of 
the kernel. 
* The queue is maintained by a separate thread that requires signalling if it 
isn't currently performing work - which, in a real C* instance where the cost 
of linking the queue item is a fraction of the other work done to service a 
request, means we are likely to incur a costly unpark() for the majority of 
operations.
* Reads can interleave with put/replace/remove and abort the removal of an item 
from the queue, resulting in a memory leak. 
* We perform the grow on a separate thread, but prevent all reader _or_ writer 
threads from making progress by taking the locks for all buckets immediately.
* Freeing of oldSegments is still dangerous; it's just probabilistically less 
likely to happen.
* During a grow, we can lose puts because we unlock the old segments, so with 
the right (again, unlikely) interleaving of events a writer can think the old 
table is still valid.
* When growing, we only double the size of the backing table; however, since 
grows happen in the background, the updater can get ahead, meaning we remain 
behind and multiply the constant-factor overheads, collisions and contention 
until total size tails off.

These are only the obvious problems that spring to mind from 15m perusing the 
code, I'm sure there are others. This kind of stuff is really hard, and the 
approach I'm suggesting is comparatively a doddle to get right, and is likely 
faster to boot.

I'm not sure I understand your concern with segmentation creating complexity 
with the hashing... I'm proposing the exact method used by CHM. We have an 
excellent hash algorithm to distribute the data over the segments: murmurhash3. 
Although we need to be careful to not use the bits that don't have the correct 
entropy for selecting a segment. 

Think of it as simply implementing an off-heap LinkedHashMap, wrapping it in a 
lock, and having an array of them. The user doesn't need to know anything about 
this.


was (Author: benedict):
[~aweisberg]: In my experience segments tend to be imperfectly distributed, so 
whilst there is bunching of resizes simply because they take so long, with real 
work going on at the same time they should be a _little_ spread out. Though 
with murmur3 the distribution may be significantly more uniform than my prior 
experiments. Either way, they're performed in parallel (without coordination) 
if they coincide, so it's still an improvement.

[~vijay2...@yahoo.com]: When I talk about complexity, I mean the difficulties 
of concurrent programming magnified without the normal tools. For instance, 
there are the following concerns:

* We have a spin-lock - admittedly one that should _generally_ be uncontended, 
but on a grow or a small map this is certainly not the case, which could result 
in really problematic behaviour. Pure spin locks should not be used outside of 
the kernel. 
* The queue is maintained by a separate thread that requires signalling if it 
isn't currently performing work - which, in a real C* instance where the cost 
of linking the queue item is a fraction of the other work done to service a 
request, means we are likely to incur a costly unpark() for the majority of 
operations.
* Reads can interleave with put/replace/remove and abort the removal of an item 
from the queue, resulting in a memory leak. 
* We perform the grow on a separate thread, but prevent all reader _or_ writer 
threads from making progress by taking the locks for all buckets immediately.
* Freeing of oldSegments is still dangerous; it's just probabilistically less 
likely to happen.
* During a grow, we can lose puts because we unlock the old segments, so with 
the right (again, unlikely) interleaving of events a writer can think the old 
table is still valid.
* When growing, we only double the size of the backing 

[jira] [Commented] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-11-28 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228575#comment-14228575
 ] 

Benedict commented on CASSANDRA-7438:
-

bq. I am 100% sure

Never be 100% sure with concurrency, please :)

bq. test case again plz. I don't think this can happen too. I spend a lot of 
time testing the exact scenario.

You have too much faith in tests. You are testing under ideal conditions - two 
of the race conditions I highlighted will only rear their heads infrequently, 
most likely when the system is under uncharacteristic load causing very choppy 
scheduling. Analysis of the code is paramount. I will not produce a test case 
as I do not have time, however I will give you an interleaving of events that 
would trigger one of them.

Thread A is deleting an item, and is in LRUC.invalidate(), Thread B is looking 
up the same item, in LRUC.get().
A: 187: map.remove()
B: 154: map.get()
A: 191: queue.deleteFromQueue()
B: 158: queue.addToQueue()

In particular, addToQueue() sets the markAsDeleted flag to false, undoing the 
prior work of deleteFromQueue.

bq. Thread is only signalled if they are not performing operation. I am lost.

It will generally not be performing an operation, because it can complete its 
work faster than any of the producers can produce it in normal C* operation.

 Serializing Row cache alternative (Fully off heap)
 --

 Key: CASSANDRA-7438
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
 Environment: Linux
Reporter: Vijay
Assignee: Vijay
  Labels: performance
 Fix For: 3.0

 Attachments: 0001-CASSANDRA-7438.patch, tests.zip


 Currently SerializingCache is only partially off heap; keys are still stored 
 in the JVM heap as ByteBuffers:
 * There is a higher GC cost for a reasonably big cache.
 * Some users have used the row cache efficiently in production for better 
 results, but this requires careful tuning.
 * Overhead in memory for the cache entries is relatively high.
 So the proposal for this ticket is to move the LRU cache logic completely off 
 heap and use JNI to interact with the cache. We might want to ensure that the 
 new implementation matches the existing APIs (ICache), and the implementation 
 needs to have safe memory access, low memory overhead and as few memcpys as 
 possible.
 We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-11-29 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228693#comment-14228693
 ] 

Benedict commented on CASSANDRA-7438:
-

Invert those two statements and the behaviour is still broken.

B: 154: map.get()
A: 187: map.remove()
A: 191: queue.deleteFromQueue()
B: 158: queue.addToQueue()

 Serializing Row cache alternative (Fully off heap)
 --

 Key: CASSANDRA-7438
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
 Environment: Linux
Reporter: Vijay
Assignee: Vijay
  Labels: performance
 Fix For: 3.0

 Attachments: 0001-CASSANDRA-7438.patch, tests.zip


 Currently SerializingCache is only partially off heap; keys are still stored 
 in the JVM heap as ByteBuffers:
 * There is a higher GC cost for a reasonably big cache.
 * Some users have used the row cache efficiently in production for better 
 results, but this requires careful tuning.
 * Overhead in memory for the cache entries is relatively high.
 So the proposal for this ticket is to move the LRU cache logic completely off 
 heap and use JNI to interact with the cache. We might want to ensure that the 
 new implementation matches the existing APIs (ICache), and the implementation 
 needs to have safe memory access, low memory overhead and as few memcpys as 
 possible.
 We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-11-29 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228693#comment-14228693
 ] 

Benedict edited comment on CASSANDRA-7438 at 11/29/14 9:40 AM:
---

Good point! But invert those two statements and the behaviour is still broken.

B: 154: map.get()
A: 187: map.remove()
A: 191: queue.deleteFromQueue()
B: 158: queue.addToQueue()


was (Author: benedict):
Invert those two statements and the behaviour is still broken.

B: 154: map.get()
A: 187: map.remove()
A: 191: queue.deleteFromQueue()
B: 158: queue.addToQueue()

 Serializing Row cache alternative (Fully off heap)
 --

 Key: CASSANDRA-7438
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
 Environment: Linux
Reporter: Vijay
Assignee: Vijay
  Labels: performance
 Fix For: 3.0

 Attachments: 0001-CASSANDRA-7438.patch, tests.zip


 Currently SerializingCache is only partially off heap; keys are still stored 
 in the JVM heap as ByteBuffers:
 * There is a higher GC cost for a reasonably big cache.
 * Some users have used the row cache efficiently in production for better 
 results, but this requires careful tuning.
 * Overhead in memory for the cache entries is relatively high.
 So the proposal for this ticket is to move the LRU cache logic completely off 
 heap and use JNI to interact with the cache. We might want to ensure that the 
 new implementation matches the existing APIs (ICache), and the implementation 
 needs to have safe memory access, low memory overhead and as few memcpys as 
 possible.
 We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7203) Flush (and Compact) High Traffic Partitions Separately

2014-11-29 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228729#comment-14228729
 ] 

Benedict commented on CASSANDRA-7203:
-

[~jbellis]: Are we sure that's a good policy? It's generally accepted that a 
lot of work (esp. that involving people, e.g. Netflix, Apple) follows a 
zipfian/extreme distribution. If we can prevent the most voluminous customers 
from degrading performance for everybody, that's surely a pretty big win? I'm 
not suggesting this be attacked immediately, but in the medium-to-long term it 
seems like a pretty decent yield - and could be applied on both read and write. 
If you have 1% of your data appearing in ~100% of sstables, but the other 99% 
appearing in only ~1% of your sstables, you're compacting an order of magnitude 
more often than you might otherwise need to.

Perhaps [~jasobrown] and [~kohlisankalp] have an idea of how realistic this 
scenario is?

 Flush (and Compact) High Traffic Partitions Separately
 --

 Key: CASSANDRA-7203
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7203
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
  Labels: compaction, performance

 An idea possibly worth exploring is the use of streaming count-min sketches 
 to collect data over the uptime of a server to estimate the velocity of 
 different partitions, so that high-volume partitions can be flushed 
 separately on the assumption that they will be much smaller in number, thus 
 reducing write amplification by permitting compaction independently of any 
 low-velocity data.
 Whilst the idea is reasonably straightforward, it seems that the biggest 
 problem here will be defining any success metric. Obviously any workload 
 following an exponential/zipf/extreme distribution is likely to benefit from 
 such an approach, but whether or not that would translate in real terms is 
 another matter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-6976) Determining replicas to query is very slow with large numbers of nodes or vnodes

2014-11-29 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228737#comment-14228737
 ] 

Benedict commented on CASSANDRA-6976:
-

[~jbellis] [~aweisberg] I have a few remaining concerns, although I agree this 
isn't _super_ pressing:

* the benchmark as tested will have perfect L1 cache occupancy, which in a real 
scenario is unlikely
* the benchmarks did not account for the following (all of which should have a 
negative impact on the runtime of getRangeSlice itself):
** running with dynamic snitch (that is being updated simultaneously)
** running with network topology snitch underneath the dynamic snitch, and/or 
by itself
** running with, say, 3+ DCs, RF=3

The benchmark looks like it ran with SimpleSnitch, RF=1, 1 DC - i.e., ideal 
conditions.

This won't likely make an order of magnitude difference, but I guess the 
question is whether we care about being tremendously slow for full table scans of 
_small_ tables. Programmatically fetching the entire contents of a lookup 
table, for instance, would be badly affected by this behaviour even without the 
changes I propose to the methodology.

 Determining replicas to query is very slow with large numbers of nodes or 
 vnodes
 

 Key: CASSANDRA-6976
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6976
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Benedict
Assignee: Ariel Weisberg
  Labels: performance
 Attachments: GetRestrictedRanges.java, jmh_output.txt, 
 jmh_output_murmur3.txt, make_jmh_work.patch


 As described in CASSANDRA-6906, this can be ~100ms for a relatively small 
 cluster with vnodes, which is longer than it will spend in transit on the 
 network. This should be much faster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-6976) Determining replicas to query is very slow with large numbers of nodes or vnodes

2014-11-30 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14229269#comment-14229269
 ] 

Benedict commented on CASSANDRA-6976:
-

On second thoughts, ignore that sentiment entirely. We don't really have any 
concept of a lookup table, and we'll have to address that directly when we 
introduce enum types which is a better place. 

I guess what really bugs me about this, and what I assumed would be related to 
the problem (but patently can't be, given the default behaviour), is that after 
calculating natural endpoints, we then sort them (based on a couple of hashmap 
lookups for each endpoint) for every token range, and also for every single 
normal query. This sort is performed over RF*DC items in either case, even for 
queries routed directly to the owning node with CL ONE. I was hoping we'd fix 
that as a result of this work, since that's a lot of duplicated effort, but 
that hardly seems sensible now. What we definitely _should_ do, though, is make 
sure we're (in general) benchmarking behaviour over common config, as our 
default test configuration is not at all representative.
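
As a rough illustration of the "sort only when the sort order changes" idea - this is a hypothetical memoisation keyed on a snitch-order version, not the actual StorageProxy/snitch code:

{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical: cache the sorted replica list and only re-sort when the snitch
// publishes a new ordering version (e.g. after a dynamic snitch score update),
// instead of sorting RF*DC entries for every single query.
class SortedReplicaCache<E>
{
    private volatile long cachedVersion = -1;
    private volatile List<E> cachedOrder = new ArrayList<>();

    List<E> sortedEndpoints(long snitchVersion, List<E> endpoints, Comparator<E> snitchOrder)
    {
        List<E> order = cachedOrder;
        if (snitchVersion == cachedVersion && order.size() == endpoints.size())
            return order;                       // fast path: no re-sort

        List<E> sorted = new ArrayList<>(endpoints);
        sorted.sort(snitchOrder);               // slow path, only on a version change
        cachedOrder = sorted;                   // benign race: worst case we sort twice
        cachedVersion = snitchVersion;
        return sorted;
    }
}
{code}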

 Determining replicas to query is very slow with large numbers of nodes or 
 vnodes
 

 Key: CASSANDRA-6976
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6976
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Benedict
Assignee: Ariel Weisberg
  Labels: performance
 Attachments: GetRestrictedRanges.java, jmh_output.txt, 
 jmh_output_murmur3.txt, make_jmh_work.patch


 As described in CASSANDRA-6906, this can be ~100ms for a relatively small 
 cluster with vnodes, which is longer than it will spend in transit on the 
 network. This should be much faster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8341) Expose time spent in each thread pool

2014-12-01 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14229780#comment-14229780
 ] 

Benedict commented on CASSANDRA-8341:
-

SEPWorker already grabs the nanoTime on exiting and entering its spin phase, so 
tracking this would be pretty much free (we'd need to check it once if we 
swapped the executor we're working on without entering a spinning state). 
Flushing pent-up data is pretty trivial; you can set a maximum time to buffer, 
which ensures it's never more than a few seconds (or millis) out of date, say - 
enough to keep the cost too small to measure.

I'm a little dubious about tracking two completely different properties as the 
same thing though. CPUTime cannot be composed with nanoTime sensibly, so we 
either want to track one or the other across all executors. Since the other 
executors are all the ones that do infrequent expensive work (which is 
explicitly why they haven't been transitioned to SEP), tracking nanoTime on 
them won't be an appreciable cost.
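
A minimal sketch of that buffering, assuming a shared LongAdder metric and a worker-local accumulator (hypothetical names, not the actual SEPWorker code):

{code:java}
import java.util.concurrent.atomic.LongAdder;

// Sketch: a worker accumulates busy-time locally and flushes it to a shared
// counter at most every FLUSH_INTERVAL_NANOS, reusing the nanoTime() calls it
// already makes around each task.
class BusyTimeTracker
{
    static final long FLUSH_INTERVAL_NANOS = 1_000_000_000L; // publish at least once a second

    final LongAdder publishedBusyNanos = new LongAdder();    // what the metric would read
    private long pendingNanos;                                // local to the owning worker thread
    private long lastFlush = System.nanoTime();

    void recordTask(Runnable task)
    {
        long start = System.nanoTime();
        task.run();
        long end = System.nanoTime();
        pendingNanos += end - start;
        if (end - lastFlush >= FLUSH_INTERVAL_NANOS)
        {
            publishedBusyNanos.add(pendingNanos);             // flush the pent-up data
            pendingNanos = 0;
            lastFlush = end;
        }
    }
}
{code}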

 Expose time spent in each thread pool
 -

 Key: CASSANDRA-8341
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8341
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
Reporter: Chris Lohfink
Priority: Minor
  Labels: metrics
 Attachments: 8341.patch, 8341v2.txt


 Can increment a counter with time spent in each queue.  This can provide 
 context on how much time is spent percentage-wise in each stage.  
 Additionally this can be used with Little's law in future if we ever want to 
 try to tune the size of the pools.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8341) Expose time spent in each thread pool

2014-12-01 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14229805#comment-14229805
 ] 

Benedict commented on CASSANDRA-8341:
-

Ah, that's a good question: are we talking about queue latency or time spent 
processing each queue? The two are very different, and it sounded like we were 
discussing the latter, but the ticket description does sound more like the 
former.

 Expose time spent in each thread pool
 -

 Key: CASSANDRA-8341
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8341
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
Reporter: Chris Lohfink
Priority: Minor
  Labels: metrics
 Attachments: 8341.patch, 8341v2.txt


 Can increment a counter with time spent in each queue.  This can provide 
 context on how much time is spent percentage-wise in each stage.  
 Additionally this can be used with Little's law in future if we ever want to 
 try to tune the size of the pools.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7688) Add data sizing to a system table

2014-12-01 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14229818#comment-14229818
 ] 

Benedict commented on CASSANDRA-7688:
-

This is a fundamentally difficult problem, and to be answered accurately 
basically requires a full compaction. We can track or estimate this data for 
any given sstable easily, and we can estimate the number of overlapping 
 partitions between two sstables (though I'm unsure of the accuracy if we 
 composed this data across many sstables), but we cannot say how many rows 
within each overlapping partition overlap. The best we could do is probably 
sample some overlapping partitions to see what proportion of row overlap tends 
to prevail, and hope it is representative; if we assume a normal distribution 
of overlap ratio we could return error bounds.

I don't think it's likely this data could be maintained live, at least not 
accurately, or not without significant cost. It would be an on-demand 
calculation that would be moderately expensive. 
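
For illustration, a sketch of the sampling approach under the normality assumption, with entirely made-up sample values and hypothetical names:

{code:java}
// Given sampled per-partition row-overlap ratios, report the mean overlap with
// a normal-approximation 95% confidence interval.
class OverlapEstimate
{
    static double[] estimate(double[] sampledOverlapRatios)
    {
        int n = sampledOverlapRatios.length;
        double mean = 0;
        for (double r : sampledOverlapRatios)
            mean += r;
        mean /= n;

        double var = 0;
        for (double r : sampledOverlapRatios)
            var += (r - mean) * (r - mean);
        var /= (n - 1);

        double halfWidth = 1.96 * Math.sqrt(var / n); // assumes the sample mean is ~normal
        return new double[] { mean, mean - halfWidth, mean + halfWidth };
    }

    public static void main(String[] args)
    {
        double[] sample = { 0.10, 0.25, 0.05, 0.40, 0.15 }; // made-up sampled ratios
        double[] e = estimate(sample);
        System.out.printf("overlap ~ %.2f (95%% CI %.2f..%.2f)%n", e[0], e[1], e[2]);
    }
}
{code}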

 Add data sizing to a system table
 -

 Key: CASSANDRA-7688
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7688
 Project: Cassandra
  Issue Type: New Feature
Reporter: Jeremiah Jordan
 Fix For: 2.1.3


 Currently you can't implement something similar to describe_splits_ex purely 
 from a native protocol driver.  
 https://datastax-oss.atlassian.net/browse/JAVA-312 is open to expose easily 
 getting ownership information to a client in the java-driver.  But you still 
 need the data sizing part to get splits of a given size.  We should add the 
 sizing information to a system table so that native clients can get to it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7688) Add data sizing to a system table

2014-12-01 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14229831#comment-14229831
 ] 

Benedict commented on CASSANDRA-7688:
-

I'm talking about estimates. We cannot likely even estimate without pretty 
significant cost. Sampling column counts is pretty easy, but knowing how many 
cql rows there are for any merged row is not. There are tricks to make it 
easier, but there are datasets for which the tricks will not work, and any 
estimate would be complete guesswork without sampling the data.

 Add data sizing to a system table
 -

 Key: CASSANDRA-7688
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7688
 Project: Cassandra
  Issue Type: New Feature
Reporter: Jeremiah Jordan
 Fix For: 2.1.3


 Currently you can't implement something similar to describe_splits_ex purely 
 from a native protocol driver.  
 https://datastax-oss.atlassian.net/browse/JAVA-312 is open to expose easily 
 getting ownership information to a client in the java-driver.  But you still 
 need the data sizing part to get splits of a given size.  We should add the 
 sizing information to a system table so that native clients can get to it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8341) Expose time spent in each thread pool

2014-12-01 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14229872#comment-14229872
 ] 

Benedict commented on CASSANDRA-8341:
-

That is difficult, since we have stages that perform work that does not consume 
CPU. The RPC stages (for Thrift and CQL) both spend the majority of their time 
_waiting_ for the relevant work stage to complete. The proposed approaches 
would count this as busy time. The read and write stages can also block on IO, 
the former more often than the latter, and in either case we would count 
that time erroneously.


 Expose time spent in each thread pool
 -

 Key: CASSANDRA-8341
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8341
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
Reporter: Chris Lohfink
Priority: Minor
  Labels: metrics
 Attachments: 8341.patch, 8341v2.txt


 Can increment a counter with time spent in each queue.  This can provide 
 context on how much time is spent percentage-wise in each stage.  
 Additionally this can be used with Little's law in future if we ever want to 
 try to tune the size of the pools.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-6976) Determining replicas to query is very slow with large numbers of nodes or vnodes

2014-12-01 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14230181#comment-14230181
 ] 

Benedict commented on CASSANDRA-6976:
-

bq. I recall someone on the Mechanical Sympathy group pointing out that you can 
warm an entire last level cache in some small amount of time, I think it was 
30ish milliseconds. I can't find the post and I could be very wrong, but it was 
definitely milliseconds. My guess is that in the big picture cache effects 
aren't changing the narrative that this takes 10s to 100s of milliseconds.

Sure it does - if an action that is likely memory bound (like this one - after 
all, it does very little in the way of computation and doesn't touch any disk) 
takes time X with a warmed cache, and only touches data that can fit in cache, 
it will take X*K with a cold cache for some K (significantly) > 1 - and in real 
operation, especially with many tokens, there is a quite reasonable likelihood 
of a cold cache given the lack of locality and amount of data as the cluster 
grows. This is actually one possibility for improving this behaviour, if we 
cared at all - ensuring the number of cache lines touched is kept low, working 
with primitives for the token ranges and inet addresses to reduce the constant 
factors. This would also improve the normal code paths, not just range slices.

bq. If it is slow, what is the solution? Even if we lazily materialize the 
ranges the run time of fetching batches of results dominates the in-memory 
compute of getRestrictedRanges. When we talked use cases it seems like people 
would using paging programmatically so only console users would see this poor 
performance outside of the lookup table use case you mentioned.

For a lookup (i.e. small) table query, or a range query that can be serviced 
entirely by the local node, it is quite unlikely that the fetching would 
dominate when talking about timescales >= 1ms.

bq. I didn't quite follow this. Are you talking about getLiveSortedEndpoints 
called from getRangeSlice? I haven't dug deep enough into getRangeSlice to tell 
you where the time in that goes exactly. I would have to do it again and insert 
some probes. I assumed it was dominated by sending remote requests.

Yes - for your benchmark it would not have spent much time here, since the 
sort would be a no-op and the list a single entry, but as the number of data 
centres and replication factor grows, and with use of NetworkTopologyStrategy, 
this could be a significant time expenditure. In the aggregate it will also 
consume a certain percentage of CPU time across all queries. However, since the 
sort order is actually pretty consistent, sorting only when the sort order 
changes would be a way to eliminate this cost.

bq. Benchmarking in what scope? This microbenchmark, defaults for workloads in 
cstar, tribal knowledge when doing performance work?

Like I said, please do feel free to drop this particular line of enquiry for the 
moment, since even with all of the above I doubt this is a pressing matter. But 
I don't think this is the end of the topic entirely - at some point this cost 
will be a more measurable percentage of work done. But these kinds of costs are 
simply not a part of any of our current benchmarking methodology, since our 
default configs avoid the code paths entirely (by having no DCs, low RF, 
low node count, no tokens, and SimpleStrategy), and that is something we should 
address. 

In the meantime it might be worth having a simple short-circuit path for 
queries that may be answered by the local node only, though.
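
A purely hypothetical sketch of what such a short-circuit could look like (the types and method names here are invented for illustration, not taken from the codebase):

{code:java}
import java.util.List;

// Hypothetical: when every queried range is replicated locally and the
// consistency level only needs one replica, return the local endpoint
// directly instead of building and sorting the full endpoint list per range.
class LocalShortCircuit<R, E>
{
    interface ReplicaLookup<Range, Endpoint>
    {
        List<Endpoint> endpointsFor(Range range);
    }

    E localOnlyTarget(List<R> ranges, ReplicaLookup<R, E> lookup, E localEndpoint, int replicasNeeded)
    {
        if (replicasNeeded != 1)
            return null;                                    // caller falls back to the normal path
        for (R range : ranges)
            if (!lookup.endpointsFor(range).contains(localEndpoint))
                return null;                                // some range needs a remote replica
        return localEndpoint;                               // safe to serve everything locally
    }
}
{code}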

 Determining replicas to query is very slow with large numbers of nodes or 
 vnodes
 

 Key: CASSANDRA-6976
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6976
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Benedict
Assignee: Ariel Weisberg
  Labels: performance
 Attachments: GetRestrictedRanges.java, jmh_output.txt, 
 jmh_output_murmur3.txt, make_jmh_work.patch


 As described in CASSANDRA-6906, this can be ~100ms for a relatively small 
 cluster with vnodes, which is longer than it will spend in transit on the 
 network. This should be much faster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-6976) Determining replicas to query is very slow with large numbers of nodes or vnodes

2014-12-01 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14230280#comment-14230280
 ] 

Benedict commented on CASSANDRA-6976:
-

bq.  I don't see a reason to drop it just because the ticket got caught up in 
implementation details and not the user facing issue we want to address.

Well, given the test case that originally produced this concern almost 
certainly had the same methodology you had, I suspect you did indeed track down 
the problem to a non-warm JVM.

bq. The entire thing runs in 60 milliseconds with 2000 tokens. That is 2x the 
time to warm up the cache (assuming a correct number for warmup). 

You're assuming that (1) the cache stays warm in normal operation and (2) that 
the warmup figures you have are for similar data distributions and (3) the 
warmup is simply a matter of presence in cache, rather than likelihood of 
eviction (4) all this behaviour has no negative impact outside of the method 
itself. But, like I said, I agree it won't likely make an order of magnitude 
difference by itself. Especially not with current state of C*.

bq. Range queries are slow because they produce a lot of ranges.

Did we determine that if the _result_ is a narrow range the performance is 
significantly faster? Because this stemmed from a situation where the entire 
contents were known to be node-local (because the data was local only, it 
wasn't actually distributed). I wouldn't be at all surprised if it was fine, 
given the likely cause you tracked down, but I don't think we actually 
demonstrated that?

bq. What queries could identify that this shortcut is possible?

I am referring here to the more general case of getLiveSortedEndpoints, which 
is used much more widely. But, like I said, I raised this largely because of a 
general nagging that this whole area of code has many inefficiencies, not 
because it is likely they really matter. The only thing actionable is that we 
*should* take steps to ensure our default (and common) test and benchmark 
configs more accurately represent real cluster configs because we simply do not 
exercise these codepaths right now from a performance perspective.

 Determining replicas to query is very slow with large numbers of nodes or 
 vnodes
 

 Key: CASSANDRA-6976
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6976
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Benedict
Assignee: Ariel Weisberg
  Labels: performance
 Attachments: GetRestrictedRanges.java, jmh_output.txt, 
 jmh_output_murmur3.txt, make_jmh_work.patch


 As described in CASSANDRA-6906, this can be ~100ms for a relatively small 
 cluster with vnodes, which is longer than it will spend in transit on the 
 network. This should be much faster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7203) Flush (and Compact) High Traffic Partitions Separately

2014-12-02 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14231203#comment-14231203
 ] 

Benedict commented on CASSANDRA-7203:
-

I was _mostly_ hoping to get your and [~kohlisankalp]'s views on _whether those 
workload skews occur_. Then we could at some point later get into the 
nitty-gritty of whether it would be worth it :-)

The idea wouldn't really be to special-case anything except flush, and to 
depend on (and implement after) improvements we have either envisaged or could 
later envisage to avoid compacting sstables with low predicted overlap of 
partitions. I.e. it would have the potential to improve the benefit of such 
schemes, by increasing the number of sstable pairings they can rule out.

 Flush (and Compact) High Traffic Partitions Separately
 --

 Key: CASSANDRA-7203
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7203
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
  Labels: compaction, performance

 An idea possibly worth exploring is the use of streaming count-min sketches 
 to collect data over the uptime of a server to estimate the velocity of 
 different partitions, so that high-volume partitions can be flushed 
 separately on the assumption that they will be much smaller in number, thus 
 reducing write amplification by permitting compaction independently of any 
 low-velocity data.
 Whilst the idea is reasonably straightforward, it seems that the biggest 
 problem here will be defining any success metric. Obviously any workload 
 following an exponential/zipf/extreme distribution is likely to benefit from 
 such an approach, but whether or not that would translate in real terms is 
 another matter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8018) Cassandra seems to insert twice in custom PerColumnSecondaryIndex

2014-12-02 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14231249#comment-14231249
 ] 

Benedict commented on CASSANDRA-8018:
-

Good catch. A few nits on the patch, but I'll make them and commit:

* iff is never a typo; it means if and only if
* we should remove the call from inside addNewKey, rather than outside it, as 
that is the call that was originally meant to be removed. This way all of the 
calls happen in the same logical unit of code.

 Cassandra seems to insert twice in custom PerColumnSecondaryIndex
 -

 Key: CASSANDRA-8018
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8018
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Pavel Chlupacek
Assignee: Benjamin Lerer
 Fix For: 2.1.3

 Attachments: CASSANDRA-8018.txt


 When inserting data into Cassandra 2.1.0 into a table with a custom secondary 
 index, the Cell is inserted twice if inserting a new entry into a row with the 
 same rowId but different clustering columns. 
 
 CREATE KEYSPACE fulltext WITH replication = {'class': 'SimpleStrategy',  
 'replication_factor' : 1};
 CREATE TABLE fulltext.test ( id uuid, name text, name2 text, json varchar, 
 lucene text, primary key ( id , name));
 CREATE CUSTOM INDEX lucene_idx on fulltext.test(lucene) using 
 'com.spinoco.fulltext.cassandra.TestIndex'; 
 // this causes only one insert
  insertInto(fulltext,test)
   .value(id, id1.uuid)
   .value(name, goosh1) 
   .value(json, TestContent.message1.asJson)
 // this causes 2 inserts to be done 
  insertInto(fulltext,test)
 .value(id, id1.uuid)
 .value(name, goosh2)
 .value(json, TestContent.message2.asJson)
 /// stacktraces for inserts (always same, for 1st and 2nd insert)
 custom indexer stacktraces and then
   at 
 org.apache.cassandra.db.index.SecondaryIndexManager$StandardUpdater.insert(SecondaryIndexManager.java:707)
   at 
 org.apache.cassandra.db.AtomicBTreeColumns$ColumnUpdater.apply(AtomicBTreeColumns.java:344)
   at 
 org.apache.cassandra.db.AtomicBTreeColumns$ColumnUpdater.apply(AtomicBTreeColumns.java:319)
   at 
 org.apache.cassandra.utils.btree.NodeBuilder.addNewKey(NodeBuilder.java:323)
   at 
 org.apache.cassandra.utils.btree.NodeBuilder.update(NodeBuilder.java:191)
   at org.apache.cassandra.utils.btree.Builder.update(Builder.java:74)
   at org.apache.cassandra.utils.btree.BTree.update(BTree.java:186)
   at 
 org.apache.cassandra.db.AtomicBTreeColumns.addAllWithSizeDelta(AtomicBTreeColumns.java:189)
   at org.apache.cassandra.db.Memtable.put(Memtable.java:194)
   at 
 org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1142)
   at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:394)
   at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:351)
   at org.apache.cassandra.db.Mutation.apply(Mutation.java:214)
   at 
 org.apache.cassandra.service.StorageProxy$7.runMayThrow(StorageProxy.java:970)
   at 
 org.apache.cassandra.service.StorageProxy$LocalMutationRunnable.run(StorageProxy.java:2080)
   at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
   at 
 org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:163)
   at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:103)
   at java.lang.Thread.run(Thread.java:744)
  Note that cell, rowKey and Group in public abstract void 
 insert(ByteBuffer rowKey, Cell col, OpOrder.Group opGroup); have the same 
 identity for both successive calls. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7882) Memtable slab allocation should scale logarithmically to improve occupancy rate

2014-12-02 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14231374#comment-14231374
 ] 

Benedict commented on CASSANDRA-7882:
-

Hi Jay,

I've been away for the past two months, so sorry this got left by the wayside 
in the meantime. I'll get around to reviewing it shortly.

 Memtable slab allocation should scale logarithmically to improve occupancy 
 rate
 ---

 Key: CASSANDRA-7882
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7882
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Jay Patel
Assignee: Jay Patel
  Labels: performance
 Fix For: 2.1.3

 Attachments: trunk-7882.txt


 CASSANDRA-5935 allows an option to disable region-based allocation for on-heap 
 memtables, but there is no option to disable it for off-heap memtables 
 (memtable_allocation_type: offheap_objects). 
 Disabling region-based allocation will allow us to pack more tables in the 
 schema, since a minimum 1MB region won't be allocated per table. The downside 
 can be more fragmentation, which should be controllable by using a better 
 allocator like JEMalloc.
 How about below option in yaml?:
 memtable_allocation_type: unslabbed_offheap_objects
 Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation

2014-12-02 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14231425#comment-14231425
 ] 

Benedict commented on CASSANDRA-7032:
-

It's plain old statistics. Have a look at the java code I attached that 
simulates and reports the level of imbalance. Currently we randomly assign the 
tokens, and this results in some nodes happening to end up with all of their 
token ranges narrow relative to the other existing tokens, and other nodes 
with wider ones.

Consistent hashing is what Riak uses to achieve balance, which is one approach. 
Rendezvous hashing is another. But these would likely involve changing the 
tokens of every node in the cluster on adding a new node. This would be 
acceptable, but I expect with the amount of state space to work with we can 
design an algorithm that guarantees low bounds of imbalance without having to 
change the tokens assigned to any existing nodes.
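
Not the attached TestVNodeAllocation.java, but a tiny stand-alone simulation of the same statistical effect - random tokens on a unit ring, then the worst node's ownership relative to its fair share:

{code:java}
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Illustrative only: assign random tokens on a [0,1) ring, compute each node's
// ownership, and report the worst deviation from the ideal 1/nodes share.
class RandomTokenImbalance
{
    public static void main(String[] args)
    {
        int nodes = 100, vnodesPerNode = 16;
        Random rnd = new Random(42);
        TreeMap<Double, Integer> ring = new TreeMap<>();      // token -> owning node
        for (int n = 0; n < nodes; n++)
            for (int v = 0; v < vnodesPerNode; v++)
                ring.put(rnd.nextDouble(), n);

        double[] owned = new double[nodes];
        Double prev = null;
        for (Map.Entry<Double, Integer> e : ring.entrySet())
        {
            if (prev != null)
                owned[e.getValue()] += e.getKey() - prev;     // each token owns (prev, token]
            prev = e.getKey();
        }
        // the wrap-around range belongs to the first token's owner
        owned[ring.firstEntry().getValue()] += 1.0 - prev + ring.firstKey();

        double ideal = 1.0 / nodes, worst = 0;
        for (double o : owned)
            worst = Math.max(worst, o / ideal);
        System.out.printf("worst node owns %.2fx its fair share%n", worst);
    }
}
{code}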

 Improve vnode allocation
 

 Key: CASSANDRA-7032
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7032
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Branimir Lambov
  Labels: performance, vnodes
 Fix For: 3.0

 Attachments: TestVNodeAllocation.java, TestVNodeAllocation.java


 It's been known for a little while that random vnode allocation causes 
 hotspots of ownership. It should be possible to improve dramatically on this 
 with deterministic allocation. I have quickly thrown together a simple greedy 
 algorithm that allocates vnodes efficiently, and will repair hotspots in a 
 randomly allocated cluster gradually as more nodes are added, and also 
 ensures that token ranges are fairly evenly spread between nodes (somewhat 
 tunably so). The allocation still permits slight discrepancies in ownership, 
 but it is bound by the inverse of the size of the cluster (as opposed to 
 random allocation, which strangely gets worse as the cluster size increases). 
 I'm sure there is a decent dynamic programming solution to this that would be 
 even better.
 If on joining the ring a new node were to CAS a shared table where a 
 canonical allocation of token ranges lives after running this (or a similar) 
 algorithm, we could then get guaranteed bounds on the ownership distribution 
 in a cluster. This will also help for CASSANDRA-6696.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation

2014-12-02 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14231433#comment-14231433
 ] 

Benedict commented on CASSANDRA-7032:
-

I should note that the dovetailing with CASSANDRA-6696 is very important. 
Acceptable imbalance _per node_ is actually not _too_ tricky to deliver. But 
ensuring each disk on each node will have a fair share of the pie is a little 
harder.

 Improve vnode allocation
 

 Key: CASSANDRA-7032
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7032
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Branimir Lambov
  Labels: performance, vnodes
 Fix For: 3.0

 Attachments: TestVNodeAllocation.java, TestVNodeAllocation.java


 It's been known for a little while that random vnode allocation causes 
 hotspots of ownership. It should be possible to improve dramatically on this 
 with deterministic allocation. I have quickly thrown together a simple greedy 
 algorithm that allocates vnodes efficiently, and will repair hotspots in a 
 randomly allocated cluster gradually as more nodes are added, and also 
 ensures that token ranges are fairly evenly spread between nodes (somewhat 
 tunably so). The allocation still permits slight discrepancies in ownership, 
 but it is bound by the inverse of the size of the cluster (as opposed to 
 random allocation, which strangely gets worse as the cluster size increases). 
 I'm sure there is a decent dynamic programming solution to this that would be 
 even better.
 If on joining the ring a new node were to CAS a shared table where a 
 canonical allocation of token ranges lives after running this (or a similar) 
 algorithm, we could then get guaranteed bounds on the ownership distribution 
 in a cluster. This will also help for CASSANDRA-6696.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7203) Flush (and Compact) High Traffic Partitions Separately

2014-12-02 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232045#comment-14232045
 ] 

Benedict commented on CASSANDRA-7203:
-

It wasn't intended to be an immediate focus; I just wanted an idea of whether 
such data distributions occur, to see if it might _ever_ be worth investigating. But I 
can see I'm fighting a losing battle!

 Flush (and Compact) High Traffic Partitions Separately
 --

 Key: CASSANDRA-7203
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7203
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
  Labels: compaction, performance

 An idea possibly worth exploring is the use of streaming count-min sketches 
 to collect data over the uptime of a server to estimate the velocity of 
 different partitions, so that high-volume partitions can be flushed 
 separately on the assumption that they will be much smaller in number, thus 
 reducing write amplification by permitting compaction independently of any 
 low-velocity data.
 Whilst the idea is reasonably straightforward, it seems that the biggest 
 problem here will be defining any success metric. Obviously any workload 
 following an exponential/zipf/extreme distribution is likely to benefit from 
 such an approach, but whether or not that would translate in real terms is 
 another matter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8411) Cassandra stress tool fails with NotStrictlyPositiveException on example profiles

2014-12-03 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232818#comment-14232818
 ] 

Benedict commented on CASSANDRA-8411:
-

Looks likely to be a trivial bug when providing n=1  with all other parameters 
default - since this is a stress tool, I don't think anybody has tried running 
it with only 1 insert before!
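
For anyone curious, the exception itself is easy to reproduce outside stress. The parameter derivation below is a guess at the mechanism rather than the actual stress code, but the point is that commons-math rejects a Gaussian whose standard deviation collapses to zero:

{code:java}
import org.apache.commons.math3.distribution.NormalDistribution;

// Reproduces the exception outside the stress tool: commons-math refuses a
// Gaussian with sd = 0, which is presumably what a single-key population
// degenerates to when the default distribution parameters are derived.
class SdZeroRepro
{
    public static void main(String[] args)
    {
        long n = 1;                        // n=1 on the stress command line
        double mean = (1 + n) / 2.0;
        double sd = (n - 1) / 6.0;         // hypothetical derivation; collapses to 0 when n=1
        new NormalDistribution(mean, sd);  // throws NotStrictlyPositiveException: standard deviation (0)
    }
}
{code}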

 Cassandra stress tool fails with NotStrictlyPositiveException on example 
 profiles
 -

 Key: CASSANDRA-8411
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8411
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
 Environment: Linux Centos
Reporter: Igor Meltser
Priority: Critical

 Trying to run stress tool with provided profile fails:
 dsc-cassandra-2.1.2/tools $ ./bin/cassandra-stress user n=1 
 profile=cqlstress-example.yaml ops\(insert=1\) -node 
 INFO  06:21:35 Using data-center name 'datacenter1' for 
 DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct 
 datacenter name with DCAwareRoundRobinPolicy constructor)
 Connected to cluster: Benchmark Cluster
 INFO  06:21:35 New Cassandra host /:9042 added
 Datatacenter: datacenter1; Host: /.; Rack: rack1
 Datatacenter: datacenter1; Host: /; Rack: rack1
 Datatacenter: datacenter1; Host: marcus14-p/; Rack: rack1
 INFO  06:21:35 New Cassandra host marcus14-p/:9042 added
 INFO  06:21:35 New Cassandra host /:9042 added
 Created schema. Sleeping 3s for propagation.
 Exception in thread main 
 org.apache.commons.math3.exception.NotStrictlyPositiveException: standard 
 deviation (0)
 at 
 org.apache.commons.math3.distribution.NormalDistribution.init(NormalDistribution.java:108)
 at 
 org.apache.cassandra.stress.settings.OptionDistribution$GaussianFactory.get(OptionDistribution.java:418)
 at 
 org.apache.cassandra.stress.generate.SeedManager.init(SeedManager.java:59)
 at 
 org.apache.cassandra.stress.settings.SettingsCommandUser.getFactory(SettingsCommandUser.java:78)
 at org.apache.cassandra.stress.StressAction.run(StressAction.java:61)
 at org.apache.cassandra.stress.Stress.main(Stress.java:109)
 The tool is 2.1.2 version, but the tested Cassandra is 2.0.8 version



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-8411) Cassandra stress tool fails with NotStrictlyPositiveException on example profiles

2014-12-03 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-8411:

Priority: Trivial  (was: Critical)

 Cassandra stress tool fails with NotStrictlyPositiveException on example 
 profiles
 -

 Key: CASSANDRA-8411
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8411
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
 Environment: Linux Centos
Reporter: Igor Meltser
Priority: Trivial

 Trying to run stress tool with provided profile fails:
 dsc-cassandra-2.1.2/tools $ ./bin/cassandra-stress user n=1 
 profile=cqlstress-example.yaml ops\(insert=1\) -node 
 INFO  06:21:35 Using data-center name 'datacenter1' for 
 DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct 
 datacenter name with DCAwareRoundRobinPolicy constructor)
 Connected to cluster: Benchmark Cluster
 INFO  06:21:35 New Cassandra host /:9042 added
 Datatacenter: datacenter1; Host: /.; Rack: rack1
 Datatacenter: datacenter1; Host: /; Rack: rack1
 Datatacenter: datacenter1; Host: marcus14-p/; Rack: rack1
 INFO  06:21:35 New Cassandra host marcus14-p/:9042 added
 INFO  06:21:35 New Cassandra host /:9042 added
 Created schema. Sleeping 3s for propagation.
 Exception in thread main 
 org.apache.commons.math3.exception.NotStrictlyPositiveException: standard 
 deviation (0)
 at 
 org.apache.commons.math3.distribution.NormalDistribution.init(NormalDistribution.java:108)
 at 
 org.apache.cassandra.stress.settings.OptionDistribution$GaussianFactory.get(OptionDistribution.java:418)
 at 
 org.apache.cassandra.stress.generate.SeedManager.init(SeedManager.java:59)
 at 
 org.apache.cassandra.stress.settings.SettingsCommandUser.getFactory(SettingsCommandUser.java:78)
 at org.apache.cassandra.stress.StressAction.run(StressAction.java:61)
 at org.apache.cassandra.stress.Stress.main(Stress.java:109)
 The tool is 2.1.2 version, but the tested Cassandra is 2.0.8 version



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

2014-12-03 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232833#comment-14232833
 ] 

Benedict commented on CASSANDRA-7438:
-

re: hash bits:

there's not really a dramatic benefit to using more than 32 bits. We will 
always use the upper bits for the segment and the lower bits for the bucket, 
for which 4B items is plenty, although we don't have proper entropy for all the 
bits; we may have only 28 bits of good collision-freeness; we will want to 
rehash the murmur hash to ensure this is spread evenly, to avoid a grow boundary 
consistently failing to reduce collisions. 

The one advantage of having some spare hash bits is that we can use these to 
avoid running a potentially expensive comparison on a large key until we have 
high confidence we've found the correct item - and as the number of unused hash 
bits for indexing dwindles, the value of this goes up. But the number of instances 
where this helps will be vanishingly small, since the head of the key will be 
on the same cache line and a hash collision and key prefix collision is pretty 
unlikely. It might be more significant if we were to use open-address hashing, 
as we would have excellent locality and reduce the number of expected cache 
misses for a lookup. But this won't be measurable above the cache serialization 
costs. We do already have these hash bits calculated in c*, typically. We also 
are unlikely to notice the overhead - allocations are likely to have ~16 bytes 
of overhead, be padded to the nearest 8 or 16 bytes, and a row has a lot of 
bumpf to encode. I doubt there will be any variation in storage costs from 
using all 64 bits.

i.e., whatever floats your boat
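
A sketch of the bit budget being discussed, with a hypothetical layout (not the patch's actual one): rehash, use the top bits for the segment, the low bits for the bucket, and keep the stored 64-bit hash around as a cheap pre-filter before the full key comparison:

{code:java}
// Hypothetical 64-bit hash layout for an off-heap hash map.
class HashBits
{
    // rehash with a murmur3-finaliser-style mix so the indexing bits are well spread
    static long rehash(long murmur)
    {
        long h = murmur;
        h ^= h >>> 33; h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33; h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    // top bits pick the segment
    static int segment(long h, int segmentBits)
    {
        return (int) (h >>> (64 - segmentBits));
    }

    // low bits pick the bucket (bucketCount is a power of two)
    static int bucket(long h, int bucketCount)
    {
        return (int) (h & (bucketCount - 1));
    }

    // spare bits: compare the full stored hash before paying for a large key comparison
    static boolean maybeSameKey(long storedHash, long probeHash)
    {
        return storedHash == probeHash;
    }
}
{code}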

 Serializing Row cache alternative (Fully off heap)
 --

 Key: CASSANDRA-7438
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
 Environment: Linux
Reporter: Vijay
Assignee: Vijay
  Labels: performance
 Fix For: 3.0

 Attachments: 0001-CASSANDRA-7438.patch, tests.zip


 Currently SerializingCache is only partially off heap; keys are still stored 
 in the JVM heap as BB. 
 * There are higher GC costs for a reasonably big cache.
 * Some users have used the row cache efficiently in production for better 
 results, but this requires careful tuning.
 * Memory overhead for the cache entries is relatively high.
 So the proposal for this ticket is to move the LRU cache logic completely off 
 heap and use JNI to interact with the cache. We might want to ensure that the 
 new implementation matches the existing APIs (ICache), and the implementation 
 needs to have safe memory access, low memory overhead and as few memcpys 
 as possible.
 We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8383) Memtable flush may expire records from the commit log that are in a later memtable

2014-12-03 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232824#comment-14232824
 ] 

Benedict commented on CASSANDRA-8383:
-

bq. Does this deserve a regression test? 

bq. We should also introduce a commit log correctness stress test, so we can 
reproduce this, be certain it is fixed, and so we can be sure to avoid this or 
similar scenarios in future.

Yes, absolutely. However I have been tasked with other pressing things - I only 
took time out to file and address this because it is an obvious and dangerous 
potential failure of correctness. We should file a follow-up ticket for 
introducing rigorous randomized testing to tease out any potential correctness 
issues from this codepath, which can either be looked at immediately by 
somebody else, or I can take a look once my current workload is dealt with. 
But doing this well requires a bit of time and focus, which I didn't want 
holding up a fix.

 Memtable flush may expire records from the commit log that are in a later 
 memtable
 --

 Key: CASSANDRA-8383
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8383
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Benedict
Assignee: Benedict
Priority: Critical
  Labels: commitlog
 Fix For: 2.1.3


 This is a pretty obvious bug given any careful thought, so I'm not sure how I 
 managed to introduce it. We use OpOrder to ensure all writes to a memtable 
 have finished before flushing; however, we also use this OpOrder to direct 
 writes to the correct memtable. However this is insufficient, since the 
 OpOrder is only a partial order; an operation from the future (i.e. for the 
 next memtable) could still interleave with the past operations in such a 
 way that they grab a CL entry in between the past operations. Since we 
 simply take the max ReplayPosition of those in the past, this would mean any 
 interleaved future operations would be expired even though they haven't been 
 persisted to disk.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (CASSANDRA-8412) Cassandra stress tool fails with NotStrictlyPositiveException on example profiles

2014-12-03 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict resolved CASSANDRA-8412.
-
Resolution: Duplicate

 Cassandra stress tool fails with NotStrictlyPositiveException on example 
 profiles
 -

 Key: CASSANDRA-8412
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8412
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
 Environment: Linux Centos
Reporter: Igor Meltser
Priority: Critical
  Labels: stress, tools

 Trying to run stress tool with provided profile fails:
 dsc-cassandra-2.1.2/tools $ ./bin/cassandra-stress user n=1 
 profile=cqlstress-example.yaml ops\(insert=1\) -node 
 INFO  06:21:35 Using data-center name 'datacenter1' for 
 DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct 
 datacenter name with DCAwareRoundRobinPolicy constructor)
 Connected to cluster: Benchmark Cluster
 INFO  06:21:35 New Cassandra host /:9042 added
 Datatacenter: datacenter1; Host: /.; Rack: rack1
 Datatacenter: datacenter1; Host: /; Rack: rack1
 Datatacenter: datacenter1; Host: ./; Rack: rack1
 INFO  06:21:35 New Cassandra host ./:9042 added
 INFO  06:21:35 New Cassandra host /:9042 added
 Created schema. Sleeping 3s for propagation.
 Exception in thread main 
 org.apache.commons.math3.exception.NotStrictlyPositiveException: standard 
 deviation (0)
 at 
 org.apache.commons.math3.distribution.NormalDistribution.init(NormalDistribution.java:108)
 at 
 org.apache.cassandra.stress.settings.OptionDistribution$GaussianFactory.get(OptionDistribution.java:418)
 at 
 org.apache.cassandra.stress.generate.SeedManager.init(SeedManager.java:59)
 at 
 org.apache.cassandra.stress.settings.SettingsCommandUser.getFactory(SettingsCommandUser.java:78)
 at org.apache.cassandra.stress.StressAction.run(StressAction.java:61)
 at org.apache.cassandra.stress.Stress.main(Stress.java:109)
 The tool is 2.1.2 version, but the tested Cassandra is 2.0.8 version



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation

2014-12-03 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232857#comment-14232857
 ] 

Benedict commented on CASSANDRA-7032:
-

Well, NetworkTopologyStrategy already enforces some degree of balance across 
racks, and absolutely guarantees balance across DCs as far as replication 
ownership is concerned. It _would_ be nice to migrate this behaviour to the 
token selection so that we could reason about ownership a bit more clearly (NTS 
might enforce our general ownership constraints, but having a predictably cheap 
generation strategy for end points would be great, as the amount of state 
necessary to route queries could shrink dramatically. if we could rely on a 
sequence of adjacent tokens ensuring these properties, for instance), but a 
simpler goal of simply ensuring that for any given arbitrary slice of the 
global token range, all nodes have a share of the range that is within epsilon 
of perfect, should be more than sufficient.

TL;DR; our goal should probably be: for any given arbitrary slice of the 
global token range, all nodes have a share of the range that is within epsilon* 
of perfect

* with epsilon probably inversely proportional to the size of the slice

 Improve vnode allocation
 

 Key: CASSANDRA-7032
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7032
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Branimir Lambov
  Labels: performance, vnodes
 Fix For: 3.0

 Attachments: TestVNodeAllocation.java, TestVNodeAllocation.java


 It's been known for a little while that random vnode allocation causes 
 hotspots of ownership. It should be possible to improve dramatically on this 
 with deterministic allocation. I have quickly thrown together a simple greedy 
 algorithm that allocates vnodes efficiently, and will repair hotspots in a 
 randomly allocated cluster gradually as more nodes are added, and also 
 ensures that token ranges are fairly evenly spread between nodes (somewhat 
 tunably so). The allocation still permits slight discrepancies in ownership, 
 but it is bound by the inverse of the size of the cluster (as opposed to 
 random allocation, which strangely gets worse as the cluster size increases). 
 I'm sure there is a decent dynamic programming solution to this that would be 
 even better.
 If on joining the ring a new node were to CAS a shared table where a 
 canonical allocation of token ranges lives after running this (or a similar) 
 algorithm, we could then get guaranteed bounds on the ownership distribution 
 in a cluster. This will also help for CASSANDRA-6696.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-7032) Improve vnode allocation

2014-12-03 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232857#comment-14232857
 ] 

Benedict edited comment on CASSANDRA-7032 at 12/3/14 10:38 AM:
---

Well, NetworkTopologyStrategy already enforces some degree of balance across 
racks, and absolutely guarantees balance across DCs as far as replication 
ownership is concerned. It _would_ be nice to migrate this behaviour to the 
token selection so that we could reason about ownership a bit more clearly (NTS 
might enforce our general ownership constraints, but having a predictably cheap 
generation strategy for end points would be great, as the amount of state 
necessary to route queries could shrink dramatically. if we could rely on a 
sequence of adjacent tokens ensuring these properties, for instance), but a 
simpler goal of simply ensuring that for any given arbitrary slice of the 
global token range, all nodes have a share of the range that is within epsilon 
of perfect, should be more than sufficient.

TL;DR; our goal should probably be: for any given arbitrary slice of the 
global token range, all nodes have a share of the range that is within epsilon* 
of perfect

\* with epsilon probably inversely proportional to the size of the slice


was (Author: benedict):
Well, NetworkTopologyStrategy already enforces some degree of balance across 
racks, and absolutely guarantees balance across DCs as far as replication 
ownership is concerned. It _would_ be nice to migrate this behaviour to the 
token selection so that we could reason about ownership a bit more clearly (NTS 
might enforce our general ownership constraints, but having a predictably cheap 
generation strategy for end points would be great, as the amount of state 
necessary to route queries could shrink dramatically. if we could rely on a 
sequence of adjacent tokens ensuring these properties, for instance), but a 
simpler goal of simply ensuring that for any given arbitrary slice of the 
global token range, all nodes have a share of the range that is within epsilon 
of perfect, should be more than sufficient.

TL;DR; our goal should probably be: for any given arbitrary slice of the 
global token range, all nodes have a share of the range that is within epsilon* 
of perfect

* with epsilon probably inversely proportional to the size of the slice

 Improve vnode allocation
 

 Key: CASSANDRA-7032
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7032
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Branimir Lambov
  Labels: performance, vnodes
 Fix For: 3.0

 Attachments: TestVNodeAllocation.java, TestVNodeAllocation.java


 It's been known for a little while that random vnode allocation causes 
 hotspots of ownership. It should be possible to improve dramatically on this 
 with deterministic allocation. I have quickly thrown together a simple greedy 
 algorithm that allocates vnodes efficiently, and will repair hotspots in a 
 randomly allocated cluster gradually as more nodes are added, and also 
 ensures that token ranges are fairly evenly spread between nodes (somewhat 
 tunably so). The allocation still permits slight discrepancies in ownership, 
 but it is bound by the inverse of the size of the cluster (as opposed to 
 random allocation, which strangely gets worse as the cluster size increases). 
 I'm sure there is a decent dynamic programming solution to this that would be 
 even better.
 If on joining the ring a new node were to CAS a shared table where a 
 canonical allocation of token ranges lives after running this (or a similar) 
 algorithm, we could then get guaranteed bounds on the ownership distribution 
 in a cluster. This will also help for CASSANDRA-6696.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-8411) Cassandra stress tool fails with NotStrictlyPositiveException on example profiles

2014-12-03 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232818#comment-14232818
 ] 

Benedict edited comment on CASSANDRA-8411 at 12/3/14 10:45 AM:
---

Looks likely to be a trivial bug when providing n=1  with all other parameters 
default - since this is a stress tool, I don't think anybody has tried running 
it with only 1 insert before!

If you want it to work in the meantime, try providing n=1000, say


was (Author: benedict):
Looks likely to be a trivial bug when providing n=1  with all other parameters 
default - since this is a stress tool, I don't think anybody has tried running 
it with only 1 insert before!

 Cassandra stress tool fails with NotStrictlyPositiveException on example 
 profiles
 -

 Key: CASSANDRA-8411
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8411
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
 Environment: Linux Centos
Reporter: Igor Meltser
Priority: Trivial

 Trying to run stress tool with provided profile fails:
 dsc-cassandra-2.1.2/tools $ ./bin/cassandra-stress user n=1 
 profile=cqlstress-example.yaml ops\(insert=1\) -node 
 INFO  06:21:35 Using data-center name 'datacenter1' for 
 DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct 
 datacenter name with DCAwareRoundRobinPolicy constructor)
 Connected to cluster: Benchmark Cluster
 INFO  06:21:35 New Cassandra host /:9042 added
 Datatacenter: datacenter1; Host: /.; Rack: rack1
 Datatacenter: datacenter1; Host: /; Rack: rack1
 Datatacenter: datacenter1; Host: /; Rack: rack1
 INFO  06:21:35 New Cassandra host /:9042 added
 INFO  06:21:35 New Cassandra host /:9042 added
 Created schema. Sleeping 3s for propagation.
 Exception in thread main 
 org.apache.commons.math3.exception.NotStrictlyPositiveException: standard 
 deviation (0)
 at 
 org.apache.commons.math3.distribution.NormalDistribution.init(NormalDistribution.java:108)
 at 
 org.apache.cassandra.stress.settings.OptionDistribution$GaussianFactory.get(OptionDistribution.java:418)
 at 
 org.apache.cassandra.stress.generate.SeedManager.init(SeedManager.java:59)
 at 
 org.apache.cassandra.stress.settings.SettingsCommandUser.getFactory(SettingsCommandUser.java:78)
 at org.apache.cassandra.stress.StressAction.run(StressAction.java:61)
 at org.apache.cassandra.stress.Stress.main(Stress.java:109)
 The tool is 2.1.2 version, but the tested Cassandra is 2.0.8 version



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-8413) Bloom filter false positive ratio is not honoured

2014-12-03 Thread Benedict (JIRA)
Benedict created CASSANDRA-8413:
---

 Summary: Bloom filter false positive ratio is not honoured
 Key: CASSANDRA-8413
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8413
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Benedict
 Fix For: 2.0.12, 2.1.3


Whilst thinking about CASSANDRA-7438 and hash bits, I realised we have a 
problem with sabotaging our bloom filters when using the murmur3 partitioner. I 
have performed a very quick test to confirm this risk is real.

Since a typical cluster uses the same murmur3 hash for partitioning as we do 
for bloom filter lookups, and we own a contiguous range, we can guarantee that 
the top X bits collide for all keys on the node. This translates into poor 
bloom filter distribution. I quickly hacked LongBloomFilterTest to simulate the 
problem, and the result in these tests is _up to_ a doubling of the actual 
false positive ratio. The actual change will depend on the key distribution, 
the number of keys, the false positive ratio, the number of nodes, the token 
distribution, etc. But it seems to be a real problem for non-vnode clusters of 
at least ~128 nodes.
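
For illustration, a sketch of one possible mitigation (not necessarily the eventual fix): re-mix the partitioner hash before deriving the bloom filter bucket indices, so the near-constant top bits of a node's contiguous token range don't skew the index distribution:

{code:java}
// Illustrative: derive k bloom filter bucket indices from a 64-bit hash.
// If "partitionerHash" is the raw token, its top bits are nearly constant for
// all keys a node owns; mixing it first restores the entropy.
class BloomIndices
{
    // the same style of mixing as murmur3's 64-bit finaliser
    static long mix(long h)
    {
        h ^= h >>> 33; h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33; h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    // standard double hashing to produce k indices
    static int[] indices(long partitionerHash, int k, int buckets)
    {
        long h1 = mix(partitionerHash);
        long h2 = mix(h1);
        int[] out = new int[k];
        for (int i = 0; i < k; i++)
            out[i] = (int) Math.floorMod(h1 + i * h2, (long) buckets);
        return out;
    }
}
{code}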



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-8413) Bloom filter false positive ratio is not honoured

2014-12-03 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-8413:

Attachment: 8413.hack.txt

 Bloom filter false positive ratio is not honoured
 -

 Key: CASSANDRA-8413
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8413
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Benedict
 Fix For: 2.0.12, 2.1.3

 Attachments: 8413.hack.txt


 Whilst thinking about CASSANDRA-7438 and hash bits, I realised we have a 
 problem with sabotaging our bloom filters when using the murmur3 partitioner. 
 I have performed a very quick test to confirm this risk is real.
 Since a typical cluster uses the same murmur3 hash for partitioning as we do 
 for bloom filter lookups, and we own a contiguous range, we can guarantee 
 that the top X bits collide for all keys on the node. This translates into 
 poor bloom filter distribution. I quickly hacked LongBloomFilterTest to 
 simulate the problem, and the result in these tests is _up to_ a doubling of 
 the actual false positive ratio. The actual change will depend on the key 
 distribution, the number of keys, the false positive ratio, the number of 
 nodes, the token distribution, etc. But it seems to be a real problem for 
 non-vnode clusters of at least ~128 nodes in size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (CASSANDRA-7882) Memtable slab allocation should scale logarithmically to improve occupancy rate

2014-12-03 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict reassigned CASSANDRA-7882:
---

Assignee: Benedict  (was: Jay Patel)

 Memtable slab allocation should scale logarithmically to improve occupancy 
 rate
 ---

 Key: CASSANDRA-7882
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7882
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Jay Patel
Assignee: Benedict
  Labels: performance
 Fix For: 2.1.3

 Attachments: trunk-7882.txt


 CASSANDRA-5935 allows option to disable region-based allocation for on-heap 
 memtables but there is no option to disable it for off-heap memtables 
 (memtable_allocation_type: offheap_objects). 
 Disabling region-based allocation will allow us to pack more tables in the 
 schema since minimum of 1MB region won't be allocated per table. Downside can 
 be more fragmentation which should be controllable by using better allocator 
 like JEMalloc.
 How about below option in yaml?:
 memtable_allocation_type: unslabbed_offheap_objects
 Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7882) Memtable slab allocation should scale logarithmically to improve occupancy rate

2014-12-03 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233170#comment-14233170
 ] 

Benedict commented on CASSANDRA-7882:
-

I've posted a variant of the patch 
[here|https://github.com/belliottsmith/cassandra/tree/7882-nativeallocator]

There are a few changes, a couple of them unrelated, just cleaning up the class: 

# removed the unslabbed and regionCount variables, as they weren't used for 
anything important
# removed the nextRegionSize variable: it wasn't being maintained atomically, 
but just as importantly it's messy to do it separately:
#* instead of setting a full region to null, we swap it straight to a new 
region, using the prior region to determine the size of the new region
#* we ensure the new region size is at least large enough to hold the 
allocation we're inserting
# we cap the size of each race-allocated queue at 8 entries, as this should permit plenty of leeway for avoiding heavy competition thrashing the allocator, but not so much that we have a lot of mostly unused memory
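
Roughly, the sizing policy this amounts to - double the previous region up to the 1MB cap, but never smaller than the allocation being served - could be sketched as (illustrative names, not the NativeAllocator code):

{code}
public final class RegionSizing
{
    static final int MIN_REGION_SIZE = 8 * 1024;     // assumed starting size
    static final int MAX_REGION_SIZE = 1024 * 1024;  // 1MB cap, matching the current slab size

    /** Size of the region that replaces a just-filled one. */
    static int nextRegionSize(int previousRegionSize, int pendingAllocationSize)
    {
        int doubled = Math.min(MAX_REGION_SIZE, previousRegionSize * 2);
        // ensure the new region is at least large enough to hold the allocation we're serving
        return Math.max(doubled, pendingAllocationSize);
    }

    public static void main(String[] args)
    {
        int size = MIN_REGION_SIZE;
        for (int i = 0; i < 10; i++)
        {
            System.out.println(size);            // 8KB, 16KB, ... capped at 1MB
            size = nextRegionSize(size, 100);
        }
    }
}
{code}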

There is one question, though, which is whether this should make it into 2.1 or wait until 3.0. I'm pretty comfortable either way, but my gut feeling is others will prefer that it wait until 3.0. [~jbellis], what's your view?

 Memtable slab allocation should scale logarithmically to improve occupancy 
 rate
 ---

 Key: CASSANDRA-7882
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7882
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Jay Patel
Assignee: Jay Patel
  Labels: performance
 Fix For: 2.1.3

 Attachments: trunk-7882.txt


 CASSANDRA-5935 allows option to disable region-based allocation for on-heap 
 memtables but there is no option to disable it for off-heap memtables 
 (memtable_allocation_type: offheap_objects). 
 Disabling region-based allocation will allow us to pack more tables in the 
 schema since minimum of 1MB region won't be allocated per table. Downside can 
 be more fragmentation which should be controllable by using better allocator 
 like JEMalloc.
 How about below option in yaml?:
 memtable_allocation_type: unslabbed_offheap_objects
 Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-7882) Memtable slab allocation should scale logarithmically to improve occupancy rate

2014-12-03 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-7882:

Reviewer: Jay Patel  (was: Benedict)

 Memtable slab allocation should scale logarithmically to improve occupancy 
 rate
 ---

 Key: CASSANDRA-7882
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7882
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Jay Patel
Assignee: Benedict
  Labels: performance
 Fix For: 2.1.3

 Attachments: trunk-7882.txt


 CASSANDRA-5935 allows option to disable region-based allocation for on-heap 
 memtables but there is no option to disable it for off-heap memtables 
 (memtable_allocation_type: offheap_objects). 
 Disabling region-based allocation will allow us to pack more tables in the 
 schema since minimum of 1MB region won't be allocated per table. Downside can 
 be more fragmentation which should be controllable by using better allocator 
 like JEMalloc.
 How about below option in yaml?:
 memtable_allocation_type: unslabbed_offheap_objects
 Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-8414) Avoid loops over array backed iterators that call iter.remove()

2014-12-03 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-8414:

Summary: Avoid loops over array backed iterators that call iter.remove()  
(was: Compaction is O(n^2) when deleting lots of tombstones)

 Avoid loops over array backed iterators that call iter.remove()
 ---

 Key: CASSANDRA-8414
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8414
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Richard Low

 I noticed from sampling that sometimes compaction spends almost all of its 
 time in iter.remove() in ColumnFamilyStore.removeDeletedStandard. It turns 
 out that the cf object is using ArrayBackedSortedColumns, so deletes are from 
 an ArrayList. If the majority of your columns are GCable tombstones then this 
 is O(n^2). The data structure should be changed or a copy made to avoid this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8414) Avoid loops over array backed iterators that call iter.remove()

2014-12-03 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233485#comment-14233485
 ] 

Benedict commented on CASSANDRA-8414:
-

I've edited the title because it's not quite that compaction is O(n^2), but 
that certain operations within a partition are. It's also not limited to just 
that specific method. The best solution is probably to introduce a special 
deletion iterator on which a call to remove() simply sets a corresponding bit 
to 1; once we exhaust the iterator we commit the deletes in one pass.
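
A minimal sketch of what such an iterator could look like over an array-backed list (illustrative only; the real change would live in ArrayBackedSortedColumns):

{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.Iterator;
import java.util.List;

/** Iterator whose remove() only marks a bit; deletions are applied in one pass afterwards. */
public class MarkingRemovalIterator<T> implements Iterator<T>
{
    private final List<T> list;
    private final BitSet removed = new BitSet();
    private int next = 0;

    public MarkingRemovalIterator(List<T> list) { this.list = list; }

    public boolean hasNext() { return next < list.size(); }
    public T next()          { return list.get(next++); }
    public void remove()     { removed.set(next - 1); }    // O(1): no shifting of the backing array

    /** Compacts the backing list in a single O(n) pass once iteration is done. */
    public void commit()
    {
        int write = 0;
        for (int read = 0; read < list.size(); read++)
            if (!removed.get(read))
                list.set(write++, list.get(read));
        while (list.size() > write)
            list.remove(list.size() - 1);                   // trim from the tail, O(1) each
    }

    public static void main(String[] args)
    {
        List<Integer> cells = new ArrayList<>(Arrays.asList(1, 2, 3, 4, 5, 6));
        MarkingRemovalIterator<Integer> iter = new MarkingRemovalIterator<>(cells);
        while (iter.hasNext())
            if (iter.next() % 2 == 0)
                iter.remove();         // marks only
        iter.commit();                 // single pass over the list
        System.out.println(cells);     // [1, 3, 5]
    }
}
{code}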

 Avoid loops over array backed iterators that call iter.remove()
 ---

 Key: CASSANDRA-8414
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8414
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Richard Low

 I noticed from sampling that sometimes compaction spends almost all of its 
 time in iter.remove() in ColumnFamilyStore.removeDeletedStandard. It turns 
 out that the cf object is using ArrayBackedSortedColumns, so deletes are from 
 an ArrayList. If the majority of your columns are GCable tombstones then this 
 is O(n^2). The data structure should be changed or a copy made to avoid this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-8414) Avoid loops over array backed iterators that call iter.remove()

2014-12-03 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-8414:

Labels: performance  (was: )

 Avoid loops over array backed iterators that call iter.remove()
 ---

 Key: CASSANDRA-8414
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8414
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Richard Low
  Labels: performance

 I noticed from sampling that sometimes compaction spends almost all of its 
 time in iter.remove() in ColumnFamilyStore.removeDeletedStandard. It turns 
 out that the cf object is using ArrayBackedSortedColumns, so deletes are from 
 an ArrayList. If the majority of your columns are GCable tombstones then this 
 is O(n^2). The data structure should be changed or a copy made to avoid this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-8414) Avoid loops over array backed iterators that call iter.remove()

2014-12-03 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-8414:

Fix Version/s: 2.1.3

 Avoid loops over array backed iterators that call iter.remove()
 ---

 Key: CASSANDRA-8414
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8414
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Richard Low
  Labels: performance
 Fix For: 2.1.3


 I noticed from sampling that sometimes compaction spends almost all of its 
 time in iter.remove() in ColumnFamilyStore.removeDeletedStandard. It turns 
 out that the cf object is using ArrayBackedSortedColumns, so deletes are from 
 an ArrayList. If the majority of your columns are GCable tombstones then this 
 is O(n^2). The data structure should be changed or a copy made to avoid this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-8431) Stress should validate the results of queries in user profile mode

2014-12-05 Thread Benedict (JIRA)
Benedict created CASSANDRA-8431:
---

 Summary: Stress should validate the results of queries in user 
profile mode
 Key: CASSANDRA-8431
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8431
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Benedict


CASSANDRA-8429 was exposed by the validation logic in stress. However, at the moment the new-fangled profile-driven user mode doesn't perform any validation, so as we default more and more to the new approach we will be less and less likely to spot correctness issues.

Introducing validation logic here could be tricky, since we can support arbitrary user queries. However, we could support a query mode where only the columns and the number of cql rows to fetch are defined, for which we could calculate the exact result set we expect. There would be complications with insertions that proceed out-of-order, but we could either not support this mode, or have a validation mode that just ensures that a superset of the data we know to have been inserted is returned.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-8434) L0 should have a separate configurable bloom filter false positive ratio

2014-12-06 Thread Benedict (JIRA)
Benedict created CASSANDRA-8434:
---

 Summary: L0 should have a separate configurable bloom filter false 
positive ratio
 Key: CASSANDRA-8434
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8434
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
 Fix For: 2.0.12, 2.1.3


In follow-up to CASSANDRA-5371: we now perform size-tiered file selection for compaction if L0 gets too far behind, however as far as I can tell we stick with the CF-configured false positive ratio, likely substantially inflating the number of files we visit on average until L0 is cleaned up. Having a different bf fp ratio for L0 would solve this problem without introducing any significant burden when L0 is not overloaded.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8409) Node generating a huge number of tiny sstable_activity flushes

2014-12-08 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237708#comment-14237708
 ] 

Benedict commented on CASSANDRA-8409:
-

The full system log would be helpful for diagnosis. However, if there are a lot of competing updates to a single partition (e.g. lots of non-batch inserts into a single partition key) then it's possible CASSANDRA-8018 could have triggered this. By applying the update function twice, we would screw up our memory count if the update fails to apply due to competition. If this happened often enough we could get to a situation where the cleaner is incapable of generating a task that will clean enough memory, so it tries to flush on every allocation.
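
To illustrate the accounting hazard in a toy model (not the actual 8018 code): the counter must only move when the update actually takes effect, otherwise a retried or doubly-applied update function makes it drift:

{code}
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

public class UpdateAccountingSketch
{
    static final AtomicReference<String> partition = new AtomicReference<>("v0");
    static final AtomicLong accountedBytes = new AtomicLong();

    static void updateCounted(String newValue)
    {
        String prev;
        do
        {
            prev = partition.get();
            // WRONG would be to bump accountedBytes here: under competition this
            // body can run more than once for a single successful update
        }
        while (!partition.compareAndSet(prev, newValue));
        accountedBytes.addAndGet(newValue.length());   // count only after the CAS succeeds
    }

    public static void main(String[] args)
    {
        updateCounted("value-1");
        System.out.println(accountedBytes.get());      // 7, counted exactly once
    }
}
{code}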

 Node generating a huge number of tiny sstable_activity flushes
 --

 Key: CASSANDRA-8409
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8409
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: Cassandra 2.1.0, Oracle JDK 1.8.0_25, Ubuntu 12.04
Reporter: Fred Wulff
 Fix For: 2.1.3

 Attachments: system-sstable_activity-ka-67802-Data.db


 On one of my nodes, I’m seeing hundreds per second of “INFO  21:28:05 
 Enqueuing flush of sstable_activity: 0 (0%) on-heap, 33 (0%) off-heap”. 
 tpstats shows a steadily climbing # of pending 
 MemtableFlushWriter/MemtablePostFlush until the node OOMs. When the flushes 
 actually happen the sstable written is invariably 121 bytes. I’m writing 
 pretty aggressively to one of my user tables (sev.mdb_group_pit), but that 
 table's flushing behavior seems reasonable.
 tpstats:
 {quote}
 frew@hostname:~/s_dist/apache-cassandra-2.1.0$ bin/nodetool -h hostname 
 tpstats
 Pool NameActive   Pending  Completed   Blocked  All 
 time blocked
 MutationStage   128  4429  36810 0
  0
 ReadStage 0 0   1205 0
  0
 RequestResponseStage  0 0  24910 0
  0
 ReadRepairStage   0 0 26 0
  0
 CounterMutationStage  0 0  0 0
  0
 MiscStage 0 0  0 0
  0
 HintedHandoff 2 2  9 0
  0
 GossipStage   0 0   5157 0
  0
 CacheCleanupExecutor  0 0  0 0
  0
 InternalResponseStage 0 0  0 0
  0
 CommitLogArchiver 0 0  0 0
  0
 CompactionExecutor428429 0
  0
 ValidationExecutor0 0  0 0
  0
 MigrationStage0 0  0 0
  0
 AntiEntropyStage  0 0  0 0
  0
 PendingRangeCalculator0 0 11 0
  0
 MemtableFlushWriter   8 38644   8987 0
  0
 MemtablePostFlush 1 38940   8735 0
  0
 MemtableReclaimMemory 0 0   8987 0
  0
 Message type   Dropped
 READ 0
 RANGE_SLICE  0
 _TRACE   0
 MUTATION 10457
 COUNTER_MUTATION 0
 BINARY   0
 REQUEST_RESPONSE 0
 PAGED_RANGE  0
 READ_REPAIR208
 {quote}
 I've attached one of the produced sstables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7873) Replace AbstractRowResolver.replies with collection with tailored properties

2014-12-08 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237740#comment-14237740
 ] 

Benedict commented on CASSANDRA-7873:
-

mea culpa. thanks

 Replace AbstractRowResolver.replies with collection with tailored properties
 

 Key: CASSANDRA-7873
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7873
 Project: Cassandra
  Issue Type: Bug
 Environment: OSX and Ubuntu 14.04
Reporter: Philip Thompson
Assignee: Benedict
 Fix For: 3.0

 Attachments: 7873.21.txt, 7873.trunk.txt, 7873.txt, 7873_fixup.txt


 The dtest auth_test.py:TestAuth.system_auth_ks_is_alterable_test is failing 
 on trunk only with the following stack trace:
 {code}
 Unexpected error in node1 node log:
 ERROR [Thrift:1] 2014-09-03 15:48:08,389 CustomTThreadPoolServer.java:219 - 
 Error occurred during processing of message.
 java.util.ConcurrentModificationException: null
   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:859) 
 ~[na:1.7.0_65]
   at java.util.ArrayList$Itr.next(ArrayList.java:831) ~[na:1.7.0_65]
   at 
 org.apache.cassandra.service.RowDigestResolver.resolve(RowDigestResolver.java:71)
  ~[main/:na]
   at 
 org.apache.cassandra.service.RowDigestResolver.resolve(RowDigestResolver.java:28)
  ~[main/:na]
   at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:110) 
 ~[main/:na]
   at 
 org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:144)
  ~[main/:na]
   at 
 org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1228) 
 ~[main/:na]
   at 
 org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1154) 
 ~[main/:na]
   at 
 org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:256)
  ~[main/:na]
   at 
 org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:212)
  ~[main/:na]
   at org.apache.cassandra.auth.Auth.selectUser(Auth.java:257) ~[main/:na]
   at org.apache.cassandra.auth.Auth.isExistingUser(Auth.java:76) 
 ~[main/:na]
   at org.apache.cassandra.service.ClientState.login(ClientState.java:178) 
 ~[main/:na]
   at 
 org.apache.cassandra.thrift.CassandraServer.login(CassandraServer.java:1486) 
 ~[main/:na]
   at 
 org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3579)
  ~[thrift/:na]
   at 
 org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3563)
  ~[thrift/:na]
   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) 
 ~[libthrift-0.9.1.jar:0.9.1]
   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) 
 ~[libthrift-0.9.1.jar:0.9.1]
   at 
 org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:201)
  ~[main/:na]
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  [na:1.7.0_65]
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  [na:1.7.0_65]
   at java.lang.Thread.run(Thread.java:745) [na:1.7.0_65]
 {code}
 That exception is thrown when the following query is sent:
 {code}
 SELECT strategy_options
   FROM system.schema_keyspaces
   WHERE keyspace_name = 'system_auth'
 {code}
 The test alters the RF of the system_auth keyspace, then shuts down and 
 restarts the cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8308) Windows: Commitlog access violations on unit tests

2014-12-08 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238128#comment-14238128
 ] 

Benedict commented on CASSANDRA-8308:
-

Sure. Will review tomorrow.

 Windows: Commitlog access violations on unit tests
 --

 Key: CASSANDRA-8308
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8308
 Project: Cassandra
  Issue Type: Bug
Reporter: Joshua McKenzie
Assignee: Joshua McKenzie
Priority: Minor
  Labels: Windows
 Fix For: 3.0

 Attachments: 8308_v1.txt


 We have four unit tests failing on trunk on Windows, all with 
 FileSystemExceptions related to the SchemaLoader:
 {noformat}
 [junit] Test 
 org.apache.cassandra.db.compaction.DateTieredCompactionStrategyTest FAILED
 [junit] Test org.apache.cassandra.cql3.ThriftCompatibilityTest FAILED
 [junit] Test org.apache.cassandra.io.sstable.SSTableRewriterTest FAILED
 [junit] Test org.apache.cassandra.repair.LocalSyncTaskTest FAILED
 {noformat}
 Example error:
 {noformat}
 [junit] Caused by: java.nio.file.FileSystemException: 
 build\test\cassandra\commitlog;0\CommitLog-5-1415908745965.log: The process 
 cannot access the file because it is being used by another process.
 [junit]
 [junit] at 
 sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
 [junit] at 
 sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
 [junit] at 
 sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
 [junit] at 
 sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269)
 [junit] at 
 sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
 [junit] at java.nio.file.Files.delete(Files.java:1079)
 [junit] at 
 org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:125)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-8308) Windows: Commitlog access violations on unit tests

2014-12-08 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-8308:

Reviewer: Benedict

 Windows: Commitlog access violations on unit tests
 --

 Key: CASSANDRA-8308
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8308
 Project: Cassandra
  Issue Type: Bug
Reporter: Joshua McKenzie
Assignee: Joshua McKenzie
Priority: Minor
  Labels: Windows
 Fix For: 3.0

 Attachments: 8308_v1.txt


 We have four unit tests failing on trunk on Windows, all with 
 FileSystemExceptions related to the SchemaLoader:
 {noformat}
 [junit] Test 
 org.apache.cassandra.db.compaction.DateTieredCompactionStrategyTest FAILED
 [junit] Test org.apache.cassandra.cql3.ThriftCompatibilityTest FAILED
 [junit] Test org.apache.cassandra.io.sstable.SSTableRewriterTest FAILED
 [junit] Test org.apache.cassandra.repair.LocalSyncTaskTest FAILED
 {noformat}
 Example error:
 {noformat}
 [junit] Caused by: java.nio.file.FileSystemException: 
 build\test\cassandra\commitlog;0\CommitLog-5-1415908745965.log: The process 
 cannot access the file because it is being used by another process.
 [junit]
 [junit] at 
 sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
 [junit] at 
 sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
 [junit] at 
 sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
 [junit] at 
 sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269)
 [junit] at 
 sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
 [junit] at java.nio.file.Files.delete(Files.java:1079)
 [junit] at 
 org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:125)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8308) Windows: Commitlog access violations on unit tests

2014-12-09 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14239291#comment-14239291
 ] 

Benedict commented on CASSANDRA-8308:
-

* channel.truncate() is not equivalent to raf.setLength(), and we want the length to be set upfront to somewhat ensure contiguity (see the sketch below)
* it would be nice to extract the "is linux" decision to an enum and embed it in FBUtilities, where we already have an isUnix() method (and an OPERATING_SYSTEM property that could be converted to the enum)
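
On the first point, setLength() grows the file to the full segment size up front, whereas FileChannel.truncate() never grows a file (it is a no-op when the requested size is at least the current size). A hedged sketch of the preallocation behaviour we want to keep:

{code}
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class SegmentPreallocation
{
    // Illustrative only: preallocate a segment to its full size up front so its
    // blocks are more likely to be laid out contiguously; truncate() cannot do this,
    // since it never extends a file.
    static RandomAccessFile openSegment(File path, long segmentSize) throws IOException
    {
        RandomAccessFile raf = new RandomAccessFile(path, "rw");
        raf.setLength(segmentSize);   // extends the file to segmentSize immediately
        return raf;
    }

    public static void main(String[] args) throws IOException
    {
        File f = File.createTempFile("CommitLog-example", ".log");
        try (RandomAccessFile raf = openSegment(f, 32 * 1024 * 1024))
        {
            System.out.println("preallocated " + raf.length() + " bytes");
        }
        f.delete();
    }
}
{code}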

 Windows: Commitlog access violations on unit tests
 --

 Key: CASSANDRA-8308
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8308
 Project: Cassandra
  Issue Type: Bug
Reporter: Joshua McKenzie
Assignee: Joshua McKenzie
Priority: Minor
  Labels: Windows
 Fix For: 3.0

 Attachments: 8308_v1.txt


 We have four unit tests failing on trunk on Windows, all with 
 FileSystemExceptions related to the SchemaLoader:
 {noformat}
 [junit] Test 
 org.apache.cassandra.db.compaction.DateTieredCompactionStrategyTest FAILED
 [junit] Test org.apache.cassandra.cql3.ThriftCompatibilityTest FAILED
 [junit] Test org.apache.cassandra.io.sstable.SSTableRewriterTest FAILED
 [junit] Test org.apache.cassandra.repair.LocalSyncTaskTest FAILED
 {noformat}
 Example error:
 {noformat}
 [junit] Caused by: java.nio.file.FileSystemException: 
 build\test\cassandra\commitlog;0\CommitLog-5-1415908745965.log: The process 
 cannot access the file because it is being used by another process.
 [junit]
 [junit] at 
 sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
 [junit] at 
 sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
 [junit] at 
 sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
 [junit] at 
 sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269)
 [junit] at 
 sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
 [junit] at java.nio.file.Files.delete(Files.java:1079)
 [junit] at 
 org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:125)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-6993) Windows: remove mmap'ed I/O for index files and force standard file access

2014-12-09 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14239307#comment-14239307
 ] 

Benedict commented on CASSANDRA-6993:
-

Replacing isUnix() with !isWindows() is not functionally equivalent; it will capture Mac, Solaris, OpenBSD, FreeBSD and others as well. Although in many situations this actually adequately captures what we want (such as for your specific change), it likely won't in all cases. 

As with CASSANDRA-8038, this would benefit from sanitising our OS detection. Perhaps we could split this out into a minor ticket these both depend upon, as we have a bit of a mess right now that permits these sorts of logical mismatches to crop up. We should probably group POSIX-compliant OSes together, and POSIX-compliant file systems together, one of which is probably what we generally mean when we say isUnix().
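
One hypothetical shape for such an enum (illustrative names, not an existing FBUtilities API):

{code}
import java.util.Locale;

/** Illustrative OS classification; real detection would live in FBUtilities. */
public enum OperatingSystem
{
    LINUX, MAC, WINDOWS, OTHER_POSIX, UNKNOWN;

    public static final OperatingSystem CURRENT = detect(System.getProperty("os.name", ""));

    static OperatingSystem detect(String osName)
    {
        String os = osName.toLowerCase(Locale.US);
        if (os.contains("linux"))                        return LINUX;
        if (os.contains("mac") || os.contains("darwin")) return MAC;
        if (os.contains("windows"))                      return WINDOWS;
        if (os.contains("bsd") || os.contains("sunos") || os.contains("aix")) return OTHER_POSIX;
        return UNKNOWN;
    }

    /** Roughly what isUnix() usually means: a POSIX-ish OS and filesystem. */
    public boolean isPosixCompliant()
    {
        return this == LINUX || this == MAC || this == OTHER_POSIX;
    }

    public static void main(String[] args)
    {
        System.out.println(CURRENT + ", posix=" + CURRENT.isPosixCompliant());
    }
}
{code}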


 Windows: remove mmap'ed I/O for index files and force standard file access
 --

 Key: CASSANDRA-6993
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6993
 Project: Cassandra
  Issue Type: Improvement
Reporter: Joshua McKenzie
Assignee: Joshua McKenzie
Priority: Minor
  Labels: Windows
 Fix For: 3.0, 2.1.3

 Attachments: 6993_2.1_v1.txt, 6993_v1.txt, 6993_v2.txt


 Memory-mapped I/O on Windows causes issues with hard-links; we're unable to 
 delete hard-links to open files with memory-mapped segments even using nio.  
 We'll need to push for close to performance parity between mmap'ed I/O and 
 buffered going forward as the buffered / compressed path offers other 
 benefits.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8414) Avoid loops over array backed iterators that call iter.remove()

2014-12-09 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14239312#comment-14239312
 ] 

Benedict commented on CASSANDRA-8414:
-

We should integrate this for 2.1 also, since this behaviour is still exhibited there, just not in compaction. In 2.1 we should use System.arraycopy and removed.nextSetBit though, as the performance will be better, particularly for sparse removes.

 Avoid loops over array backed iterators that call iter.remove()
 ---

 Key: CASSANDRA-8414
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8414
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Richard Low
Assignee: Jimmy Mårdell
  Labels: performance
 Fix For: 2.1.3

 Attachments: cassandra-2.0-8414-1.txt


 I noticed from sampling that sometimes compaction spends almost all of its 
 time in iter.remove() in ColumnFamilyStore.removeDeletedStandard. It turns 
 out that the cf object is using ArrayBackedSortedColumns, so deletes are from 
 an ArrayList. If the majority of your columns are GCable tombstones then this 
 is O(n^2). The data structure should be changed or a copy made to avoid this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-7873) Replace AbstractRowResolver.replies with collection with tailored properties

2014-12-09 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237740#comment-14237740
 ] 

Benedict edited comment on CASSANDRA-7873 at 12/9/14 11:43 AM:
---

mea culpa. thanks

+1


was (Author: benedict):
mea culpa. thanks

 Replace AbstractRowResolver.replies with collection with tailored properties
 

 Key: CASSANDRA-7873
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7873
 Project: Cassandra
  Issue Type: Bug
 Environment: OSX and Ubuntu 14.04
Reporter: Philip Thompson
Assignee: Benedict
 Fix For: 3.0

 Attachments: 7873.21.txt, 7873.trunk.txt, 7873.txt, 7873_fixup.txt


 The dtest auth_test.py:TestAuth.system_auth_ks_is_alterable_test is failing 
 on trunk only with the following stack trace:
 {code}
 Unexpected error in node1 node log:
 ERROR [Thrift:1] 2014-09-03 15:48:08,389 CustomTThreadPoolServer.java:219 - 
 Error occurred during processing of message.
 java.util.ConcurrentModificationException: null
   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:859) 
 ~[na:1.7.0_65]
   at java.util.ArrayList$Itr.next(ArrayList.java:831) ~[na:1.7.0_65]
   at 
 org.apache.cassandra.service.RowDigestResolver.resolve(RowDigestResolver.java:71)
  ~[main/:na]
   at 
 org.apache.cassandra.service.RowDigestResolver.resolve(RowDigestResolver.java:28)
  ~[main/:na]
   at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:110) 
 ~[main/:na]
   at 
 org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:144)
  ~[main/:na]
   at 
 org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1228) 
 ~[main/:na]
   at 
 org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1154) 
 ~[main/:na]
   at 
 org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:256)
  ~[main/:na]
   at 
 org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:212)
  ~[main/:na]
   at org.apache.cassandra.auth.Auth.selectUser(Auth.java:257) ~[main/:na]
   at org.apache.cassandra.auth.Auth.isExistingUser(Auth.java:76) 
 ~[main/:na]
   at org.apache.cassandra.service.ClientState.login(ClientState.java:178) 
 ~[main/:na]
   at 
 org.apache.cassandra.thrift.CassandraServer.login(CassandraServer.java:1486) 
 ~[main/:na]
   at 
 org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3579)
  ~[thrift/:na]
   at 
 org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3563)
  ~[thrift/:na]
   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) 
 ~[libthrift-0.9.1.jar:0.9.1]
   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) 
 ~[libthrift-0.9.1.jar:0.9.1]
   at 
 org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:201)
  ~[main/:na]
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  [na:1.7.0_65]
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  [na:1.7.0_65]
   at java.lang.Thread.run(Thread.java:745) [na:1.7.0_65]
 {code}
 That exception is thrown when the following query is sent:
 {code}
 SELECT strategy_options
   FROM system.schema_keyspaces
   WHERE keyspace_name = 'system_auth'
 {code}
 The test alters the RF of the system_auth keyspace, then shuts down and 
 restarts the cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (CASSANDRA-8312) Use live sstables in snapshot repair if possible

2014-12-09 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict reopened CASSANDRA-8312:
-

It looks to me like this doesn't tidy up after itself properly, at least on 
trunk. It opens an sstable from the snapshot if necessary, references it, and 
then releases only the reference it acquired - not the extra reference that 
would permit its BF etc. to be reclaimed. So this will likely leak significant 
amounts of memory.

 Use live sstables in snapshot repair if possible
 

 Key: CASSANDRA-8312
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8312
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jimmy Mårdell
Assignee: Jimmy Mårdell
Priority: Minor
 Fix For: 2.0.12, 3.0, 2.1.3

 Attachments: cassandra-2.0-8312-1.txt


 Snapshot repair can be very much slower than parallel repairs because of the 
 overhead of opening the SSTables in the snapshot. This is particularly true 
 when using LCS, as you typically have many smaller SSTables then.
 I compared parallel and sequential repair on a small range on one of our 
 clusters (2*3 replicas). With parallel repair, this took 22 seconds. With 
 sequential repair (default in 2.0), the same range took 330 seconds! This is 
 an overhead of 330-22*6 = 198 seconds, just opening SSTables (there were 
 1000+ sstables). Also, opening 1000 sstables for many smaller ranges surely 
 causes lots of memory churning.
 The idea would be to list the sstables in the snapshot, but use the 
 corresponding sstables in the live set if it's still available. For almost 
 all sstables, the original one should still exist.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7705) Safer Resource Management

2014-12-09 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14239408#comment-14239408
 ] 

Benedict commented on CASSANDRA-7705:
-

I have updated the repository with a rebased version, with some improved 
comments and a debug mode. 

This is essentially free given Java's object alignment behaviour and run-time optimisation (the field doesn't occupy any memory we wouldn't otherwise be occupying, and the relevant statements will be optimised away).
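
For readers following along, a stripped-down sketch of the single-release handle described in this ticket (illustrative; the actual implementation additionally carries a debug mode and richer shared state):

{code}
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class RefSketch
{
    /** Shared state: the underlying reference count plus the cleanup to run at zero. */
    static final class Shared
    {
        final AtomicInteger count = new AtomicInteger(1);
        final Runnable onFullyReleased;
        Shared(Runnable onFullyReleased) { this.onFullyReleased = onFullyReleased; }
    }

    /** A handle that can be released exactly once; a second release fails loudly, immediately. */
    static final class Ref
    {
        private final Shared shared;
        private final AtomicBoolean released = new AtomicBoolean(false);

        Ref(Shared shared) { this.shared = shared; }

        Ref ref()
        {
            shared.count.incrementAndGet();
            return new Ref(shared);
        }

        void release()
        {
            if (!released.compareAndSet(false, true))
                throw new IllegalStateException("double release: the bug is at this call site");
            if (shared.count.decrementAndGet() == 0)
                shared.onFullyReleased.run();
        }
    }

    public static void main(String[] args)
    {
        Ref ref = new Ref(new Shared(() -> System.out.println("resource reclaimed")));
        Ref extra = ref.ref();
        extra.release();
        ref.release();   // last handle released -> "resource reclaimed"
        try { ref.release(); } catch (IllegalStateException e) { System.out.println(e.getMessage()); }
    }
}
{code}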

 Safer Resource Management
 -

 Key: CASSANDRA-7705
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7705
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Benedict
 Fix For: 3.0


 We've had a spate of bugs recently with bad reference counting. these can 
 have potentially dire consequences, generally either randomly deleting data 
 or giving us infinite loops. 
 Since in 2.1 we only reference count resources that are relatively expensive 
 and infrequently managed (or in places where this safety is probably not as 
 necessary, e.g. SerializingCache), we could without any negative consequences 
 (and only slight code complexity) introduce a safer resource management 
 scheme for these more expensive/infrequent actions.
 Basically, I propose when we want to acquire a resource we allocate an object 
 that manages the reference. This can only be released once; if it is released 
 twice, we fail immediately at the second release, reporting where the bug is 
 (rather than letting it continue fine until the next correct release corrupts 
 the count). The reference counter remains the same, but we obtain guarantees 
 that the reference count itself is never badly maintained, although code 
 using it could mistakenly release its own handle early (typically this is 
 only an issue when cleaning up after a failure, in which case under the new 
 scheme this would be an innocuous error)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-8414) Avoid loops over array backed iterators that call iter.remove()

2014-12-09 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-8414:

Attachment: cassandra-2.0-8414-3.txt

Nice backporting of the better approach.

I've uploaded a tweaked version, the goal of which was just to clean up the 
variable names (and switch to a while loop) so it's more obvious what's 
happening. But while at it I also added use of nextClearBit in tandem with 
nextSetBit, as it's a minor tweak but gives better behaviour with runs of 
adjacent removes.

I haven't properly reviewed otherwise, but it might be worth introducing this to CFS.removeDroppedColumns() and SliceQueryFilter.trim().
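
For illustration, the single-pass commit being described - nextClearBit/nextSetBit to find runs of survivors, System.arraycopy to move each run - might look roughly like this (a sketch, not the attached patch):

{code}
import java.util.BitSet;

public class BatchRemoveSketch
{
    /**
     * Compacts 'cells' by dropping the indices marked in 'removed', copying each run of
     * surviving entries with System.arraycopy. nextSetBit/nextClearBit let runs of adjacent
     * removals (or survivors) be handled in one step. Returns the new logical size.
     */
    static int commitRemovals(Object[] cells, int size, BitSet removed)
    {
        int write = removed.nextSetBit(0);
        if (write < 0)
            return size;                                   // nothing marked for removal
        int read = write;
        while (read < size)
        {
            int runStart = removed.nextClearBit(read);     // first survivor after this removal
            int runEnd = removed.nextSetBit(runStart);     // next removal (or -1 for none)
            if (runEnd < 0 || runEnd > size)
                runEnd = size;
            int runLength = runEnd - runStart;
            if (runLength > 0)
            {
                System.arraycopy(cells, runStart, cells, write, runLength);
                write += runLength;
            }
            read = runEnd;
        }
        return write;
    }

    public static void main(String[] args)
    {
        Object[] cells = { "a", "b", "c", "d", "e", "f" };
        BitSet removed = new BitSet();
        removed.set(1);        // "b"
        removed.set(3, 5);     // "d", "e"
        int newSize = commitRemovals(cells, cells.length, removed);
        for (int i = 0; i < newSize; i++)
            System.out.print(cells[i] + " ");              // a c f
        System.out.println();
    }
}
{code}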

 Avoid loops over array backed iterators that call iter.remove()
 ---

 Key: CASSANDRA-8414
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8414
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Richard Low
Assignee: Jimmy Mårdell
  Labels: performance
 Fix For: 2.1.3

 Attachments: cassandra-2.0-8414-1.txt, cassandra-2.0-8414-2.txt, 
 cassandra-2.0-8414-3.txt


 I noticed from sampling that sometimes compaction spends almost all of its 
 time in iter.remove() in ColumnFamilyStore.removeDeletedStandard. It turns 
 out that the cf object is using ArrayBackedSortedColumns, so deletes are from 
 an ArrayList. If the majority of your columns are GCable tombstones then this 
 is O(n^2). The data structure should be changed or a copy made to avoid this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8312) Use live sstables in snapshot repair if possible

2014-12-09 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14239638#comment-14239638
 ] 

Benedict commented on CASSANDRA-8312:
-

bq.  it should be enough to just remove the row sstable.acquireReference()

Yes, agreed. But I'll let Yuki review and make that change since he's more 
familiar with this area of the codebase.

 Use live sstables in snapshot repair if possible
 

 Key: CASSANDRA-8312
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8312
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jimmy Mårdell
Assignee: Jimmy Mårdell
Priority: Minor
 Fix For: 2.0.12, 3.0, 2.1.3

 Attachments: cassandra-2.0-8312-1.txt


 Snapshot repair can be very much slower than parallel repairs because of the 
 overhead of opening the SSTables in the snapshot. This is particularly true 
 when using LCS, as you typically have many smaller SSTables then.
 I compared parallel and sequential repair on a small range on one of our 
 clusters (2*3 replicas). With parallel repair, this took 22 seconds. With 
 sequential repair (default in 2.0), the same range took 330 seconds! This is 
 an overhead of 330-22*6 = 198 seconds, just opening SSTables (there were 
 1000+ sstables). Also, opening 1000 sstables for many smaller ranges surely 
 causes lots of memory churning.
 The idea would be to list the sstables in the snapshot, but use the 
 corresponding sstables in the live set if it's still available. For almost 
 all sstables, the original one should still exist.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7882) Memtable slab allocation should scale logarithmically to improve occupancy rate

2014-12-09 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14239891#comment-14239891
 ] 

Benedict commented on CASSANDRA-7882:
-

Yes, I don't think that's a problem.

 Memtable slab allocation should scale logarithmically to improve occupancy 
 rate
 ---

 Key: CASSANDRA-7882
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7882
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Jay Patel
Assignee: Benedict
  Labels: performance
 Fix For: 2.1.3

 Attachments: trunk-7882.txt


 CASSANDRA-5935 allows option to disable region-based allocation for on-heap 
 memtables but there is no option to disable it for off-heap memtables 
 (memtable_allocation_type: offheap_objects). 
 Disabling region-based allocation will allow us to pack more tables in the 
 schema since minimum of 1MB region won't be allocated per table. Downside can 
 be more fragmentation which should be controllable by using better allocator 
 like JEMalloc.
 How about below option in yaml?:
 memtable_allocation_type: unslabbed_offheap_objects
 Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8449) Allow zero-copy reads again

2014-12-10 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14240975#comment-14240975
 ] 

Benedict commented on CASSANDRA-8449:
-

Unless we explicitly force all queries to yield a timeout response even if they 
have successfully terminated after the timeout, and we enforce this constraint 
_after_ copying the data to the output buffers (netty and thrift), this is 
guaranteed to return junk data to a user somewhere, sometime. So I am -1 on 
this approach.

 Allow zero-copy reads again
 ---

 Key: CASSANDRA-8449
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8449
 Project: Cassandra
  Issue Type: Improvement
Reporter: T Jake Luciani
Assignee: T Jake Luciani
Priority: Minor
  Labels: performance
 Fix For: 3.0


 We disabled zero-copy reads in CASSANDRA-3179 due to in flight reads 
 accessing a ByteBuffer when the data was unmapped by compaction.  Currently 
 this code path is only used for uncompressed reads.
 The actual bytes are in fact copied to the client output buffers for both 
 netty and thrift before being sent over the wire, so the only issue really is 
 the time it takes to process the read internally.  
 This patch adds a slow network read test and changes the tidy() method to 
 actually delete a sstable once the readTimeout has elapsed giving plenty of 
 time to serialize the read.
 Removing this copy causes significantly less GC on the read path and improves 
 the tail latencies:
 http://cstar.datastax.com/graph?stats=c0c8ce16-7fea-11e4-959d-42010af0688fmetric=gc_countoperation=2_readsmoothing=1show_aggregates=truexmin=0xmax=109.34ymin=0ymax=5.5



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8449) Allow zero-copy reads again

2014-12-10 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241004#comment-14241004
 ] 

Benedict commented on CASSANDRA-8449:
-

That depends on how it is implemented. I will go out on a limb and predict it 
will offer no such guarantee, as there will always be a potential race 
condition (easily triggered by e.g. lengthy GC pauses) without enforcing the 
constraint _after_ performing the copy to the transport buffers, which is a 
very specific condition that I don't think is being considered for 
CASSANDRA-7392.

 Allow zero-copy reads again
 ---

 Key: CASSANDRA-8449
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8449
 Project: Cassandra
  Issue Type: Improvement
Reporter: T Jake Luciani
Assignee: T Jake Luciani
Priority: Minor
  Labels: performance
 Fix For: 3.0


 We disabled zero-copy reads in CASSANDRA-3179 due to in flight reads 
 accessing a ByteBuffer when the data was unmapped by compaction.  Currently 
 this code path is only used for uncompressed reads.
 The actual bytes are in fact copied to the client output buffers for both 
 netty and thrift before being sent over the wire, so the only issue really is 
 the time it takes to process the read internally.  
 This patch adds a slow network read test and changes the tidy() method to 
 actually delete a sstable once the readTimeout has elapsed giving plenty of 
 time to serialize the read.
 Removing this copy causes significantly less GC on the read path and improves 
 the tail latencies:
 http://cstar.datastax.com/graph?stats=c0c8ce16-7fea-11e4-959d-42010af0688fmetric=gc_countoperation=2_readsmoothing=1show_aggregates=truexmin=0xmax=109.34ymin=0ymax=5.5



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8449) Allow zero-copy reads again

2014-12-10 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241215#comment-14241215
 ] 

Benedict commented on CASSANDRA-8449:
-

CASSANDRA-7705 is really designed for situations where we know there won't be 
loads in-flight; i'd prefer not to reintroduce excessive long-lifetime 
reference counting onto the read critical path (we don't ref count sstable 
readers anymore, since CASSANDRA-6919).

All we're doing here is delaying when we unmap the file until a time it is known to be unused, so we could create a global OpOrder that guards against this; all requests that hit the node are guarded by the OpOrder for their entire duration, and we only actually free the data once _all_ requests that started before we _thought_ it was free have completed. Typically I would not want to use this approach for guarding operations that could take arbitrarily long, but really all we're sacrificing is virtual address space, so being delayed more than you expect (even excessively) should not noticeably impact system performance, as the OS can choose to drop those pages on the floor, keeping only the mapping overhead.
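
A very simplified model of the idea (this is not the OpOrder API, and unlike the real thing it also waits for reads that start after the free request; it is only meant to show the unmap being deferred):

{code}
import java.util.ArrayList;
import java.util.List;

public class DeferredUnmapSketch
{
    private int inFlight = 0;
    private final List<Runnable> deferred = new ArrayList<>();

    public synchronized void startRead()  { inFlight++; }

    public synchronized void finishRead()
    {
        if (--inFlight == 0)
        {
            for (Runnable r : deferred) r.run();   // now known to be unused
            deferred.clear();
        }
    }

    /** Free immediately if nothing is in flight, otherwise once the outstanding reads finish. */
    public synchronized void freeWhenUnused(Runnable unmap)
    {
        if (inFlight == 0) unmap.run();
        else deferred.add(unmap);
    }

    public static void main(String[] args)
    {
        DeferredUnmapSketch guard = new DeferredUnmapSketch();
        guard.startRead();
        guard.freeWhenUnused(() -> System.out.println("unmapping old segment"));
        System.out.println("compaction finished; unmap deferred while a read is in flight");
        guard.finishRead();   // last in-flight read completes -> the unmap runs now
    }
}
{code}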

 Allow zero-copy reads again
 ---

 Key: CASSANDRA-8449
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8449
 Project: Cassandra
  Issue Type: Improvement
Reporter: T Jake Luciani
Assignee: T Jake Luciani
Priority: Minor
  Labels: performance
 Fix For: 3.0


 We disabled zero-copy reads in CASSANDRA-3179 due to in flight reads 
 accessing a ByteBuffer when the data was unmapped by compaction.  Currently 
 this code path is only used for uncompressed reads.
 The actual bytes are in fact copied to the client output buffers for both 
 netty and thrift before being sent over the wire, so the only issue really is 
 the time it takes to process the read internally.  
 This patch adds a slow network read test and changes the tidy() method to 
 actually delete a sstable once the readTimeout has elapsed giving plenty of 
 time to serialize the read.
 Removing this copy causes significantly less GC on the read path and improves 
 the tail latencies:
 http://cstar.datastax.com/graph?stats=c0c8ce16-7fea-11e4-959d-42010af0688fmetric=gc_countoperation=2_readsmoothing=1show_aggregates=truexmin=0xmax=109.34ymin=0ymax=5.5



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation

2014-12-10 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241349#comment-14241349
 ] 

Benedict commented on CASSANDRA-7032:
-

If you mean for V vnode tokens in ascending order [0..V), and e.g. D disks, the disks would own one of the token lists in the set { [dV/D..(d+1)V/D) : 0 <= d < D }, and you guarantee that the owned range of each list is balanced with the other lists, this seems pretty analogous to the approach I was describing and perfectly reasonable.

The main goal is only that once a range or set of vnode tokens has been 
assigned to a given resource (disk, cpu, node, rack, whatever) that resource 
never needs to reassign its tokens.
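
For example, a trivial sketch of that contiguous assignment (illustrative only):

{code}
import java.util.ArrayList;
import java.util.List;

public class VnodeDiskAssignment
{
    /**
     * Assigns V vnode tokens (taken in ascending order) to D disks in contiguous blocks:
     * disk d owns vnode indices [d*V/D, (d+1)*V/D). Ownership never moves between disks
     * while V and D stay fixed, which is the property described above.
     */
    static List<List<Integer>> assign(int vnodes, int disks)
    {
        List<List<Integer>> perDisk = new ArrayList<>();
        for (int d = 0; d < disks; d++)
        {
            List<Integer> owned = new ArrayList<>();
            for (int v = d * vnodes / disks; v < (d + 1) * vnodes / disks; v++)
                owned.add(v);
            perDisk.add(owned);
        }
        return perDisk;
    }

    public static void main(String[] args)
    {
        // 8 vnodes over 3 disks: block sizes differ by at most one
        System.out.println(assign(8, 3));   // [[0, 1], [2, 3, 4], [5, 6, 7]]
    }
}
{code}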

 Improve vnode allocation
 

 Key: CASSANDRA-7032
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7032
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict
Assignee: Branimir Lambov
  Labels: performance, vnodes
 Fix For: 3.0

 Attachments: TestVNodeAllocation.java, TestVNodeAllocation.java, 
 TestVNodeAllocation.java


 It's been known for a little while that random vnode allocation causes 
 hotspots of ownership. It should be possible to improve dramatically on this 
 with deterministic allocation. I have quickly thrown together a simple greedy 
 algorithm that allocates vnodes efficiently, and will repair hotspots in a 
 randomly allocated cluster gradually as more nodes are added, and also 
 ensures that token ranges are fairly evenly spread between nodes (somewhat 
 tunably so). The allocation still permits slight discrepancies in ownership, 
 but it is bound by the inverse of the size of the cluster (as opposed to 
 random allocation, which strangely gets worse as the cluster size increases). 
 I'm sure there is a decent dynamic programming solution to this that would be 
 even better.
 If on joining the ring a new node were to CAS a shared table where a 
 canonical allocation of token ranges lives after running this (or a similar) 
 algorithm, we could then get guaranteed bounds on the ownership distribution 
 in a cluster. This will also help for CASSANDRA-6696.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-4139) Add varint encoding to Messaging service

2014-12-10 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241456#comment-14241456
 ] 

Benedict commented on CASSANDRA-4139:
-

We aren't bandwidth constrained for any workloads I'm aware of, so what are we 
hoping to achieve here? 

We already apply compression to the stream, so this will likely only help 
bandwidth consumption for individual small payloads where compression cannot be 
expected to yield much. In such scenarios bandwidth is especially unlikely to 
be a constraint.


 Add varint encoding to Messaging service
 

 Key: CASSANDRA-4139
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4139
 Project: Cassandra
  Issue Type: Sub-task
  Components: Core
Reporter: Vijay
Assignee: Ariel Weisberg
 Fix For: 3.0

 Attachments: 0001-CASSANDRA-4139-v1.patch, 
 0001-CASSANDRA-4139-v2.patch, 0001-CASSANDRA-4139-v4.patch, 
 0002-add-bytes-written-metric.patch, 4139-Test.rtf, 
 ASF.LICENSE.NOT.GRANTED--0001-CASSANDRA-4139-v3.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8449) Allow zero-copy reads again

2014-12-11 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242439#comment-14242439
 ] 

Benedict commented on CASSANDRA-8449:
-

bq. Isn't the existing use of OpOrder technically arbitrarily long due to GC 
for instance

Any delay caused by GC to the termination of an OpOrder.Group is instantaneous from the point of view of the waiter, since the waiter is also delayed by GC.

Either way, GC is not as arbitrarily long as I was referring to. Mostly I'm thinking about network consumers that haven't died but are, perhaps, in the process of doing so (GC death spiral), or where the network socket has frozen due to some other problem: i.e. where the problem is isolated from the rest of the host's functionality, but where being guarded by an OpOrder could conceivably cause it to infect the whole host's functionality. In reality we can probably guard against most of the risk, but I would still be reticent to use this scheme with that risk even minimally present without the ramifications being constrained as they are here.


 Allow zero-copy reads again
 ---

 Key: CASSANDRA-8449
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8449
 Project: Cassandra
  Issue Type: Improvement
Reporter: T Jake Luciani
Assignee: T Jake Luciani
Priority: Minor
  Labels: performance
 Fix For: 3.0


 We disabled zero-copy reads in CASSANDRA-3179 due to in flight reads 
 accessing a ByteBuffer when the data was unmapped by compaction.  Currently 
 this code path is only used for uncompressed reads.
 The actual bytes are in fact copied to the client output buffers for both 
 netty and thrift before being sent over the wire, so the only issue really is 
 the time it takes to process the read internally.  
 This patch adds a slow network read test and changes the tidy() method to 
 actually delete a sstable once the readTimeout has elapsed giving plenty of 
 time to serialize the read.
 Removing this copy causes significantly less GC on the read path and improves 
 the tail latencies:
 http://cstar.datastax.com/graph?stats=c0c8ce16-7fea-11e4-959d-42010af0688fmetric=gc_countoperation=2_readsmoothing=1show_aggregates=truexmin=0xmax=109.34ymin=0ymax=5.5



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8447) Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled

2014-12-11 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242462#comment-14242462
 ] 

Benedict commented on CASSANDRA-8447:
-

[~yangzhe1991]: I don't think your problem is related, since it looks to me 
like you're running 2.1? If so, if you could file another ticket and upload a 
heap dump from one of your smaller nodes, its config yaml, and a full system 
log from startup until the problem was encountered, I'll see if I can help 
pinpoint the problem.


 Nodes stuck in CMS GC cycle with very little traffic when compaction is 
 enabled
 ---

 Key: CASSANDRA-8447
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8447
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: Cluster size - 4 nodes
 Node size - 12 CPU (hyper threaded to 24 cores), 192 GB RAM, 2 Raid 0 arrays 
 (Data - 10 disk, spinning 10k drives | CL 2 disk, spinning 10k drives)
 OS - RHEL 6.5
 jvm - oracle 1.7.0_71
 Cassandra version 2.0.11
Reporter: jonathan lacefield
 Attachments: Node_with_compaction.png, Node_without_compaction.png, 
 cassandra.yaml, gc.logs.tar.gz, gcinspector_messages.txt, memtable_debug, 
 results.tar.gz, visualvm_screenshot


 Behavior - If autocompaction is enabled, nodes will become unresponsive due 
 to a full Old Gen heap which is not cleared during CMS GC.
 Test methodology - disabled autocompaction on 3 nodes, left autocompaction 
 enabled on 1 node.  Executed different Cassandra stress loads, using write 
 only operations.  Monitored visualvm and jconsole for heap pressure.  
 Captured iostat and dstat for most tests.  Captured heap dump from 50 thread 
 load.  Hints were disabled for testing on all nodes to alleviate GC noise due 
 to hints backing up.
 Data load test through Cassandra stress -  /usr/bin/cassandra-stress  write 
 n=19 -rate threads=different threads tested -schema  
 replication\(factor=3\)  keyspace=Keyspace1 -node all nodes listed
 Data load thread count and results:
 * 1 thread - Still running but looks like the node can sustain this load 
 (approx 500 writes per second per node)
 * 5 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range (approx 2k writes per second per node)
 * 10 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range
 * 50 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range  (approx 10k writes per second per node)
 * 100 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range  (approx 20k writes per second per node)
 * 200 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range  (approx 25k writes per second per node)
 Note - the observed behavior was the same for all tests except for the single 
 threaded test.  The single threaded test does not appear to show this 
 behavior.
 Tested different GC and Linux OS settings with a focus on the 50 and 200 
 thread loads.  
 JVM settings tested:
 #  default, out of the box, env-sh settings
 #  10 G Max | 1 G New - default env-sh settings
 #  10 G Max | 1 G New - default env-sh settings
 #* JVM_OPTS=$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=50
 #   20 G Max | 10 G New 
JVM_OPTS=$JVM_OPTS -XX:+UseParNewGC
JVM_OPTS=$JVM_OPTS -XX:+UseConcMarkSweepGC
JVM_OPTS=$JVM_OPTS -XX:+CMSParallelRemarkEnabled
JVM_OPTS=$JVM_OPTS -XX:SurvivorRatio=8
JVM_OPTS=$JVM_OPTS -XX:MaxTenuringThreshold=8
JVM_OPTS=$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75
JVM_OPTS=$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly
JVM_OPTS=$JVM_OPTS -XX:+UseTLAB
JVM_OPTS=$JVM_OPTS -XX:+CMSScavengeBeforeRemark
JVM_OPTS=$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=6
JVM_OPTS=$JVM_OPTS -XX:CMSWaitDuration=3
JVM_OPTS=$JVM_OPTS -XX:ParallelGCThreads=12
JVM_OPTS=$JVM_OPTS -XX:ConcGCThreads=12
JVM_OPTS=$JVM_OPTS -XX:+UnlockDiagnosticVMOptions
JVM_OPTS=$JVM_OPTS -XX:+UseGCTaskAffinity
JVM_OPTS=$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs
JVM_OPTS=$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768
JVM_OPTS=$JVM_OPTS -XX:-UseBiasedLocking
 # 20 G Max | 1 G New 
JVM_OPTS=$JVM_OPTS -XX:+UseParNewGC
JVM_OPTS=$JVM_OPTS -XX:+UseConcMarkSweepGC
JVM_OPTS=$JVM_OPTS -XX:+CMSParallelRemarkEnabled
JVM_OPTS=$JVM_OPTS -XX:SurvivorRatio=8
JVM_OPTS=$JVM_OPTS -XX:MaxTenuringThreshold=8
JVM_OPTS=$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75
JVM_OPTS=$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly
JVM_OPTS=$JVM_OPTS -XX:+UseTLAB
JVM_OPTS=$JVM_OPTS -XX:+CMSScavengeBeforeRemark
JVM_OPTS=$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=6

[jira] [Created] (CASSANDRA-8459) autocompaction on reads can prevent memtable space reclamation

2014-12-11 Thread Benedict (JIRA)
Benedict created CASSANDRA-8459:
---

 Summary: autocompaction on reads can prevent memtable space 
reclamation
 Key: CASSANDRA-8459
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8459
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Benedict
Assignee: Benedict
 Fix For: 2.1.3


Memtable memory reclamation is dependent on reads always making progress; 
however, on the collectTimeOrderedData critical path it is possible for the 
read to perform a _write_ inline, and for this write to block waiting for 
memtable space to be reclaimed. That reclamation is in turn blocked waiting 
for this read to complete.

There are a number of solutions to this, but the simplest is to make the 
defragmentation happen asynchronously, so that the read terminates normally.
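
As an illustration only (hypothetical names, not the attached patch), the shape 
of the asynchronous approach might look like this, with the read path handing 
the defragmenting write to a background executor instead of performing it 
inline:

{noformat}
// Sketch: the read no longer blocks on memtable space; any wait for
// reclamation happens on the background thread instead.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class DefragSketch
{
    // Single background thread; a small bounded pool would also work.
    private static final ExecutorService DEFRAG_EXECUTOR = Executors.newSingleThreadExecutor();

    // Called from the time-ordered read path when a fragmented row is worth re-writing.
    static void scheduleDefragmentation(Runnable defragmentingWrite)
    {
        DEFRAG_EXECUTOR.submit(defragmentingWrite);
    }
}
{noformat}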



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-8459) autocompaction on reads can prevent memtable space reclamation

2014-12-11 Thread Benedict (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benedict updated CASSANDRA-8459:

Attachment: 8459.txt

Attaching simple fix.

 autocompaction on reads can prevent memtable space reclamation
 -

 Key: CASSANDRA-8459
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8459
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Benedict
Assignee: Benedict
 Fix For: 2.1.3

 Attachments: 8459.txt


 Memtable memory reclamation is dependent on reads always making progress; 
 however, on the collectTimeOrderedData critical path it is possible for the 
 read to perform a _write_ inline, and for this write to block waiting for 
 memtable space to be reclaimed. That reclamation is in turn blocked waiting 
 for this read to complete.
 There are a number of solutions to this, but the simplest is to make the 
 defragmentation happen asynchronously, so that the read terminates normally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8447) Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled

2014-12-11 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242502#comment-14242502
 ] 

Benedict commented on CASSANDRA-8447:
-

[~yangzhe1991]: Your thread dump allowed me to trace the problem to 
CASSANDRA-8459.

 Nodes stuck in CMS GC cycle with very little traffic when compaction is 
 enabled
 ---

 Key: CASSANDRA-8447
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8447
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: Cluster size - 4 nodes
 Node size - 12 CPU (hyper threaded to 24 cores), 192 GB RAM, 2 Raid 0 arrays 
 (Data - 10 disk, spinning 10k drives | CL 2 disk, spinning 10k drives)
 OS - RHEL 6.5
 jvm - oracle 1.7.0_71
 Cassandra version 2.0.11
Reporter: jonathan lacefield
 Attachments: Node_with_compaction.png, Node_without_compaction.png, 
 cassandra.yaml, gc.logs.tar.gz, gcinspector_messages.txt, memtable_debug, 
 output.svg, results.tar.gz, visualvm_screenshot


 Behavior - If autocompaction is enabled, nodes will become unresponsive due 
 to a full Old Gen heap which is not cleared during CMS GC.
 Test methodology - disabled autocompaction on 3 nodes, left autocompaction 
 enabled on 1 node.  Executed different Cassandra stress loads, using write 
 only operations.  Monitored visualvm and jconsole for heap pressure.  
 Captured iostat and dstat for most tests.  Captured heap dump from 50 thread 
 load.  Hints were disabled for testing on all nodes to alleviate GC noise due 
 to hints backing up.
 Data load test through Cassandra stress -  /usr/bin/cassandra-stress  write 
 n=19 -rate threads=different threads tested -schema  
 replication\(factor=3\)  keyspace=Keyspace1 -node all nodes listed
 Data load thread count and results:
 * 1 thread - Still running but looks like the node can sustain this load 
 (approx 500 writes per second per node)
 * 5 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range (approx 2k writes per second per node)
 * 10 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range
 * 50 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range  (approx 10k writes per second per node)
 * 100 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range  (approx 20k writes per second per node)
 * 200 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range  (approx 25k writes per second per node)
 Note - the observed behavior was the same for all tests except for the single 
 threaded test.  The single threaded test does not appear to show this 
 behavior.
 Tested different GC and Linux OS settings with a focus on the 50 and 200 
 thread loads.  
 JVM settings tested:
 #  default, out of the box, env-sh settings
 #  10 G Max | 1 G New - default env-sh settings
 #  10 G Max | 1 G New - default env-sh settings
 #* JVM_OPTS=$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=50
 #   20 G Max | 10 G New 
JVM_OPTS=$JVM_OPTS -XX:+UseParNewGC
JVM_OPTS=$JVM_OPTS -XX:+UseConcMarkSweepGC
JVM_OPTS=$JVM_OPTS -XX:+CMSParallelRemarkEnabled
JVM_OPTS=$JVM_OPTS -XX:SurvivorRatio=8
JVM_OPTS=$JVM_OPTS -XX:MaxTenuringThreshold=8
JVM_OPTS=$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75
JVM_OPTS=$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly
JVM_OPTS=$JVM_OPTS -XX:+UseTLAB
JVM_OPTS=$JVM_OPTS -XX:+CMSScavengeBeforeRemark
JVM_OPTS=$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=6
JVM_OPTS=$JVM_OPTS -XX:CMSWaitDuration=3
JVM_OPTS=$JVM_OPTS -XX:ParallelGCThreads=12
JVM_OPTS=$JVM_OPTS -XX:ConcGCThreads=12
JVM_OPTS=$JVM_OPTS -XX:+UnlockDiagnosticVMOptions
JVM_OPTS=$JVM_OPTS -XX:+UseGCTaskAffinity
JVM_OPTS=$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs
JVM_OPTS=$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768
JVM_OPTS=$JVM_OPTS -XX:-UseBiasedLocking
 # 20 G Max | 1 G New 
JVM_OPTS=$JVM_OPTS -XX:+UseParNewGC
JVM_OPTS=$JVM_OPTS -XX:+UseConcMarkSweepGC
JVM_OPTS=$JVM_OPTS -XX:+CMSParallelRemarkEnabled
JVM_OPTS=$JVM_OPTS -XX:SurvivorRatio=8
JVM_OPTS=$JVM_OPTS -XX:MaxTenuringThreshold=8
JVM_OPTS=$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75
JVM_OPTS=$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly
JVM_OPTS=$JVM_OPTS -XX:+UseTLAB
JVM_OPTS=$JVM_OPTS -XX:+CMSScavengeBeforeRemark
JVM_OPTS=$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=6
JVM_OPTS=$JVM_OPTS -XX:CMSWaitDuration=3
JVM_OPTS=$JVM_OPTS -XX:ParallelGCThreads=12
JVM_OPTS=$JVM_OPTS -XX:ConcGCThreads=12
JVM_OPTS=$JVM_OPTS -XX:+UnlockDiagnosticVMOptions
JVM_OPTS=$JVM_OPTS -XX:+UseGCTaskAffinity

[jira] [Comment Edited] (CASSANDRA-8447) Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled

2014-12-11 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242502#comment-14242502
 ] 

Benedict edited comment on CASSANDRA-8447 at 12/11/14 1:20 PM:
---

[~yangzhe1991]: Your thread dump allowed me to trace your problem to 
CASSANDRA-8459. This is a 2.1-specific issue, and not related to this ticket.


was (Author: benedict):
[~yangzhe1991]: Your thread dump allowed me to trace the problem to 
CASSANDRA-8459.

 Nodes stuck in CMS GC cycle with very little traffic when compaction is 
 enabled
 ---

 Key: CASSANDRA-8447
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8447
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: Cluster size - 4 nodes
 Node size - 12 CPU (hyper threaded to 24 cores), 192 GB RAM, 2 Raid 0 arrays 
 (Data - 10 disk, spinning 10k drives | CL 2 disk, spinning 10k drives)
 OS - RHEL 6.5
 jvm - oracle 1.7.0_71
 Cassandra version 2.0.11
Reporter: jonathan lacefield
 Attachments: Node_with_compaction.png, Node_without_compaction.png, 
 cassandra.yaml, gc.logs.tar.gz, gcinspector_messages.txt, memtable_debug, 
 output.svg, results.tar.gz, visualvm_screenshot


 Behavior - If autocompaction is enabled, nodes will become unresponsive due 
 to a full Old Gen heap which is not cleared during CMS GC.
 Test methodology - disabled autocompaction on 3 nodes, left autocompaction 
 enabled on 1 node.  Executed different Cassandra stress loads, using write 
 only operations.  Monitored visualvm and jconsole for heap pressure.  
 Captured iostat and dstat for most tests.  Captured heap dump from 50 thread 
 load.  Hints were disabled for testing on all nodes to alleviate GC noise due 
 to hints backing up.
 Data load test through Cassandra stress -  /usr/bin/cassandra-stress  write 
 n=19 -rate threads=different threads tested -schema  
 replication\(factor=3\)  keyspace=Keyspace1 -node all nodes listed
 Data load thread count and results:
 * 1 thread - Still running but looks like the node can sustain this load 
 (approx 500 writes per second per node)
 * 5 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range (approx 2k writes per second per node)
 * 10 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range
 * 50 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range  (approx 10k writes per second per node)
 * 100 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range  (approx 20k writes per second per node)
 * 200 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range  (approx 25k writes per second per node)
 Note - the observed behavior was the same for all tests except for the single 
 threaded test.  The single threaded test does not appear to show this 
 behavior.
 Tested different GC and Linux OS settings with a focus on the 50 and 200 
 thread loads.  
 JVM settings tested:
 #  default, out of the box, env-sh settings
 #  10 G Max | 1 G New - default env-sh settings
 #  10 G Max | 1 G New - default env-sh settings
 #* JVM_OPTS=$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=50
 #   20 G Max | 10 G New 
JVM_OPTS=$JVM_OPTS -XX:+UseParNewGC
JVM_OPTS=$JVM_OPTS -XX:+UseConcMarkSweepGC
JVM_OPTS=$JVM_OPTS -XX:+CMSParallelRemarkEnabled
JVM_OPTS=$JVM_OPTS -XX:SurvivorRatio=8
JVM_OPTS=$JVM_OPTS -XX:MaxTenuringThreshold=8
JVM_OPTS=$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75
JVM_OPTS=$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly
JVM_OPTS=$JVM_OPTS -XX:+UseTLAB
JVM_OPTS=$JVM_OPTS -XX:+CMSScavengeBeforeRemark
JVM_OPTS=$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=6
JVM_OPTS=$JVM_OPTS -XX:CMSWaitDuration=3
JVM_OPTS=$JVM_OPTS -XX:ParallelGCThreads=12
JVM_OPTS=$JVM_OPTS -XX:ConcGCThreads=12
JVM_OPTS=$JVM_OPTS -XX:+UnlockDiagnosticVMOptions
JVM_OPTS=$JVM_OPTS -XX:+UseGCTaskAffinity
JVM_OPTS=$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs
JVM_OPTS=$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768
JVM_OPTS=$JVM_OPTS -XX:-UseBiasedLocking
 # 20 G Max | 1 G New 
JVM_OPTS=$JVM_OPTS -XX:+UseParNewGC
JVM_OPTS=$JVM_OPTS -XX:+UseConcMarkSweepGC
JVM_OPTS=$JVM_OPTS -XX:+CMSParallelRemarkEnabled
JVM_OPTS=$JVM_OPTS -XX:SurvivorRatio=8
JVM_OPTS=$JVM_OPTS -XX:MaxTenuringThreshold=8
JVM_OPTS=$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75
JVM_OPTS=$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly
JVM_OPTS=$JVM_OPTS -XX:+UseTLAB
JVM_OPTS=$JVM_OPTS -XX:+CMSScavengeBeforeRemark
JVM_OPTS=$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=6
JVM_OPTS=$JVM_OPTS 

[jira] [Commented] (CASSANDRA-8459) autocompaction on reads can prevent memtable space reclamation

2014-12-11 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242513#comment-14242513
 ] 

Benedict commented on CASSANDRA-8459:
-

No need; already sussed the problem and attached the fix.

 autocompaction on reads can prevent memtable space reclamation
 -

 Key: CASSANDRA-8459
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8459
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Benedict
Assignee: Benedict
 Fix For: 2.1.3

 Attachments: 8459.txt


 Memtable memory reclamation is dependent on reads always making progress; 
 however, on the collectTimeOrderedData critical path it is possible for the 
 read to perform a _write_ inline, and for this write to block waiting for 
 memtable space to be reclaimed. That reclamation is in turn blocked waiting 
 for this read to complete.
 There are a number of solutions to this, but the simplest is to make the 
 defragmentation happen asynchronously, so that the read terminates normally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8459) autocompaction on reads can prevent memtable space reclamation

2014-12-11 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242738#comment-14242738
 ] 

Benedict commented on CASSANDRA-8459:
-

It's probably not a *bad idea* for 2.0, as it stops a read from touching the 
write path, but it isn't necessary for correctness.

 autocompaction on reads can prevent memtable space reclamation
 -

 Key: CASSANDRA-8459
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8459
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Benedict
Assignee: Benedict
 Fix For: 2.1.3

 Attachments: 8459.txt


 Memtable memory reclamation is dependent on reads always making progress; 
 however, on the collectTimeOrderedData critical path it is possible for the 
 read to perform a _write_ inline, and for this write to block waiting for 
 memtable space to be reclaimed. That reclamation is in turn blocked waiting 
 for this read to complete.
 There are a number of solutions to this, but the simplest is to make the 
 defragmentation happen asynchronously, so that the read terminates normally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8447) Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled

2014-12-11 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242779#comment-14242779
 ] 

Benedict commented on CASSANDRA-8447:
-

The problem is pretty simple: MeteredFlusher runs on 
StorageService.optionalTasks, and other events scheduled there can take a long 
time. In particular hint delivery scheduling, which is preceded by a blocking 
compaction of the hints table, during which no other optional task can make 
progress.

MeteredFlusher should have its own dedicated thread, as responding promptly is 
essential; under this workload, running every couple of seconds is pretty much 
necessary to avoid a rapid, catastrophic build-up of state in memtables.
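
As a rough sketch of that change (hypothetical names, not a patch): schedule 
MeteredFlusher on its own executor so a long-running hints compaction on 
optionalTasks can no longer delay it.

{noformat}
// Sketch: a dedicated scheduler for MeteredFlusher, independent of optionalTasks.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

final class MeteredFlusherSchedulingSketch
{
    static ScheduledExecutorService start(Runnable meteredFlusher)
    {
        ScheduledExecutorService flusherThread = Executors.newSingleThreadScheduledExecutor(new ThreadFactory()
        {
            public Thread newThread(Runnable r)
            {
                Thread t = new Thread(r, "MeteredFlusher");
                t.setDaemon(true);
                return t;
            }
        });
        // Re-check memtable occupancy every second, regardless of what optionalTasks is doing.
        flusherThread.scheduleWithFixedDelay(meteredFlusher, 1, 1, TimeUnit.SECONDS);
        return flusherThread;
    }
}
{noformat}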

 Nodes stuck in CMS GC cycle with very little traffic when compaction is 
 enabled
 ---

 Key: CASSANDRA-8447
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8447
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: Cluster size - 4 nodes
 Node size - 12 CPU (hyper threaded to 24 cores), 192 GB RAM, 2 Raid 0 arrays 
 (Data - 10 disk, spinning 10k drives | CL 2 disk, spinning 10k drives)
 OS - RHEL 6.5
 jvm - oracle 1.7.0_71
 Cassandra version 2.0.11
Reporter: jonathan lacefield
 Attachments: Node_with_compaction.png, Node_without_compaction.png, 
 cassandra.yaml, gc.logs.tar.gz, gcinspector_messages.txt, memtable_debug, 
 output.1.svg, output.2.svg, output.svg, results.tar.gz, visualvm_screenshot


 Behavior - If autocompaction is enabled, nodes will become unresponsive due 
 to a full Old Gen heap which is not cleared during CMS GC.
 Test methodology - disabled autocompaction on 3 nodes, left autocompaction 
 enabled on 1 node.  Executed different Cassandra stress loads, using write 
 only operations.  Monitored visualvm and jconsole for heap pressure.  
 Captured iostat and dstat for most tests.  Captured heap dump from 50 thread 
 load.  Hints were disabled for testing on all nodes to alleviate GC noise due 
 to hints backing up.
 Data load test through Cassandra stress -  /usr/bin/cassandra-stress  write 
 n=19 -rate threads=different threads tested -schema  
 replication\(factor=3\)  keyspace=Keyspace1 -node all nodes listed
 Data load thread count and results:
 * 1 thread - Still running but looks like the node can sustain this load 
 (approx 500 writes per second per node)
 * 5 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range (approx 2k writes per second per node)
 * 10 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range
 * 50 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range  (approx 10k writes per second per node)
 * 100 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range  (approx 20k writes per second per node)
 * 200 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range  (approx 25k writes per second per node)
 Note - the observed behavior was the same for all tests except for the single 
 threaded test.  The single threaded test does not appear to show this 
 behavior.
 Tested different GC and Linux OS settings with a focus on the 50 and 200 
 thread loads.  
 JVM settings tested:
 #  default, out of the box, env-sh settings
 #  10 G Max | 1 G New - default env-sh settings
 #  10 G Max | 1 G New - default env-sh settings
 #* JVM_OPTS=$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=50
 #   20 G Max | 10 G New 
JVM_OPTS=$JVM_OPTS -XX:+UseParNewGC
JVM_OPTS=$JVM_OPTS -XX:+UseConcMarkSweepGC
JVM_OPTS=$JVM_OPTS -XX:+CMSParallelRemarkEnabled
JVM_OPTS=$JVM_OPTS -XX:SurvivorRatio=8
JVM_OPTS=$JVM_OPTS -XX:MaxTenuringThreshold=8
JVM_OPTS=$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75
JVM_OPTS=$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly
JVM_OPTS=$JVM_OPTS -XX:+UseTLAB
JVM_OPTS=$JVM_OPTS -XX:+CMSScavengeBeforeRemark
JVM_OPTS=$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=6
JVM_OPTS=$JVM_OPTS -XX:CMSWaitDuration=3
JVM_OPTS=$JVM_OPTS -XX:ParallelGCThreads=12
JVM_OPTS=$JVM_OPTS -XX:ConcGCThreads=12
JVM_OPTS=$JVM_OPTS -XX:+UnlockDiagnosticVMOptions
JVM_OPTS=$JVM_OPTS -XX:+UseGCTaskAffinity
JVM_OPTS=$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs
JVM_OPTS=$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768
JVM_OPTS=$JVM_OPTS -XX:-UseBiasedLocking
 # 20 G Max | 1 G New 
JVM_OPTS=$JVM_OPTS -XX:+UseParNewGC
JVM_OPTS=$JVM_OPTS -XX:+UseConcMarkSweepGC
JVM_OPTS=$JVM_OPTS -XX:+CMSParallelRemarkEnabled
JVM_OPTS=$JVM_OPTS -XX:SurvivorRatio=8
JVM_OPTS=$JVM_OPTS -XX:MaxTenuringThreshold=8

[jira] [Comment Edited] (CASSANDRA-8447) Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled

2014-12-11 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242779#comment-14242779
 ] 

Benedict edited comment on CASSANDRA-8447 at 12/11/14 4:59 PM:
---

The problem is pretty simple: MeteredFlusher runs on 
StorageService.optionalTasks, and other events scheduled there can take a long 
time. In particular hint delivery scheduling, which is preceded by a blocking 
compaction of the hints table, during which no other optional task can make 
progress.

MeteredFlusher should have its own dedicated thread, as responding promptly is 
essential; under this workload, running every couple of seconds is pretty much 
necessary to avoid a rapid, catastrophic build-up of state in memtables.

(edit: in case there's any ambiguity, this isn't a hypothesis. The heap dump 
clearly shows optionalTasks blocked waiting on the result of a FutureTask 
executing a runnable defined in CompactionManager (as far as I can tell in 
submitUserDefined); the current live memtable is retaining 6M records at 6 GB 
of retained heap, so MeteredFlusher hasn't had its turn in a long time.)


was (Author: benedict):
The problem is pretty simple: MeteredFlusher runs on 
StorageService.optionalTasks, and other events scheduled there can take a long 
time. In particular hint delivery scheduling, which is preceded by a blocking 
compaction of the hints table, during which no other optional task can make 
progress.

MeteredFlusher should have its own dedicated thread, as responding promptly is 
essential; under this workload, running every couple of seconds is pretty much 
necessary to avoid a rapid, catastrophic build-up of state in memtables.

 Nodes stuck in CMS GC cycle with very little traffic when compaction is 
 enabled
 ---

 Key: CASSANDRA-8447
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8447
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: Cluster size - 4 nodes
 Node size - 12 CPU (hyper threaded to 24 cores), 192 GB RAM, 2 Raid 0 arrays 
 (Data - 10 disk, spinning 10k drives | CL 2 disk, spinning 10k drives)
 OS - RHEL 6.5
 jvm - oracle 1.7.0_71
 Cassandra version 2.0.11
Reporter: jonathan lacefield
 Attachments: Node_with_compaction.png, Node_without_compaction.png, 
 cassandra.yaml, gc.logs.tar.gz, gcinspector_messages.txt, memtable_debug, 
 output.1.svg, output.2.svg, output.svg, results.tar.gz, visualvm_screenshot


 Behavior - If autocompaction is enabled, nodes will become unresponsive due 
 to a full Old Gen heap which is not cleared during CMS GC.
 Test methodology - disabled autocompaction on 3 nodes, left autocompaction 
 enabled on 1 node.  Executed different Cassandra stress loads, using write 
 only operations.  Monitored visualvm and jconsole for heap pressure.  
 Captured iostat and dstat for most tests.  Captured heap dump from 50 thread 
 load.  Hints were disabled for testing on all nodes to alleviate GC noise due 
 to hints backing up.
 Data load test through Cassandra stress -  /usr/bin/cassandra-stress  write 
 n=19 -rate threads=different threads tested -schema  
 replication\(factor=3\)  keyspace=Keyspace1 -node all nodes listed
 Data load thread count and results:
 * 1 thread - Still running but looks like the node can sustain this load 
 (approx 500 writes per second per node)
 * 5 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range (approx 2k writes per second per node)
 * 10 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range
 * 50 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range  (approx 10k writes per second per node)
 * 100 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range  (approx 20k writes per second per node)
 * 200 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range  (approx 25k writes per second per node)
 Note - the observed behavior was the same for all tests except for the single 
 threaded test.  The single threaded test does not appear to show this 
 behavior.
 Tested different GC and Linux OS settings with a focus on the 50 and 200 
 thread loads.  
 JVM settings tested:
 #  default, out of the box, env-sh settings
 #  10 G Max | 1 G New - default env-sh settings
 #  10 G Max | 1 G New - default env-sh settings
 #* JVM_OPTS=$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=50
 #   20 G Max | 10 G New 
JVM_OPTS=$JVM_OPTS -XX:+UseParNewGC
JVM_OPTS=$JVM_OPTS -XX:+UseConcMarkSweepGC
JVM_OPTS=$JVM_OPTS -XX:+CMSParallelRemarkEnabled

[jira] [Commented] (CASSANDRA-8458) Avoid streaming from tmplink files

2014-12-11 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242832#comment-14242832
 ] 

Benedict commented on CASSANDRA-8458:
-

We could also try and figure out how/why this happens, as it should be able to 
stream safely.

Does it only happen if streaming a range that wraps zero (i.e. from +X, to -Y)?

 Avoid streaming from tmplink files
 --

 Key: CASSANDRA-8458
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8458
 Project: Cassandra
  Issue Type: Bug
Reporter: Marcus Eriksson
Assignee: Marcus Eriksson
 Fix For: 2.1.3


 Looks like we include tmplink sstables in streams in 2.1+, and when we do, 
 sometimes we get this error message on the receiving side: 
 {{java.io.IOException: Corrupt input data, block did not start with 2 byte 
 signature ('ZV') followed by type byte, 2-byte length)}}. I've only seen this 
 happen when a tmplink sstable is included in the stream.
 We can not just exclude the tmplink files when starting the stream - we need 
 to include the original file, which we might miss since we check if the 
 requested stream range intersects the sstable range.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-8458) Avoid streaming from tmplink files

2014-12-11 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242832#comment-14242832
 ] 

Benedict edited comment on CASSANDRA-8458 at 12/11/14 5:45 PM:
---

We could also try and figure out how/why this happens, as it should be able to 
stream safely.

Does it only happen if streaming a range that wraps zero (i.e. from +X, to -Y)? 

edit: To elaborate, I suspect the broken bit is that our dfile/ifile objects 
don't actually truncate the readable range - only our indexed decoratedkey 
range is truncated. In sstable.getPositionsForRanges we just return the end of 
the file if the range goes past the range of the file; in this case we could 
stream partially written data. If so, we could fix this simply by making 
sstable.getPositionsForRanges() look up the start position of the last key in 
the file, and by always ensuring we leave a key's overlap between the dropped 
sstables and the replacement.
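
A simplified sketch of that clamp (hypothetical types, not the real 
SSTableReader code): never let a streamed section extend past the start of the 
last fully written key.

{noformat}
// Sketch: clamp each (start, end) file section to the start of the last indexed key,
// so partially written data at the tail of a tmplink file is never streamed.
final class SectionClampSketch
{
    static final class Section
    {
        final long start, end;
        Section(long start, long end) { this.start = start; this.end = end; }
    }

    // requestedEnd: what the range lookup produced (possibly the raw end of the file).
    // lastKeyStart: start position of the last key in the index.
    static Section clamp(long start, long requestedEnd, long lastKeyStart)
    {
        long safeEnd = Math.min(requestedEnd, lastKeyStart);
        return safeEnd > start ? new Section(start, safeEnd) : null; // nothing safe to stream
    }
}
{noformat}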


was (Author: benedict):
We could also try and figure out how/why this happens, as it should be able to 
stream safely.

Does it only happen if streaming a range that wraps zero (i.e. from +X, to -Y)?

 Avoid streaming from tmplink files
 --

 Key: CASSANDRA-8458
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8458
 Project: Cassandra
  Issue Type: Bug
Reporter: Marcus Eriksson
Assignee: Marcus Eriksson
 Fix For: 2.1.3


 Looks like we include tmplink sstables in streams in 2.1+, and when we do, 
 sometimes we get this error message on the receiving side: 
 {{java.io.IOException: Corrupt input data, block did not start with 2 byte 
 signature ('ZV') followed by type byte, 2-byte length)}}. I've only seen this 
 happen when a tmplink sstable is included in the stream.
 We can not just exclude the tmplink files when starting the stream - we need 
 to include the original file, which we might miss since we check if the 
 requested stream range intersects the sstable range.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8457) nio MessagingService

2014-12-11 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242859#comment-14242859
 ] 

Benedict commented on CASSANDRA-8457:
-

FTR, I strongly doubt _context switching_ is actually as much of a problem as 
we think, although constraining it is never a bad thing. The big hit we have is 
_thread signalling_ cost, which is a different but related beast. Certainly the 
talking point that raised this was discussing system time spent serving context 
switches, which would definitely be referring to signalling, not the switching 
itself.

Now, we do use a BlockingQueue for OutboundTcpConnection, which will incur these 
costs; however, I strongly suspect the impact will be much lower than predicted 
- especially as the testing done to flag this up was on small clusters with 
RF=1, where these threads would not be exercised at all. The costs of going to 
the network itself are likely to exceed the context switching costs, and they 
naturally permit messages to accumulate in the queue, reducing the number of 
signals actually needed.
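
To illustrate why accumulation in the queue amortises the signalling cost (a 
generic sketch, not the actual OutboundTcpConnection loop):

{noformat}
// Sketch: the sender blocks (and pays a wake-up signal) only when idle; everything
// that accumulated while it was writing is drained and flushed in one pass.
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

final class BatchedSenderSketch
{
    static void sendLoop(BlockingQueue<byte[]> queue, OutputStream socketOut) throws Exception
    {
        List<byte[]> batch = new ArrayList<byte[]>();
        while (!Thread.currentThread().isInterrupted())
        {
            batch.add(queue.take()); // one signal per idle period, not per message
            queue.drainTo(batch);    // pick up whatever else arrived, for free
            for (byte[] message : batch)
                socketOut.write(message);
            socketOut.flush();
            batch.clear();
        }
    }
}
{noformat}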

There are then the negative performance implications we have found with small 
numbers of connections under NIO to consider, so this change could have 
significant downsides for the majority of deployed clusters (although if we get 
batching in the client driver we may see these penalties disappear).

To establish whether there is likely a benefit to exploit, we could refactor 
this code comparatively minimally (compared with rewriting to NIO/Netty) to 
make use of the SharedExecutorPool, to establish whether such a positive effect 
is indeed to be had, as this would reduce the number of threads in flight to 
those actually serving work on the OTCs. This wouldn't affect the ITCs, but I 
am dubious of their contribution. We should probably also actually test whether 
this is indeed a problem on clusters at scale performing in-memory CL1 reads.


 nio MessagingService
 

 Key: CASSANDRA-8457
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8457
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
Reporter: Jonathan Ellis
Assignee: Ariel Weisberg
  Labels: performance
 Fix For: 3.0


 Thread-per-peer (actually two each incoming and outbound) is a big 
 contributor to context switching, especially for larger clusters.  Let's look 
 at switching to nio, possibly via Netty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-8466) Stress support for treating clients as truly independent entities (separate driver instance)

2014-12-12 Thread Benedict (JIRA)
Benedict created CASSANDRA-8466:
---

 Summary: Stress support for treating clients as truly independent 
entities (separate driver instance)
 Key: CASSANDRA-8466
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8466
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Benedict
 Fix For: 2.1.3


For performance testing purposes, it would be helpful to be able to mimic truly 
independent clients. The easiest way to do this is to use a unique classloader 
for instantiating the driver for each client, which should be a reasonably 
straightforward change.
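
A minimal sketch of the classloader approach (class name and jar locations are 
placeholders): load the driver through a fresh URLClassLoader per simulated 
client so no static driver state is shared between them.

{noformat}
// Sketch: one isolated classloader (and hence one independent copy of the driver)
// per simulated client.
import java.net.URL;
import java.net.URLClassLoader;

final class IsolatedClientSketch
{
    static Object newIsolatedInstance(URL[] driverJars, String className) throws Exception
    {
        URLClassLoader isolated = new URLClassLoader(driverJars, null); // no shared parent
        Class<?> clazz = Class.forName(className, true, isolated);
        return clazz.newInstance();
    }
}
{noformat}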



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8457) nio MessagingService

2014-12-12 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243924#comment-14243924
 ] 

Benedict commented on CASSANDRA-8457:
-

bq. cstar doesn't support multiple stress clients

Stress could be modified to support simulating true multiple-client access; 
I've filed CASSANDRA-8466.

What we really need, though, is to be able to fire up a (much) larger cluster, 
which with our current hardware would probably necessitate multiple VMs per 
node - say 4, giving a viable cluster of 24, which is probably about the bare 
minimum for these kinds of tests. This necessarily pollutes the results 
somewhat, since each VM will have only half a CPU and incur extra thread 
signalling penalties, but it's better than nothing. Either that, or we get a 
bunch of cheapo nodes, or we add EC2 integration.

[~enigmacurry] any plans in the works to introduce support for large clusters?

 nio MessagingService
 

 Key: CASSANDRA-8457
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8457
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
Reporter: Jonathan Ellis
Assignee: Ariel Weisberg
  Labels: performance
 Fix For: 3.0


 Thread-per-peer (actually two each incoming and outbound) is a big 
 contributor to context switching, especially for larger clusters.  Let's look 
 at switching to nio, possibly via Netty.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8466) Stress support for treating clients as truly independent entities (separate driver instance)

2014-12-12 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243925#comment-14243925
 ] 

Benedict commented on CASSANDRA-8466:
-

[~mfiguiere] are there any plans to support tuning the number of IO threads 
spawned by the driver? For this ticket it would be extremely sane to limit it 
to just 1.

 Stress support for treating clients as truly independent entities (separate 
 driver instance)
 

 Key: CASSANDRA-8466
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8466
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Benedict
 Fix For: 2.1.3


 For performance testing purposes, it would be helpful to be able to mimic 
 truly independent clients. The easiest way to do this is to use a unique 
 classloader for instantiating the driver for each client, which should be a 
 reasonably straightforward change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8308) Windows: Commitlog access violations on unit tests

2014-12-12 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14244033#comment-14244033
 ] 

Benedict commented on CASSANDRA-8308:
-

bq.  I'm not sure how that's related to this patch

My bad, I misread the patch boundary.

Since we're opening/closing an extra file, it might be worth only performing 
the action if the channel isn't the correct size, since it typically will be 
(so: open the channel; if it's the wrong size, close it, open a 
RandomAccessFile, set the length, and reopen the channel).
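
For example, something along these lines (a sketch of the suggested ordering, 
not the patch itself):

{noformat}
// Sketch: open the channel first, and only pay for the extra RandomAccessFile
// open/close when the size is actually wrong.
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.file.StandardOpenOption;

final class SegmentSizeSketch
{
    static FileChannel openWithSize(File segment, long expectedSize) throws Exception
    {
        FileChannel channel = FileChannel.open(segment.toPath(), StandardOpenOption.READ, StandardOpenOption.WRITE);
        if (channel.size() != expectedSize)
        {
            channel.close();
            RandomAccessFile raf = new RandomAccessFile(segment, "rw");
            raf.setLength(expectedSize);
            raf.close();
            channel = FileChannel.open(segment.toPath(), StandardOpenOption.READ, StandardOpenOption.WRITE);
        }
        return channel;
    }
}
{noformat}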

I haven't tested the change to introduce strerror - are you confident in it, 
and have you tested it? It might be sensible to split it into its own ticket.

Otherwise LGTM.

 Windows: Commitlog access violations on unit tests
 --

 Key: CASSANDRA-8308
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8308
 Project: Cassandra
  Issue Type: Bug
Reporter: Joshua McKenzie
Assignee: Joshua McKenzie
Priority: Minor
  Labels: Windows
 Fix For: 3.0

 Attachments: 8308_v1.txt, 8308_v2.txt


 We have four unit tests failing on trunk on Windows, all with 
 FileSystemException's related to the SchemaLoader:
 {noformat}
 [junit] Test 
 org.apache.cassandra.db.compaction.DateTieredCompactionStrategyTest FAILED
 [junit] Test org.apache.cassandra.cql3.ThriftCompatibilityTest FAILED
 [junit] Test org.apache.cassandra.io.sstable.SSTableRewriterTest FAILED
 [junit] Test org.apache.cassandra.repair.LocalSyncTaskTest FAILED
 {noformat}
 Example error:
 {noformat}
 [junit] Caused by: java.nio.file.FileSystemException: 
 build\test\cassandra\commitlog;0\CommitLog-5-1415908745965.log: The process 
 cannot access the file because it is being used by another process.
 [junit]
 [junit] at 
 sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
 [junit] at 
 sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
 [junit] at 
 sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
 [junit] at 
 sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269)
 [junit] at 
 sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
 [junit] at java.nio.file.Files.delete(Files.java:1079)
 [junit] at 
 org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:125)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-6993) Windows: remove mmap'ed I/O for index files and force standard file access

2014-12-12 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14244036#comment-14244036
 ] 

Benedict commented on CASSANDRA-6993:
-

This wouldn't be sufficient for the procfs check, as Mac (and by default 
FreeBSD) don't have it.
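
If it helps, a guard for that check could be as simple as probing for the mount 
first (a sketch only; the method name is illustrative):

{noformat}
// Sketch: only rely on the procfs-based check when /proc is actually present,
// which it is not on Mac OS X, nor by default on FreeBSD.
import java.io.File;

final class ProcfsSketch
{
    static boolean procfsAvailable()
    {
        return new File("/proc/self").isDirectory();
    }
}
{noformat}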

 Windows: remove mmap'ed I/O for index files and force standard file access
 --

 Key: CASSANDRA-6993
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6993
 Project: Cassandra
  Issue Type: Improvement
Reporter: Joshua McKenzie
Assignee: Joshua McKenzie
Priority: Minor
  Labels: Windows
 Fix For: 3.0, 2.1.3

 Attachments: 6993_2.1_v1.txt, 6993_v1.txt, 6993_v2.txt


 Memory-mapped I/O on Windows causes issues with hard-links; we're unable to 
 delete hard-links to open files with memory-mapped segments even using nio.  
 We'll need to push for close to performance parity between mmap'ed I/O and 
 buffered going forward as the buffered / compressed path offers other 
 benefits.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8248) Possible memory leak

2014-12-12 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14244039#comment-14244039
 ] 

Benedict commented on CASSANDRA-8248:
-

+1

 Possible memory leak 
 -

 Key: CASSANDRA-8248
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8248
 Project: Cassandra
  Issue Type: Bug
Reporter: Alexander Sterligov
Assignee: Joshua McKenzie
 Attachments: 8248_v1.txt, thread_dump


 Sometimes during repair cassandra starts to consume more memory than expected.
 Total amount of data on node is about 20GB.
 Size of the data directory is 66GC because of snapshots.
 Top reports: 
 {noformat}
   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
 15724 loadbase  20   0  493g  55g  44g S   28 44.2   4043:24 java
 {noformat}
 At the /proc/15724/maps there are a lot of deleted file maps
 {quote}
 7f63a6102000-7f63a6332000 r--s  08:21 9442763
 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db
  (deleted)
 7f63a6332000-7f63a6562000 r--s  08:21 9442763
 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db
  (deleted)
 7f63a6562000-7f63a6792000 r--s  08:21 9442763
 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db
  (deleted)
 7f63a6792000-7f63a69c2000 r--s  08:21 9442763
 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db
  (deleted)
 7f63a69c2000-7f63a6bf2000 r--s  08:21 9442763
 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db
  (deleted)
 7f63a6bf2000-7f63a6e22000 r--s  08:21 9442763
 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db
  (deleted)
 7f63a6e22000-7f63a7052000 r--s  08:21 9442763
 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db
  (deleted)
 7f63a7052000-7f63a7282000 r--s  08:21 9442763
 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db
  (deleted)
 7f63a7282000-7f63a74b2000 r--s  08:21 9442763
 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db
  (deleted)
 7f63a74b2000-7f63a76e2000 r--s  08:21 9442763
 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db
  (deleted)
 7f63a76e2000-7f63a7912000 r--s  08:21 9442763
 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db
  (deleted)
 7f63a7912000-7f63a7b42000 r--s  08:21 9442763
 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db
  (deleted)
 7f63a7b42000-7f63a7d72000 r--s  08:21 9442763
 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db
  (deleted)
 7f63a7d72000-7f63a7fa2000 r--s  08:21 9442763
 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db
  (deleted)
 7f63a7fa2000-7f63a81d2000 r--s  08:21 9442763
 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db
  (deleted)
 7f63a81d2000-7f63a8402000 r--s  08:21 9442763
 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db
  (deleted)
 7f63a8402000-7f63a8622000 r--s  08:21 9442763
 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db
  (deleted)
 7f63a8622000-7f63a8842000 r--s  08:21 9442763
 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db
  (deleted)
 7f63a8842000-7f63a8a62000 r--s  08:21 9442763
 

[jira] [Commented] (CASSANDRA-8466) Stress support for treating clients as truly independent entities (separate driver instance)

2014-12-12 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14244042#comment-14244042
 ] 

Benedict commented on CASSANDRA-8466:
-

An even easier approach suggested by [~omichallat] is to simply open a session 
for each simulated client. This would be a really trivial change.
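
For instance (a sketch against the Java driver's public API; sharing a single 
Cluster is an assumption here, and error handling is omitted):

{noformat}
// Sketch: one Session (and hence one connection pool) per simulated client.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

final class SessionPerClientSketch
{
    static Session[] connectClients(Cluster cluster, int clientCount, String keyspace)
    {
        Session[] sessions = new Session[clientCount];
        for (int i = 0; i < clientCount; i++)
            sessions[i] = cluster.connect(keyspace);
        return sessions;
    }
}
{noformat}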

 Stress support for treating clients as truly independent entities (separate 
 driver instance)
 

 Key: CASSANDRA-8466
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8466
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Benedict
 Fix For: 2.1.3


 For performance testing purposes, it would be helpful to be able to mimic 
 truly independent clients. The easiest way to do this is to use a unique 
 classloader for instantiating the driver for each client, which should be a 
 reasonably straightforward change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-8468) Stress support for multiple asynchronous operations per client

2014-12-12 Thread Benedict (JIRA)
Benedict created CASSANDRA-8468:
---

 Summary: Stress support for multiple asynchronous operations per 
client
 Key: CASSANDRA-8468
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8468
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Benedict


In conjunction with CASSANDRA-8466, this would permit more tunable variation in 
network load generation characteristics.
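
As a sketch of what per-client tuning might look like (illustrative only; the 
in-flight limit and statements are placeholders, not an agreed design):

{noformat}
// Sketch: each simulated client keeps up to maxInFlight asynchronous operations
// outstanding, bounded by a semaphore.
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import java.util.concurrent.Semaphore;

final class AsyncOpsPerClientSketch
{
    static void run(Session session, Iterable<Statement> work, int maxInFlight) throws InterruptedException
    {
        final Semaphore inFlight = new Semaphore(maxInFlight);
        for (Statement statement : work)
        {
            inFlight.acquire(); // cap concurrent operations for this client
            ResultSetFuture future = session.executeAsync(statement);
            Futures.addCallback(future, new FutureCallback<ResultSet>()
            {
                public void onSuccess(ResultSet rows) { inFlight.release(); }
                public void onFailure(Throwable t) { inFlight.release(); }
            });
        }
    }
}
{noformat}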



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-8469) Stress support for distributed operation, coordinated by a single stress process

2014-12-12 Thread Benedict (JIRA)
Benedict created CASSANDRA-8469:
---

 Summary: Stress support for distributed operation, coordinated by 
a single stress process
 Key: CASSANDRA-8469
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8469
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Benedict


As we test larger clusters, we need to run multiple stress clients (this is 
already the case for many users trialling c*). Baking in (initially simple) 
support for controlling and reporting multiple stress daemons from one command 
line would be extremely helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8447) Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled

2014-12-12 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14244089#comment-14244089
 ] 

Benedict commented on CASSANDRA-8447:
-

In this case the optionalTasks thread was not blocked at the point the heap 
dump was taken, but it appears it had still been blocked for several minutes, 
when it needs to run every few seconds. So whilst I cannot guarantee hints were 
the cause of the delay, we can be fairly certain the delay is the problem, and 
we should move MeteredFlusher to its own dedicated thread. Approximately 200x 
as much data accumulated before a flush was triggered as under normal operation.

{noformat}
 INFO [OptionalTasks:1] 2014-12-11 12:29:18,154 MeteredFlusher.java (line 58) 
flushing high-traffic column family CFS(Keyspace='Keyspace1', 
ColumnFamily='Standard1') (estimated 175643600 bytes)
 INFO [OptionalTasks:1] 2014-12-11 12:29:18,155 ColumnFamilyStore.java (line 
794) Enqueuing flush of Memtable-Standard1@1155435229(17589220/175892200 
serialized/live bytes, 399755 ops)
 INFO [OptionalTasks:1] 2014-12-11 12:36:24,642 MeteredFlusher.java (line 69) 
estimated 33071928850 live and 33071449400 flushing bytes used by all memtables
 INFO [OptionalTasks:1] 2014-12-11 12:36:24,642 MeteredFlusher.java (line 92) 
flushing CFS(Keyspace='Keyspace1', ColumnFamily='Standard1') to free up 
33071687000 bytes
 INFO [OptionalTasks:1] 2014-12-11 12:36:24,643 ColumnFamilyStore.java (line 
794) Enqueuing flush of Memtable-Standard1@401833564(3307178160/33071781600 
serialized/live bytes, 75163140 ops)
{noformat}

 Nodes stuck in CMS GC cycle with very little traffic when compaction is 
 enabled
 ---

 Key: CASSANDRA-8447
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8447
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: Cluster size - 4 nodes
 Node size - 12 CPU (hyper threaded to 24 cores), 192 GB RAM, 2 Raid 0 arrays 
 (Data - 10 disk, spinning 10k drives | CL 2 disk, spinning 10k drives)
 OS - RHEL 6.5
 jvm - oracle 1.7.0_71
 Cassandra version 2.0.11
Reporter: jonathan lacefield
 Attachments: Node_with_compaction.png, Node_without_compaction.png, 
 cassandra.yaml, gc.logs.tar.gz, gcinspector_messages.txt, memtable_debug, 
 output.1.svg, output.2.svg, output.svg, results.tar.gz, visualvm_screenshot


 Behavior - If autocompaction is enabled, nodes will become unresponsive due 
 to a full Old Gen heap which is not cleared during CMS GC.
 Test methodology - disabled autocompaction on 3 nodes, left autocompaction 
 enabled on 1 node.  Executed different Cassandra stress loads, using write 
 only operations.  Monitored visualvm and jconsole for heap pressure.  
 Captured iostat and dstat for most tests.  Captured heap dump from 50 thread 
 load.  Hints were disabled for testing on all nodes to alleviate GC noise due 
 to hints backing up.
 Data load test through Cassandra stress -  /usr/bin/cassandra-stress  write 
 n=19 -rate threads=different threads tested -schema  
 replication\(factor=3\)  keyspace=Keyspace1 -node all nodes listed
 Data load thread count and results:
 * 1 thread - Still running but looks like the node can sustain this load 
 (approx 500 writes per second per node)
 * 5 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range (approx 2k writes per second per node)
 * 10 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range
 * 50 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range  (approx 10k writes per second per node)
 * 100 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range  (approx 20k writes per second per node)
 * 200 threads - Nodes become unresponsive due to full Old Gen Heap.  CMS 
 measured in the 60 second range  (approx 25k writes per second per node)
 Note - the observed behavior was the same for all tests except for the single 
 threaded test.  The single threaded test does not appear to show this 
 behavior.
 Tested different GC and Linux OS settings with a focus on the 50 and 200 
 thread loads.  
 JVM settings tested:
 #  default, out of the box, env-sh settings
 #  10 G Max | 1 G New - default env-sh settings
 #  10 G Max | 1 G New - default env-sh settings
 #* JVM_OPTS=$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=50
 #   20 G Max | 10 G New 
JVM_OPTS=$JVM_OPTS -XX:+UseParNewGC
JVM_OPTS=$JVM_OPTS -XX:+UseConcMarkSweepGC
JVM_OPTS=$JVM_OPTS -XX:+CMSParallelRemarkEnabled
JVM_OPTS=$JVM_OPTS -XX:SurvivorRatio=8
JVM_OPTS=$JVM_OPTS -XX:MaxTenuringThreshold=8
JVM_OPTS=$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75
JVM_OPTS=$JVM_OPTS 

[jira] [Commented] (CASSANDRA-8466) Stress support for treating clients as truly independent entities (separate driver instance)

2014-12-12 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14244095#comment-14244095
 ] 

Benedict commented on CASSANDRA-8466:
-

Neither of those settings is honoured for v3 protocols, nor are they honoured 
in the way we would most likely want for v2 protocols.

 Stress support for treating clients as truly independent entities (separate 
 driver instance)
 

 Key: CASSANDRA-8466
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8466
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Benedict
 Fix For: 2.1.3


 For performance testing purposes, it would be helpful to be able to mimic 
 truly independent clients. The easiest way to do this is to use a unique 
 classloader for instantiating the driver for each client, which should be a 
 reasonably straightforward change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

