[jira] [Resolved] (CASSANDRA-7903) tombstone created upon insert of new row
[ https://issues.apache.org/jira/browse/CASSANDRA-7903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict resolved CASSANDRA-7903. - Resolution: Not a Problem

Inserting NULL has the exact same semantics as a delete, and inserts a tombstone.

tombstone created upon insert of new row
Key: CASSANDRA-7903 URL: https://issues.apache.org/jira/browse/CASSANDRA-7903 Project: Cassandra Issue Type: Bug Reporter: Thanh

A tombstone is created upon insert of a new row, depending on how the row is inserted. A simple way to observe this behavior, using cqlsh:

{noformat}
CREATE TABLE users1 ( userid text PRIMARY KEY, first_name text, last_name text);
insert into users1 (userid, first_name) values ('a','a');
tracing on;
select * from users1;
{noformat}

Trace results show 1 live cell and 0 tombstone cells created as a result:

{noformat}
 userid | first_name | last_name
--------+------------+-----------
      a |          a |      null

(1 rows)
…
Read 1 live and 0 tombstoned cells | 00:31:31,487 | 10.240.203.201 | 1275
Scanned 1 rows and matched 1 | 00:31:31,487 | 10.240.203.201 | 1328
…
{noformat}

Now:

{noformat}
insert into users1 (userid, first_name, last_name) values ('b','b',null);
select * from users1;
{noformat}

Trace results show 1 live cell and 1 tombstone cell created as a result:

{noformat}
 userid | first_name | last_name
--------+------------+-----------
      a |          a |      null
      b |          b |      null

(2 rows)
…
Read 1 live and 0 tombstoned cells | 00:35:09,357 | 10.240.203.201 | 1243
Read 1 live and 1 tombstoned cells | 00:35:09,357 | 10.240.203.201 | 1383
Scanned 2 rows and matched 2 | 00:35:09,357 | 10.240.203.201 | 1438
…
{noformat}

A tombstone is not expected to be created in either case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-7907) Determine how many network threads we need for native transport
Benedict created CASSANDRA-7907: --- Summary: Determine how many network threads we need for native transport Key: CASSANDRA-7907 URL: https://issues.apache.org/jira/browse/CASSANDRA-7907 Project: Cassandra Issue Type: Improvement Reporter: Benedict Priority: Minor With the introduction of CASSANDRA-4718, it is highly likely we can cope with just _one_ network IO thread. We could even try pinning it to a single (optionally configurable) core, and (also optionally) pin all other threads to a different core, so that we can guarantee extremely prompt execution (and if pinned to the correct core the OS uses for managing the network, improve throughput further). Testing this out will be challenging, as we need to simulate clients from lots of IPs. However, it is quite likely this would reduce the percentage of time spent in kernel networking calls, and the amount of context switching. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7719) Add PreparedStatements related metrics
[ https://issues.apache.org/jira/browse/CASSANDRA-7719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127935#comment-14127935 ] Benedict commented on CASSANDRA-7719: -

bq. Don't use String.format in logger.trace, use parameterized messages.

More important than this is to guard the logger.trace() call with logger.isTraceEnabled(), so that it is optimised away when not enabled. If String.format buys you more useful formatting it can be justified, but here it looks like the logger can cope with your params just as well.

Add PreparedStatements related metrics -- Key: CASSANDRA-7719 URL: https://issues.apache.org/jira/browse/CASSANDRA-7719 Project: Cassandra Issue Type: New Feature Reporter: Michaël Figuière Assignee: T Jake Luciani Priority: Minor Fix For: 2.1.1 Attachments: 7719.txt

Cassandra newcomers often don't understand that they're expected to use PreparedStatements for almost all of their repetitive queries executed in production. It doesn't look like Cassandra currently exposes any PreparedStatements-related metrics. It would be interesting, and I believe fairly simple, to add several of them to make it possible, in development / management / monitoring tools, to show warnings or alerts related to this bad practice. Thus I would suggest adding the following metrics:
* Executed prepared statements count
* Executed unprepared statements count
* Number of PreparedStatements that have been registered on the node
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7719) Add PreparedStatements related metrics
[ https://issues.apache.org/jira/browse/CASSANDRA-7719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14127999#comment-14127999 ] Benedict commented on CASSANDRA-7719: -

Only after constructing an object array and boxing the parameters. It cannot optimise that away, since it happens prior to the invocation.

Add PreparedStatements related metrics -- Key: CASSANDRA-7719 URL: https://issues.apache.org/jira/browse/CASSANDRA-7719 Project: Cassandra Issue Type: New Feature Reporter: Michaël Figuière Assignee: T Jake Luciani Priority: Minor Fix For: 2.1.1 Attachments: 7719.txt

Cassandra newcomers often don't understand that they're expected to use PreparedStatements for almost all of their repetitive queries executed in production. It doesn't look like Cassandra currently exposes any PreparedStatements-related metrics. It would be interesting, and I believe fairly simple, to add several of them to make it possible, in development / management / monitoring tools, to show warnings or alerts related to this bad practice. Thus I would suggest adding the following metrics:
* Executed prepared statements count
* Executed unprepared statements count
* Number of PreparedStatements that have been registered on the node
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
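The boxing point above can be made concrete with a small sketch. This is not slf4j itself but a hypothetical stand-in logger with the same varargs shape, showing that a parameterized trace(format, args...) call forces the caller to allocate an Object[] and box primitives before the enabled flag can even be checked, whereas an isTraceEnabled() guard skips all of that when tracing is off:

```java
// Stand-in for an slf4j-style logger (hypothetical class, for illustration):
// the Object[] and the boxed Long are created at the CALL SITE, before
// trace() gets a chance to test whether tracing is enabled.
public class TraceGuardDemo {
    static final class StandInLogger {
        private final boolean traceEnabled;
        long callsReachingFormatter = 0;

        StandInLogger(boolean traceEnabled) { this.traceEnabled = traceEnabled; }

        boolean isTraceEnabled() { return traceEnabled; }

        void trace(String format, Object... args) {
            if (!traceEnabled)
                return; // too late: the caller already boxed and allocated
            callsReachingFormatter++;
        }
    }

    public static void main(String[] args) {
        StandInLogger logger = new StandInLogger(false);
        long rowCount = 42;

        // Unguarded: boxes rowCount and builds an Object[] on every call.
        logger.trace("scanned {} rows", rowCount);

        // Guarded: with tracing disabled, no boxing, no array, no call at all,
        // and the JIT can eliminate the branch body entirely.
        if (logger.isTraceEnabled())
            logger.trace("scanned {} rows", rowCount);

        System.out.println("formatter invocations: " + logger.callsReachingFormatter);
    }
}
```

With a real logger the guard pays off the same way: the allocation cost is incurred per call regardless of the log level, which is exactly why the comment says it "cannot optimise that away".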
[jira] [Commented] (CASSANDRA-7907) Determine how many network threads we need for native transport
[ https://issues.apache.org/jira/browse/CASSANDRA-7907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14128651#comment-14128651 ] Benedict commented on CASSANDRA-7907: - The _pinning_ is a secondary concern that I definitely want to leave optional (i.e. implement it, but leave it configurable, and default to off until we collect extensive data on good widely applicable defaults). I don't expect the user to taskset, however; we'd do this for the user, but let them specify the cpu id in the yaml if they think they can do a better job of it. bq. do we have reason to believe that we're bottle-necking on this It's difficult to benchmark networking overheads accurately, but it's a significant portion (perhaps majority) of our cpu time for in-memory workloads. Anything we can do to reduce this we should explore. Determine how many network threads we need for native transport --- Key: CASSANDRA-7907 URL: https://issues.apache.org/jira/browse/CASSANDRA-7907 Project: Cassandra Issue Type: Improvement Reporter: Benedict Priority: Minor With the introduction of CASSANDRA-4718, it is highly likely we can cope with just _one_ network IO thread. We could even try pinning it to a single (optionally configurable) core, and (also optionally) pin all other threads to a different core, so that we can guarantee extremely prompt execution (and if pinned to the correct core the OS uses for managing the network, improve throughput further). Testing this out will be challenging, as we need to simulate clients from lots of IPs. However, it is quite likely this would reduce the percentage of time spent in kernel networking calls, and the amount of context switching. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7907) Determine how many network threads we need for native transport
[ https://issues.apache.org/jira/browse/CASSANDRA-7907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14129314#comment-14129314 ] Benedict commented on CASSANDRA-7907: - bq. I'd want some evidence that pinning to cores is going to give us a measurable benefit before adding it to the code-base bq. _We could even *try* pinning_ Yes, we need to demonstrate an effect. But that is standard practice for performance enhancements :-) We have prior evidence that an effect will be seen, however. Not only from general practice of having done this before in other contexts (including yourself), but [~jasobrown] has done this on Cassandra, I believe as part of his investigations for CASSANDRA-4718, and seen an effect. Determine how many network threads we need for native transport --- Key: CASSANDRA-7907 URL: https://issues.apache.org/jira/browse/CASSANDRA-7907 Project: Cassandra Issue Type: Improvement Reporter: Benedict Priority: Minor With the introduction of CASSANDRA-4718, it is highly likely we can cope with just _one_ network IO thread. We could even try pinning it to a single (optionally configurable) core, and (also optionally) pin all other threads to a different core, so that we can guarantee extremely prompt execution (and if pinned to the correct core the OS uses for managing the network, improve throughput further). Testing this out will be challenging, as we need to simulate clients from lots of IPs. However, it is quite likely this would reduce the percentage of time spent in kernel networking calls, and the amount of context switching. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-7468) Add time-based execution to cassandra-stress
[ https://issues.apache.org/jira/browse/CASSANDRA-7468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7468: Assignee: Benedict (was: Matt Kennedy) Add time-based execution to cassandra-stress Key: CASSANDRA-7468 URL: https://issues.apache.org/jira/browse/CASSANDRA-7468 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Matt Kennedy Assignee: Benedict Priority: Minor Fix For: 2.1.1 Attachments: 7468v2.txt, trunk-7468-rebase.patch, trunk-7468.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7918) Provide graphing tool along with cassandra-stress
[ https://issues.apache.org/jira/browse/CASSANDRA-7918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226766#comment-14226766 ] Benedict commented on CASSANDRA-7918: -

My plane journey was spent manically trying various graphing options to give everything you need to assess a branch in one view, and clearly. I'd hate for that to go to waste. The new patch as it stands only produces the graphs we've always had - I'd like to see cstar and our bundled tool produce _better graphs_. Each of the graphs in the gnuplot output is designed to let you see more information; it's all normalised, coloured and scattered so you can distinguish the results at each moment in time and overall. Too often with the web output I have to simply glance at the average to tell what's going on (or guess-and-peck numbers for zooming in), and have to click on each different stat, which is laborious (and, let's be honest, we don't do it thoroughly, we just peck at a few... or perhaps I'm lazier than everyone else :))

To elaborate on the alternative: there are ten graphs in one view in the gnuplot version, scaled so you can see everything you need to know without clicking once. The left-most graph of each set normalises each moment of each run against the base run, so that variability can be easily broken down across the run. The middle graph plots the raw data so you can get a feel for its shape, and the final graph plots the median, quartiles and deciles. The latencies are all plotted with selected scatters / lines to make it easy to distinguish which p-range we're looking at, even when they cross. GC is also plotted specially as a cumulative run, since this teases out differences much more clearly.

I have nothing against discarding the gnuplot approach, but I'd like to see whatever solution we produce deliver really great graphs that allow us to make decisions more easily and more accurately. 
Right now I'd prefer to put the gnuplot work into cstar rather than the other way around. Though I can tell the hatred for it runs deep!

Provide graphing tool along with cassandra-stress - Key: CASSANDRA-7918 URL: https://issues.apache.org/jira/browse/CASSANDRA-7918 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Benedict Assignee: Ryan McGuire Priority: Minor

Whilst cstar makes some pretty graphs, they're a little limited and also require you to run your tests through it. It would be useful to be able to graph results from any stress run easily. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8061) tmplink files are not removed
[ https://issues.apache.org/jira/browse/CASSANDRA-8061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226972#comment-14226972 ] Benedict commented on CASSANDRA-8061: -

[~JoshuaMcKenzie] nice spot, that's definitely a bug. It would require the partitions to be circa 500K in size, but it couldn't leave a file intact and undeleted; it could only potentially leak a file descriptor. So it's possible it's related to CASSANDRA-8248, but definitely not this. We should probably reopen 8248 and file against that.

tmplink files are not removed - Key: CASSANDRA-8061 URL: https://issues.apache.org/jira/browse/CASSANDRA-8061 Project: Cassandra Issue Type: Bug Components: Core Environment: Linux Reporter: Gianluca Borello Assignee: Joshua McKenzie Priority: Critical Fix For: 2.1.3 Attachments: 8061_v1.txt, 8248-thread_dump.txt

After installing 2.1.0, I'm experiencing a bunch of tmplink files that are filling my disk. I found https://issues.apache.org/jira/browse/CASSANDRA-7803 and that is very similar, and I confirm it happens both on 2.1.0 as well as from the latest commit on the cassandra-2.1 branch (https://github.com/apache/cassandra/commit/aca80da38c3d86a40cc63d9a122f7d45258e4685). Even starting with a clean keyspace, after a few hours I get:

{noformat}
$ sudo find /raid0 | grep tmplink | xargs du -hs
2.7G  /raid0/cassandra/data/draios/protobuf1-ccc6dce04beb11e4abf997b38fbf920b/draios-protobuf1-tmplink-ka-4515-Data.db
13M   /raid0/cassandra/data/draios/protobuf1-ccc6dce04beb11e4abf997b38fbf920b/draios-protobuf1-tmplink-ka-4515-Index.db
1.8G  /raid0/cassandra/data/draios/protobuf_by_agent1-cd071a304beb11e4abf997b38fbf920b/draios-protobuf_by_agent1-tmplink-ka-1788-Data.db
12M   /raid0/cassandra/data/draios/protobuf_by_agent1-cd071a304beb11e4abf997b38fbf920b/draios-protobuf_by_agent1-tmplink-ka-1788-Index.db
5.2M  /raid0/cassandra/data/draios/protobuf_by_agent1-cd071a304beb11e4abf997b38fbf920b/draios-protobuf_by_agent1-tmplink-ka-2678-Index.db
822M  /raid0/cassandra/data/draios/protobuf_by_agent1-cd071a304beb11e4abf997b38fbf920b/draios-protobuf_by_agent1-tmplink-ka-2678-Data.db
7.3M  /raid0/cassandra/data/draios/protobuf_by_agent1-cd071a304beb11e4abf997b38fbf920b/draios-protobuf_by_agent1-tmplink-ka-3283-Index.db
1.2G  /raid0/cassandra/data/draios/protobuf_by_agent1-cd071a304beb11e4abf997b38fbf920b/draios-protobuf_by_agent1-tmplink-ka-3283-Data.db
6.7M  /raid0/cassandra/data/draios/protobuf_by_agent1-cd071a304beb11e4abf997b38fbf920b/draios-protobuf_by_agent1-tmplink-ka-3951-Index.db
1.1G  /raid0/cassandra/data/draios/protobuf_by_agent1-cd071a304beb11e4abf997b38fbf920b/draios-protobuf_by_agent1-tmplink-ka-3951-Data.db
11M   /raid0/cassandra/data/draios/protobuf_by_agent1-cd071a304beb11e4abf997b38fbf920b/draios-protobuf_by_agent1-tmplink-ka-4799-Index.db
1.7G  /raid0/cassandra/data/draios/protobuf_by_agent1-cd071a304beb11e4abf997b38fbf920b/draios-protobuf_by_agent1-tmplink-ka-4799-Data.db
812K  /raid0/cassandra/data/draios/mounted_fs_by_agent1-d7bf3e304beb11e4abf997b38fbf920b/draios-mounted_fs_by_agent1-tmplink-ka-234-Index.db
122M  /raid0/cassandra/data/draios/mounted_fs_by_agent1-d7bf3e304beb11e4abf997b38fbf920b/draios-mounted_fs_by_agent1-tmplink-ka-208-Data.db
744K  /raid0/cassandra/data/draios/mounted_fs_by_agent1-d7bf3e304beb11e4abf997b38fbf920b/draios-mounted_fs_by_agent1-tmplink-ka-739-Index.db
660K  /raid0/cassandra/data/draios/mounted_fs_by_agent1-d7bf3e304beb11e4abf997b38fbf920b/draios-mounted_fs_by_agent1-tmplink-ka-193-Index.db
796K  /raid0/cassandra/data/draios/mounted_fs_by_agent1-d7bf3e304beb11e4abf997b38fbf920b/draios-mounted_fs_by_agent1-tmplink-ka-230-Index.db
137M  /raid0/cassandra/data/draios/mounted_fs_by_agent1-d7bf3e304beb11e4abf997b38fbf920b/draios-mounted_fs_by_agent1-tmplink-ka-230-Data.db
161M  /raid0/cassandra/data/draios/mounted_fs_by_agent1-d7bf3e304beb11e4abf997b38fbf920b/draios-mounted_fs_by_agent1-tmplink-ka-269-Data.db
139M  /raid0/cassandra/data/draios/mounted_fs_by_agent1-d7bf3e304beb11e4abf997b38fbf920b/draios-mounted_fs_by_agent1-tmplink-ka-234-Data.db
940K  /raid0/cassandra/data/draios/mounted_fs_by_agent1-d7bf3e304beb11e4abf997b38fbf920b/draios-mounted_fs_by_agent1-tmplink-ka-786-Index.db
936K  /raid0/cassandra/data/draios/mounted_fs_by_agent1-d7bf3e304beb11e4abf997b38fbf920b/draios-mounted_fs_by_agent1-tmplink-ka-269-Index.db
161M  /raid0/cassandra/data/draios/mounted_fs_by_agent1-d7bf3e304beb11e4abf997b38fbf920b/draios-mounted_fs_by_agent1-tmplink-ka-786-Data.db
672K  /raid0/cassandra/data/draios/mounted_fs_by_agent1-d7bf3e304beb11e4abf997b38fbf920b/draios-mounted_fs_by_agent1-tmplink-ka-197-Index.db
113M
[jira] [Commented] (CASSANDRA-8325) Cassandra 2.1.x fails to start on FreeBSD (JVM crash)
[ https://issues.apache.org/jira/browse/CASSANDRA-8325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226992#comment-14226992 ] Benedict commented on CASSANDRA-8325: -

You might be right. The javadoc does make it quite explicit that this should not be permitted; however, the hotspot code in library_call.cpp (inline_unsafe_access and classify_unsafe_addr) _seems_ to indicate it should be valid and behave the same, though it's hard to say for sure without getting the project working better to explore the code more fully. Still, given that it is documented as not valid usage, it does seem sensible to change it. But this means a potential performance penalty in one of the most heavily used codepaths.

Cassandra 2.1.x fails to start on FreeBSD (JVM crash) - Key: CASSANDRA-8325 URL: https://issues.apache.org/jira/browse/CASSANDRA-8325 Project: Cassandra Issue Type: Bug Environment: FreeBSD 10.0 with openjdk version 1.7.0_71, 64-Bit Server VM Reporter: Leonid Shalupov Attachments: hs_err_pid1856.log, system.log

See attached error file after JVM crash

{quote} FreeBSD xxx.intellij.net 10.0-RELEASE FreeBSD 10.0-RELEASE #0 r260789: Thu Jan 16 22:34:59 UTC 2014 r...@snap.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64 {quote}

{quote} % java -version openjdk version 1.7.0_71 OpenJDK Runtime Environment (build 1.7.0_71-b14) OpenJDK 64-Bit Server VM (build 24.71-b01, mixed mode) {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-8383) Memtable flush may expire records from the commit log that are in a later memtable
Benedict created CASSANDRA-8383: --- Summary: Memtable flush may expire records from the commit log that are in a later memtable Key: CASSANDRA-8383 URL: https://issues.apache.org/jira/browse/CASSANDRA-8383 Project: Cassandra Issue Type: Bug Components: Core Reporter: Benedict Assignee: Benedict Priority: Critical Fix For: 2.1.3

This is a pretty obvious bug given any careful thought, so I'm not sure how I managed to introduce it. We use OpOrder to ensure all writes to a memtable have finished before flushing; however, we also use this OpOrder to direct writes to the correct memtable. This is insufficient, since the OpOrder is only a partial order: an operation from the future (i.e. for the next memtable) could still interleave with the past operations in such a way that it grabs a CL entry in between the past operations. Since we simply take the max ReplayPosition of those in the past, any interleaved future operations would be expired even though they haven't been persisted to disk. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8383) Memtable flush may expire records from the commit log that are in a later memtable
[ https://issues.apache.org/jira/browse/CASSANDRA-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227622#comment-14227622 ] Benedict commented on CASSANDRA-8383: -

Initial patch [here|https://github.com/belliottsmith/cassandra/tree/8383-bug-clexpirereorder]

We should also introduce a commit log correctness stress test, so we can reproduce this, be certain it is fixed, and be sure to avoid this or similar scenarios in future.

Memtable flush may expire records from the commit log that are in a later memtable -- Key: CASSANDRA-8383 URL: https://issues.apache.org/jira/browse/CASSANDRA-8383 Project: Cassandra Issue Type: Bug Components: Core Reporter: Benedict Assignee: Benedict Priority: Critical Labels: commitlog Fix For: 2.1.3

This is a pretty obvious bug given any careful thought, so I'm not sure how I managed to introduce it. We use OpOrder to ensure all writes to a memtable have finished before flushing; however, we also use this OpOrder to direct writes to the correct memtable. This is insufficient, since the OpOrder is only a partial order: an operation from the future (i.e. for the next memtable) could still interleave with the past operations in such a way that it grabs a CL entry in between the past operations. Since we simply take the max ReplayPosition of those in the past, any interleaved future operations would be expired even though they haven't been persisted to disk. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
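The interleaving the ticket describes can be laid out step by step. This is a hedged illustration, not Cassandra code: commit-log positions are plain longs standing in for ReplayPosition, and the "writers" run sequentially to make the partial-order race explicit.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustration of the CASSANDRA-8383 race (hypothetical names, not the
// actual ReplayPosition/OpOrder classes).
public class ReplayPositionRace {
    public static void main(String[] args) {
        AtomicLong commitLog = new AtomicLong();

        // Writer A is directed to the OLD memtable and grabs a CL position.
        long a = commitLog.incrementAndGet(); // position 1

        // Writer B starts after the memtable switch, so it targets the NEW
        // memtable - but the OpOrder only partially orders it, and it can
        // still grab a CL position in between the old memtable's writers.
        long b = commitLog.incrementAndGet(); // position 2

        // Writer C also targets the OLD memtable.
        long c = commitLog.incrementAndGet(); // position 3

        // Flushing the old memtable discards CL records up to the max
        // position observed among ITS writers: max(a, c) = 3.
        long discardUpTo = Math.max(a, c);

        // B's record (position 2) falls inside the discarded range, even
        // though its data lives only in the new, unflushed memtable - so a
        // crash before the next flush would lose it.
        System.out.println("discardUpTo=" + discardUpTo
                + " b=" + b + " bExpiredPrematurely=" + (b <= discardUpTo));
        // -> discardUpTo=3 b=2 bExpiredPrematurely=true
    }
}
```

The fix sketched in the linked branch has to ensure no future-memtable write can hold a commit-log position below the discard point; the stress test the comment proposes would hunt for exactly this interleaving.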
[jira] [Commented] (CASSANDRA-8192) AssertionError in Memory.java
[ https://issues.apache.org/jira/browse/CASSANDRA-8192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227821#comment-14227821 ] Benedict commented on CASSANDRA-8192: -

If it's the data, it's likely a pretty simple issue of corruption and suboptimal error checking. A compression metadata file is probably zero bytes long, so when we allocate memory to store it, we don't allocate any memory for it. Exactly how it ended up empty is another matter, and is possibly a bug. Try running {code}find . -iname '*CompressionInfo.db' -size 0{code} in your data directory.

AssertionError in Memory.java - Key: CASSANDRA-8192 URL: https://issues.apache.org/jira/browse/CASSANDRA-8192 Project: Cassandra Issue Type: Bug Components: Core Environment: Windows-7-32 bit, 3GB RAM, Java 1.7.0_67 Reporter: Andreas Schnitzerling Assignee: Joshua McKenzie Fix For: 2.1.3 Attachments: cassandra.bat, cassandra.yaml, logdata-onlinedata-ka-196504-CompressionInfo.zip, printChunkOffsetErrors.txt, system-compactions_in_progress-ka-47594-CompressionInfo.zip, system-sstable_activity-jb-25-Filter.zip, system.log, system_AssertionTest.log

Since the update of 1 of 12 nodes from 2.1.0-rel to 2.1.1-rel, an exception occurs during start up.
{panel:title=system.log}
ERROR [SSTableBatchOpen:1] 2014-10-27 09:44:00,079 CassandraDaemon.java:153 - Exception in thread Thread[SSTableBatchOpen:1,5,main]
java.lang.AssertionError: null
	at org.apache.cassandra.io.util.Memory.size(Memory.java:307) ~[apache-cassandra-2.1.1.jar:2.1.1]
	at org.apache.cassandra.io.compress.CompressionMetadata.init(CompressionMetadata.java:135) ~[apache-cassandra-2.1.1.jar:2.1.1]
	at org.apache.cassandra.io.compress.CompressionMetadata.create(CompressionMetadata.java:83) ~[apache-cassandra-2.1.1.jar:2.1.1]
	at org.apache.cassandra.io.util.CompressedSegmentedFile$Builder.metadata(CompressedSegmentedFile.java:50) ~[apache-cassandra-2.1.1.jar:2.1.1]
	at org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Builder.complete(CompressedPoolingSegmentedFile.java:48) ~[apache-cassandra-2.1.1.jar:2.1.1]
	at org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:766) ~[apache-cassandra-2.1.1.jar:2.1.1]
	at org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:725) ~[apache-cassandra-2.1.1.jar:2.1.1]
	at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:402) ~[apache-cassandra-2.1.1.jar:2.1.1]
	at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:302) ~[apache-cassandra-2.1.1.jar:2.1.1]
	at org.apache.cassandra.io.sstable.SSTableReader$4.run(SSTableReader.java:438) ~[apache-cassandra-2.1.1.jar:2.1.1]
	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[na:1.7.0_55]
	at java.util.concurrent.FutureTask.run(Unknown Source) ~[na:1.7.0_55]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [na:1.7.0_55]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [na:1.7.0_55]
	at java.lang.Thread.run(Unknown Source) [na:1.7.0_55]
{panel}

In the attached log you can also still see CASSANDRA-8069 and CASSANDRA-6283. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (CASSANDRA-8388) java.lang.AssertionError: null
[ https://issues.apache.org/jira/browse/CASSANDRA-8388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict resolved CASSANDRA-8388. - Resolution: Duplicate

java.lang.AssertionError: null --- Key: CASSANDRA-8388 URL: https://issues.apache.org/jira/browse/CASSANDRA-8388 Project: Cassandra Issue Type: Bug Reporter: Ilya Komolkin

{noformat}
21:00:10.156 [SSTableBatchOpen:5] ERROR o.a.c.service.CassandraDaemon - Exception in thread Thread[SSTableBatchOpen:5,5,main]
java.lang.AssertionError: null
	at org.apache.cassandra.io.util.Memory.size(Memory.java:307) ~[cassandra-all-2.1.1.jar:2.1.1]
	at org.apache.cassandra.io.compress.CompressionMetadata.init(CompressionMetadata.java:135) ~[cassandra-all-2.1.1.jar:2.1.1]
	at org.apache.cassandra.io.compress.CompressionMetadata.create(CompressionMetadata.java:83) ~[cassandra-all-2.1.1.jar:2.1.1]
	at org.apache.cassandra.io.util.CompressedSegmentedFile$Builder.metadata(CompressedSegmentedFile.java:50) ~[cassandra-all-2.1.1.jar:2.1.1]
	at org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Builder.complete(CompressedPoolingSegmentedFile.java:48) ~[cassandra-all-2.1.1.jar:2.1.1]
	at org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:766) ~[cassandra-all-2.1.1.jar:2.1.1]
	at org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:725) ~[cassandra-all-2.1.1.jar:2.1.1]
	at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:402) ~[cassandra-all-2.1.1.jar:2.1.1]
	at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:302) ~[cassandra-all-2.1.1.jar:2.1.1]
	at org.apache.cassandra.io.sstable.SSTableReader$4.run(SSTableReader.java:438) ~[cassandra-all-2.1.1.jar:2.1.1]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_71]
	at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_71]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_71]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_71]
	at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71]
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8389) org.apache.cassandra.io.sstable.CorruptSSTableException: org.apache.cassandra.io.compress.CorruptBlockException
[ https://issues.apache.org/jira/browse/CASSANDRA-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228271#comment-14228271 ] Benedict commented on CASSANDRA-8389: - It's not at all clear that this is a bug. Although it is possible, it seems likely the data is genuinely corrupted. What makes you suspect a bug, rather than corruption from hardware failure? org.apache.cassandra.io.sstable.CorruptSSTableException: org.apache.cassandra.io.compress.CorruptBlockException --- Key: CASSANDRA-8389 URL: https://issues.apache.org/jira/browse/CASSANDRA-8389 Project: Cassandra Issue Type: Bug Reporter: Ilya Komolkin 21:43:50.835 [CompactionExecutor:11] ERROR o.a.c.service.CassandraDaemon - Exception in thread Thread[CompactionExecutor:11,1,main] org.apache.cassandra.io.sstable.CorruptSSTableException: org.apache.cassandra.io.compress.CorruptBlockException: (E:\Upsource_12391\data\cassandra\data\kernel\content-a61f1280764611e48c8e4915424c75fe\kernel-content-ka-142-Data.db): corruption detected, chunk at 17288734 of length 65502. 
	at org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBuffer(CompressedRandomAccessReader.java:92) ~[cassandra-all-2.1.1.jar:2.1.1]
	at org.apache.cassandra.io.compress.CompressedThrottledReader.reBuffer(CompressedThrottledReader.java:41) ~[cassandra-all-2.1.1.jar:2.1.1]
	at org.apache.cassandra.io.util.RandomAccessReader.read(RandomAccessReader.java:326) ~[cassandra-all-2.1.1.jar:2.1.1]
	at java.io.RandomAccessFile.readFully(RandomAccessFile.java:444) ~[na:1.7.0_71]
	at java.io.RandomAccessFile.readFully(RandomAccessFile.java:424) ~[na:1.7.0_71]
	at org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:351) ~[cassandra-all-2.1.1.jar:2.1.1]
	at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:348) ~[cassandra-all-2.1.1.jar:2.1.1]
	at org.apache.cassandra.utils.ByteBufferUtil.readWithLength(ByteBufferUtil.java:311) ~[cassandra-all-2.1.1.jar:2.1.1]
	at org.apache.cassandra.db.ColumnSerializer.deserializeColumnBody(ColumnSerializer.java:132) ~[cassandra-all-2.1.1.jar:2.1.1]
	at org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(OnDiskAtom.java:86) ~[cassandra-all-2.1.1.jar:2.1.1]
	at org.apache.cassandra.db.AbstractCell$1.computeNext(AbstractCell.java:52) ~[cassandra-all-2.1.1.jar:2.1.1]
	at org.apache.cassandra.db.AbstractCell$1.computeNext(AbstractCell.java:46) ~[cassandra-all-2.1.1.jar:2.1.1]
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) ~[guava-16.0.jar:na]
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) ~[guava-16.0.jar:na]
	at org.apache.cassandra.io.sstable.SSTableIdentityIterator.hasNext(SSTableIdentityIterator.java:116) ~[cassandra-all-2.1.1.jar:2.1.1]
	at org.apache.cassandra.utils.MergeIterator$OneToOne.computeNext(MergeIterator.java:202) ~[cassandra-all-2.1.1.jar:2.1.1]
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) ~[guava-16.0.jar:na]
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) ~[guava-16.0.jar:na]
	at com.google.common.collect.Iterators$7.computeNext(Iterators.java:645) ~[guava-16.0.jar:na]
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) ~[guava-16.0.jar:na]
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) ~[guava-16.0.jar:na]
	at org.apache.cassandra.db.ColumnIndex$Builder.buildForCompaction(ColumnIndex.java:165) ~[cassandra-all-2.1.1.jar:2.1.1]
	at org.apache.cassandra.db.compaction.LazilyCompactedRow.write(LazilyCompactedRow.java:110) ~[cassandra-all-2.1.1.jar:2.1.1]
	at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:200) ~[cassandra-all-2.1.1.jar:2.1.1]
	at org.apache.cassandra.io.sstable.SSTableRewriter.append(SSTableRewriter.java:110) ~[cassandra-all-2.1.1.jar:2.1.1]
	at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:183) ~[cassandra-all-2.1.1.jar:2.1.1]
	at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) ~[cassandra-all-2.1.1.jar:2.1.1]
	at
[jira] [Commented] (CASSANDRA-7039) DirectByteBuffer compatible LZ4 methods
[ https://issues.apache.org/jira/browse/CASSANDRA-7039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228324#comment-14228324 ] Benedict commented on CASSANDRA-7039: - Is there much point upgrading without making use of the new API? DirectByteBuffer compatible LZ4 methods --- Key: CASSANDRA-7039 URL: https://issues.apache.org/jira/browse/CASSANDRA-7039 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Branimir Lambov Priority: Minor Labels: performance Fix For: 3.0 Attachments: 7039.patch As we move more things off-heap, it's becoming more and more essential to be able to use DirectByteBuffer (or native pointers) in various places. Unfortunately LZ4 doesn't currently support this operation, despite being JNI based - this means we both have to perform unnecessary copies to de/compress data from DBB, but also we can stall GC as any JNI method operating over a java array using the GetPrimitiveArrayCritical enters a critical section that prevents GC for its duration. This means STWs will be at least as long any running compression/decompression (and no GC will happen until they complete, so it's additive). We should temporarily fork (and then resubmit upstream) jpountz-lz4 to support operating over a native pointer, so that we can pass a DBB or a raw pointer we have allocated ourselves. This will help improve performance when flushing the new offheap memtables, as well as enable us to implement CASSANDRA-6726 and finish CASSANDRA-4338. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228397#comment-14228397 ] Benedict commented on CASSANDRA-7438: - I suspect segmenting the table at a finer granularity, so that each segment is maintained with mutual exclusivity, would achieve better percentiles in both cases due to keeping the maximum resize cost down. We could settle for a separate LRU-q per segment, even, to keep the complexity of this code down significantly - it is unlikely that having a global LRU-q is significantly more accurate at predicting reuse than ~128 of them. It would also make it much easier to improve the replacement strategy beyond LRU, which would likely yield a bigger win for performance than any potential loss from reduced concurrency. The critical section for reads could be kept sufficiently small, by performing the deserialization outside of it, that contention would be very unlikely with the current state of C*. There's a good chance this would yield a net positive performance impact, by reducing the cost per access without measurably increasing the cost due to contention (because contention would be infrequent).

Serializing Row cache alternative (Fully off heap)
--
Key: CASSANDRA-7438 URL: https://issues.apache.org/jira/browse/CASSANDRA-7438 Project: Cassandra Issue Type: Improvement Components: Core Environment: Linux Reporter: Vijay Assignee: Vijay Labels: performance Fix For: 3.0 Attachments: 0001-CASSANDRA-7438.patch, tests.zip
Currently SerializingCache is only partially off heap; keys are still stored in the JVM heap as ByteBuffers:
* There are higher GC costs for a reasonably big cache.
* Some users have used the row cache efficiently in production for better results, but this requires careful tuning.
* Memory overhead for the cache entries is relatively high.
So the proposal for this ticket is to move the LRU cache logic completely off heap and use JNI to interact with the cache. We might want to ensure that the new implementation matches the existing API (ICache), and the implementation needs to have safe memory access, low memory overhead, and as few memcpys as possible. We might also want to make this cache configurable.
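For illustration, the segmented design described in the comment - an array of independently locked LRU maps, a la ConcurrentHashMap's old segment scheme - can be sketched on-heap in a few lines. This is a hypothetical toy, not the off-heap ICache implementation under discussion; the names and the 16-segment count are assumptions (the thread mentions ~128).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy segmented LRU cache: each segment is a lock-protected LRU map, so the
// critical section per access is tiny and a resize only ever touches one
// segment's worth of entries. Illustrative names, not Cassandra's ICache.
public final class SegmentedLruCache<K, V> {
    private static final int SEGMENTS = 16; // the thread suggests ~128; 16 keeps the demo small
    private final LinkedHashMap<K, V>[] segments;

    @SuppressWarnings("unchecked")
    public SegmentedLruCache(final int capacityPerSegment) {
        segments = (LinkedHashMap<K, V>[]) new LinkedHashMap[SEGMENTS];
        for (int i = 0; i < SEGMENTS; i++) {
            // accessOrder=true makes iteration order LRU; removeEldestEntry evicts
            segments[i] = new LinkedHashMap<K, V>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                    return size() > capacityPerSegment;
                }
            };
        }
    }

    // Spread the hash, then use the *high* bits to pick a segment, leaving the
    // low bits for the per-segment table - the bit-entropy point in the comment.
    private LinkedHashMap<K, V> segmentFor(Object key) {
        int h = key.hashCode();
        h ^= (h >>> 16);
        return segments[(h >>> 28) & (SEGMENTS - 1)];
    }

    public V get(K key) {
        LinkedHashMap<K, V> seg = segmentFor(key);
        synchronized (seg) { return seg.get(key); } // short critical section
    }

    public void put(K key, V value) {
        LinkedHashMap<K, V> seg = segmentFor(key);
        synchronized (seg) { seg.put(key, value); }
    }
}
```

Deserialization of the cached value would happen outside the synchronized block, keeping contention negligible.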
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228397#comment-14228397 ] Benedict edited comment on CASSANDRA-7438 at 11/28/14 5:06 PM: --- I suspect segmenting the table at a coarser granularity, so that each segment is maintained with mutual exclusivity, would achieve better percentiles in both cases due to keeping the maximum resize cost down. We could settle for a separate LRU-q per segment, even, to keep the complexity of this code down significantly - it is unlikely having a global LRU-q is significantly more accurate at predicting reuse than ~128 of them. It would also make it much easier to improve the replacement strategy beyond LRU, which would likely yield a bigger win for performance than any potential loss from reduced concurrency. The critical section for reads could be kept sufficiently small that competition would be very unlikely with the current state of C*, by performing the deserialization outside of it. There's a good chance this would yield a net positive performance impact, by reducing the cost per access without increasing the cost due to contention measurably (because contention would be infrequent). edit: coarser, not finer. i.e., a la j.u.c.CHM
[jira] [Commented] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228563#comment-14228563 ] Benedict commented on CASSANDRA-7438: - [~aweisberg]: In my experience segments tend to be imperfectly distributed, so whilst there is bunching of resizes simply because they take so long, with real work going on at the same time they should be a _little_ spread out. Though with murmur3 the distribution may be significantly more uniform than in my prior experiments. Either way, they're performed in parallel (without coordination) if they coincide, so it's still an improvement.
[~vijay2...@yahoo.com]: When I talk about complexity, I mean the difficulties of concurrent programming magnified without the normal tools. For instance, there are the following concerns:
* We have a spin-lock - admittedly one that should _generally_ be uncontended, but on a grow or a small map this is certainly not the case, which could result in really problematic behaviour. Pure spin locks should not be used outside of the kernel.
* The queue is maintained by a separate thread that requires signalling if it isn't currently performing work - and in a real C* instance, where the cost of linking the queue item is a fraction of the other work done to service a request, that means we are likely to incur a costly unpark() for a majority of operations.
* Reads can interleave with put/replace/remove and abort the removal of an item from the queue, resulting in a memory leak.
* We perform the grow on a separate thread, but prevent all reader _or_ writer threads from making progress by taking the locks for all buckets immediately.
* Freeing of oldSegments is still dangerous; it's just probabilistically less likely to happen.
* During a grow, we can lose puts because we unlock the old segments, so with the right (again, unlikely) interleaving of events a writer can think the old table is still valid.
* When growing, we only double the size of the backing table; however, since grows happen in the background, the updater can get ahead, meaning we remain behind and multiply the constant-factor overheads, collisions and contention until total size tails off.
These are only the obvious problems that spring to mind from 15m perusing the code; I'm sure there are others. This kind of stuff is really hard, and the approach I'm suggesting is comparatively a doddle to get right, and is likely faster to boot. I'm not sure I understand your concern with segmentation creating complexity with the hashing... I'm proposing the exact method used by CHM. We have an excellent hash algorithm to distribute the data over the segments: murmurhash3. Although we need to be careful not to use the bits that don't have the correct entropy for selecting a segment. It's really no more than a two-tier hash table. The user doesn't need to know anything about this.
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228563#comment-14228563 ] Benedict edited comment on CASSANDRA-7438 at 11/29/14 12:23 AM: [~aweisberg]: In my experience segments tend to be imperfectly distributed, so whilst there is bunching of resizes simply because they take so long, with real work going on at the same time they should be a _little_ spread out. Though with murmur3 the distribution may be significantly more uniform than in my prior experiments. Either way, they're performed in parallel (without coordination) if they coincide, and are each a fraction of the size, so it's still an improvement.
[~vijay2...@yahoo.com]: When I talk about complexity, I mean the difficulties of concurrent programming magnified without the normal tools. For instance, there are the following concerns:
* We have a spin-lock - admittedly one that should _generally_ be uncontended, but on a grow or a small map this is certainly not the case, which could result in really problematic behaviour. Pure spin locks should not be used outside of the kernel.
* The queue is maintained by a separate thread that requires signalling if it isn't currently performing work - and in a real C* instance, where the cost of linking the queue item is a fraction of the other work done to service a request, that means we are likely to incur a costly unpark() for a majority of operations.
* Reads can interleave with put/replace/remove and abort the removal of an item from the queue, resulting in a memory leak.
* We perform the grow on a separate thread, but prevent all reader _or_ writer threads from making progress by taking the locks for all buckets immediately.
* Freeing of oldSegments is still dangerous; it's just probabilistically less likely to happen.
* During a grow, we can lose puts because we unlock the old segments, so with the right (again, unlikely) interleaving of events a writer can think the old table is still valid.
* When growing, we only double the size of the backing table; however, since grows happen in the background, the updater can get ahead, meaning we remain behind and multiply the constant-factor overheads, collisions and contention until total size tails off.
These are only the obvious problems that spring to mind from 15m perusing the code; I'm sure there are others. This kind of stuff is really hard, and the approach I'm suggesting is comparatively a doddle to get right, and is likely faster to boot. I'm not sure I understand your concern with segmentation creating complexity with the hashing... I'm proposing the exact method used by CHM. We have an excellent hash algorithm to distribute the data over the segments: murmurhash3. Although we need to be careful not to use the bits that don't have the correct entropy for selecting a segment. Think of it as simply implementing an off-heap LinkedHashMap, wrapping it in a lock, and having an array of them. The user doesn't need to know anything about this.
[jira] [Commented] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228575#comment-14228575 ] Benedict commented on CASSANDRA-7438: - bq. I am 100% sure
Never be 100% sure with concurrency, please :)
bq. test case again plz. I don't think this can happen too. I spend a lot of time testing the exact scenario.
You have too much faith in tests. You are testing under ideal conditions - two of the race conditions I highlighted will only rear their heads infrequently, most likely when the system is under uncharacteristic load causing very choppy scheduling. Analysis of the code is paramount. I will not produce a test case as I do not have time; however, I will give you an interleaving of events that would trigger one of them. Thread A is deleting an item, and is in LRUC.invalidate(); Thread B is looking up the same item, in LRUC.get().
A: 187: map.remove()
B: 154: map.get()
A: 191: queue.deleteFromQueue()
B: 158: queue.addToQueue()
In particular, addToQueue() sets the markAsDeleted flag to false, undoing the prior work of deleteFromQueue().
bq. Thread is only signalled if they are not performing operation. I am lost.
It will generally not be performing an operation, because its work will be faster than any of the producers can produce work in normal C* operation.
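Assuming methods with the semantics described in the comment - addToQueue() clears a markAsDeleted flag that deleteFromQueue() sets - the interleaving above can be replayed deterministically in a single thread to show the leak. All names here are hypothetical stand-ins for the patch's actual code.

```java
// Single-threaded walk-through of the interleaving described in the comment.
// Node, deleteFromQueue and addToQueue mirror the discussion but are
// hypothetical, not the actual patch code.
public final class RaceWalkthrough {
    static final class Node {
        boolean inQueue;
        boolean markAsDeleted;
    }

    // invalidate() path: mark the node so the queue thread will reclaim it
    static void deleteFromQueue(Node n) { n.markAsDeleted = true; }

    // get() path: (re)link the node and clear any pending deletion mark
    static void addToQueue(Node n) { n.inQueue = true; n.markAsDeleted = false; }

    public static void main(String[] args) {
        Node n = new Node();
        // B: 154: map.get() returns the node
        // A: 187: map.remove() unlinks it from the map
        // A: 191: queue.deleteFromQueue() marks it for reclamation
        deleteFromQueue(n);
        // B: 158: queue.addToQueue() re-links it, clearing the mark
        addToQueue(n);
        // The node is gone from the map but sits unmarked in the queue:
        // it will never be reclaimed - a memory leak.
        System.out.println("leaked = " + (n.inQueue && !n.markAsDeleted)); // prints: leaked = true
    }
}
```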
[jira] [Commented] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228693#comment-14228693 ] Benedict commented on CASSANDRA-7438: - Invert those two statements and the behaviour is still broken.
B: 154: map.get()
A: 187: map.remove()
A: 191: queue.deleteFromQueue()
B: 158: queue.addToQueue()
[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228693#comment-14228693 ] Benedict edited comment on CASSANDRA-7438 at 11/29/14 9:40 AM: --- Good point! But invert those two statements and the behaviour is still broken.
B: 154: map.get()
A: 187: map.remove()
A: 191: queue.deleteFromQueue()
B: 158: queue.addToQueue()
[jira] [Commented] (CASSANDRA-7203) Flush (and Compact) High Traffic Partitions Separately
[ https://issues.apache.org/jira/browse/CASSANDRA-7203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228729#comment-14228729 ] Benedict commented on CASSANDRA-7203: - [~jbellis]: Are we sure that's a good policy? It's generally accepted that a lot of workloads (esp. those involving people, e.g. Netflix, Apple) follow a zipfian/extreme distribution. If we can prevent the most voluminous customers from degrading performance for everybody, that's surely a pretty big win? I'm not suggesting this be attacked immediately, but in the medium-to-long term it seems like a pretty decent yield - and could be applied on both read and write. If you have 1% of your data appearing in ~100% of sstables, but the other 99% appearing in only ~1% of your sstables, you're compacting an order of magnitude more often than you might otherwise need to. Perhaps [~jasobrown] and [~kohlisankalp] have an idea of how realistic this scenario is?

Flush (and Compact) High Traffic Partitions Separately
--
Key: CASSANDRA-7203 URL: https://issues.apache.org/jira/browse/CASSANDRA-7203 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Labels: compaction, performance
An idea possibly worth exploring is the use of streaming count-min sketches to collect data over the uptime of a server to estimate the velocity of different partitions, so that high-volume partitions can be flushed separately, on the assumption that they will be much smaller in number, thus reducing write amplification by permitting compaction independently of any low-velocity data. Whilst the idea is reasonably straightforward, it seems that the biggest problem here will be defining any success metric. Obviously any workload following an exponential/zipf/extreme distribution is likely to benefit from such an approach, but whether or not that would translate in real terms is another matter.
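A streaming count-min sketch of the kind the ticket mentions fits in a few lines. This is a generic textbook implementation for illustration, not Cassandra code; the depth/width parameters and seeding are arbitrary.

```java
import java.util.Random;

// Textbook count-min sketch: fixed-size 2D counter array indexed by `depth`
// independent hash functions. estimate() never underestimates, because
// collisions can only inflate a row's counter; the minimum across rows
// bounds the error. Illustrative only - not Cassandra's implementation.
public final class CountMinSketch {
    private final int depth, width;
    private final long[][] counts;
    private final int[] seeds;

    public CountMinSketch(int depth, int width) {
        this.depth = depth;
        this.width = width;
        this.counts = new long[depth][width];
        this.seeds = new int[depth];
        Random r = new Random(42); // fixed seed for reproducibility
        for (int i = 0; i < depth; i++) seeds[i] = r.nextInt();
    }

    private int bucket(Object key, int row) {
        int h = key.hashCode() ^ seeds[row];
        h ^= (h >>> 16);
        return (h & 0x7fffffff) % width;
    }

    public void add(Object key, long count) {
        for (int i = 0; i < depth; i++) counts[i][bucket(key, i)] += count;
    }

    public long estimate(Object key) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < depth; i++) min = Math.min(min, counts[i][bucket(key, i)]);
        return min;
    }
}
```

Fed with per-partition write counts, a sketch like this lets the flush path ask "is this partition high-velocity?" in constant space, which is the property the ticket relies on.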
[jira] [Commented] (CASSANDRA-6976) Determining replicas to query is very slow with large numbers of nodes or vnodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228737#comment-14228737 ] Benedict commented on CASSANDRA-6976: - [~jbellis] [~aweisberg] I have a few remaining concerns, although I agree this isn't _super_ pressing:
* the benchmark as tested will have perfect L1 cache occupancy, which in a real scenario is unlikely
* the benchmarks did not account for (all of which should have a negative impact on the runtime of getRangeSlice itself):
** running with the dynamic snitch (that is being updated simultaneously)
** running with the network topology snitch underneath the dynamic snitch, and/or by itself
** running with, say, 3+ DCs, RF=3
The benchmark looks like it ran with SimpleSnitch, RF=1, 1 DC - i.e., ideal conditions. This likely won't make an order of magnitude difference, but I guess the question is whether we care about being tremendously slow for full table scans of _small_ tables. Programmatically fetching the entire contents of a lookup table, for instance, would be badly affected by this behaviour even without the changes I propose to the methodology.

Determining replicas to query is very slow with large numbers of nodes or vnodes
Key: CASSANDRA-6976 URL: https://issues.apache.org/jira/browse/CASSANDRA-6976 Project: Cassandra Issue Type: Bug Components: Core Reporter: Benedict Assignee: Ariel Weisberg Labels: performance Attachments: GetRestrictedRanges.java, jmh_output.txt, jmh_output_murmur3.txt, make_jmh_work.patch
As described in CASSANDRA-6906, this can be ~100ms for a relatively small cluster with vnodes, which is longer than it will spend in transit on the network. This should be much faster.
[jira] [Commented] (CASSANDRA-6976) Determining replicas to query is very slow with large numbers of nodes or vnodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14229269#comment-14229269 ] Benedict commented on CASSANDRA-6976: - On second thoughts, ignore that sentiment entirely. We don't really have any concept of a lookup table, and we'll have to address that directly when we introduce enum types, which is a better place. I guess what really bugs me about this, and what I assumed would be related to the problem (but patently can't be, given the default behaviour), is that after calculating natural endpoints, we then sort them (based on a couple of hashmap lookups for each endpoint) for every token range, and also for every single normal query. This sort is performed over RF*DC items in either case, even for queries routed directly to the owning node with CL ONE. I was hoping we'd fix that as a result of this work, since that's a lot of duplicated effort, but that hardly seems sensible now. What we definitely _should_ do, though, is make sure we're (in general) benchmarking behaviour over common config, as our default test configuration is not at all representative.
[jira] [Commented] (CASSANDRA-8341) Expose time spent in each thread pool
[ https://issues.apache.org/jira/browse/CASSANDRA-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14229780#comment-14229780 ] Benedict commented on CASSANDRA-8341: - SEPWorker already grabs the nanoTime on exiting and entering its spin phase, so tracking this would be pretty much free (we'd need to check it once if we swapped the executor we're working on without entering a spinning state). Flushing pent-up data is pretty trivial: you can set a max time to buffer, so it ensures the counter is never more than a few seconds (or millis) out of date, say - enough to keep the cost too small to measure. I'm a little dubious about tracking two completely different properties as the same thing, though. CPU time cannot be composed with nanoTime sensibly, so we either want to track one or the other across all executors. Since the other executors are all the ones that do infrequent, expensive work (which is explicitly why they haven't been transitioned to SEP), tracking nanoTime on them won't be an appreciable cost.

Expose time spent in each thread pool
-
Key: CASSANDRA-8341 URL: https://issues.apache.org/jira/browse/CASSANDRA-8341 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Chris Lohfink Priority: Minor Labels: metrics Attachments: 8341.patch, 8341v2.txt
Can increment a counter with time spent in each queue. This can provide context on how much time is spent percentage-wise in each stage. Additionally it can be used with Little's law in future if we ever want to try to tune the size of the pools.
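The "max time to buffer" idea can be sketched as a counter that accumulates elapsed nanos locally and publishes to the shared metric only on a bounded-staleness schedule. The class and method names below are hypothetical, not SEPWorker's; timestamps are passed in so the logic stays deterministic.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of bounded-staleness flushing: the owning worker accumulates
// elapsed time privately and touches the shared (contended) counter only
// when the buffer is older than maxBufferNanos. Hypothetical names.
public final class BufferedTimeCounter {
    private final AtomicLong published = new AtomicLong();
    private final long maxBufferNanos;
    private long buffered;  // accumulated but not yet published (owner thread only)
    private long lastFlush; // when we last published

    public BufferedTimeCounter(long maxBufferNanos, long nowNanos) {
        this.maxBufferNanos = maxBufferNanos;
        this.lastFlush = nowNanos;
    }

    // Called by the single owning worker; in real use nowNanos would be the
    // nanoTime the worker already grabs around its spin phase.
    public void record(long elapsedNanos, long nowNanos) {
        buffered += elapsedNanos;
        if (nowNanos - lastFlush >= maxBufferNanos) {
            published.addAndGet(buffered); // the only cross-thread operation
            buffered = 0;
            lastFlush = nowNanos;
        }
    }

    public long publishedNanos() { return published.get(); }
}
```

The published value is never more than maxBufferNanos stale, which is exactly the "few seconds (or millis) out of date" bound described above.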
[jira] [Commented] (CASSANDRA-8341) Expose time spent in each thread pool
[ https://issues.apache.org/jira/browse/CASSANDRA-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14229805#comment-14229805 ] Benedict commented on CASSANDRA-8341: - Ah, that's a good question: are we talking about queue latency or time spent processing each queue? The two are very different, and it sounded like we were discussing the latter, but the ticket description does sound more like the former.
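The two metrics being distinguished can be separated with three timestamps per task - enqueue, start, end. A hypothetical wrapper for illustration, not Cassandra's instrumentation:

```java
import java.util.concurrent.atomic.AtomicLong;

// Distinguishes the two readings of "time spent in each queue":
//   queue latency    = enqueue -> start  (waiting in the queue)
//   processing time  = start   -> end    (executing the task)
// Hypothetical wrapper, not Cassandra's code.
public final class TimedTask implements Runnable {
    static final AtomicLong queueNanos = new AtomicLong();
    static final AtomicLong processNanos = new AtomicLong();

    private final Runnable task;
    private final long enqueuedAt = System.nanoTime(); // stamped at submission

    public TimedTask(Runnable task) { this.task = task; }

    @Override
    public void run() {
        long start = System.nanoTime();
        queueNanos.addAndGet(start - enqueuedAt); // time waiting in the queue
        try {
            task.run();
        } finally {
            processNanos.addAndGet(System.nanoTime() - start); // time doing work
        }
    }
}
```

Wrapping every submission in a stage's executor with TimedTask accumulates both counters; the ticket description's wording matches queueNanos, the thread's earlier discussion matches processNanos.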
[jira] [Commented] (CASSANDRA-7688) Add data sizing to a system table
[ https://issues.apache.org/jira/browse/CASSANDRA-7688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14229818#comment-14229818 ] Benedict commented on CASSANDRA-7688: - This is a fundamentally difficult problem, and answering it accurately basically requires a full compaction. We can track or estimate this data for any given sstable easily, and we can estimate the number of overlapping partitions between two sstables (though I'm unsure of the accuracy if we composed this data across many sstables), but we cannot say how many rows within each overlapping partition overlap. The best we could do is probably sample some overlapping partitions to see what proportion of row overlap tends to prevail, and hope it is representative; if we assume a normal distribution of overlap ratio, we could return error bounds. I don't think it's likely this data could be maintained live, at least not accurately, or not without significant cost. It would be an on-demand calculation that would be moderately expensive.

Add data sizing to a system table
-
Key: CASSANDRA-7688 URL: https://issues.apache.org/jira/browse/CASSANDRA-7688 Project: Cassandra Issue Type: New Feature Reporter: Jeremiah Jordan Fix For: 2.1.3
Currently you can't implement something similar to describe_splits_ex purely from a native protocol driver. https://datastax-oss.atlassian.net/browse/JAVA-312 is open to expose easily getting ownership information to a client in the java-driver. But you still need the data sizing part to get splits of a given size. We should add the sizing information to a system table so that native clients can get to it.
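The sampling-with-error-bounds idea reads like this in code: estimate the mean row-overlap ratio from a sample of overlapping partitions and report bounds under the normal approximation the comment assumes. A hedged sketch with hypothetical names, not a proposed Cassandra API.

```java
// Given overlap ratios sampled from a handful of overlapping partitions,
// compute the mean and a ~95% confidence interval under the normal
// approximation. Illustrative only; names are hypothetical.
public final class OverlapEstimate {
    // Returns {mean, lower95, upper95}; requires at least two samples.
    public static double[] confidenceInterval(double[] sampledRatios) {
        int n = sampledRatios.length;
        double sum = 0;
        for (double r : sampledRatios) sum += r;
        double mean = sum / n;

        double sq = 0;
        for (double r : sampledRatios) sq += (r - mean) * (r - mean);
        // standard error of the mean: sample stddev / sqrt(n)
        double stderr = Math.sqrt(sq / (n - 1)) / Math.sqrt(n);

        // 1.96 standard errors ~ 95% under the assumed normal distribution
        return new double[] { mean, mean - 1.96 * stderr, mean + 1.96 * stderr };
    }
}
```

The width of the interval makes the cost/accuracy trade-off explicit: more sampled partitions shrink the bounds, at the price of the extra reads the comment warns about.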
[jira] [Commented] (CASSANDRA-7688) Add data sizing to a system table
[ https://issues.apache.org/jira/browse/CASSANDRA-7688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14229831#comment-14229831 ] Benedict commented on CASSANDRA-7688: - I'm talking about estimates. We cannot likely even estimate without pretty significant cost. Sampling column counts is pretty easy, but knowing how many cql rows there are for any merged row is not. There are tricks to make it easier, but there are datasets for which the tricks will not work, and any estimate would be complete guesswork without sampling the data.
[jira] [Commented] (CASSANDRA-8341) Expose time spent in each thread pool
[ https://issues.apache.org/jira/browse/CASSANDRA-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14229872#comment-14229872 ] Benedict commented on CASSANDRA-8341: - That is difficult, since we have stages that perform work that does not consume CPU. The RPC stages (for thrift and cql) both spend the majority of their time _waiting_ for the relevant work stage to complete. The proposed approaches would count this as busy time. The read and write stages can also block on IO, the former more often than the latter, but in either case we would count erroneously. Expose time spent in each thread pool - Key: CASSANDRA-8341 URL: https://issues.apache.org/jira/browse/CASSANDRA-8341 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Chris Lohfink Priority: Minor Labels: metrics Attachments: 8341.patch, 8341v2.txt We can increment a counter with the time spent in each queue. This can provide context on how much time is spent, percentage-wise, in each stage. Additionally it can be used with Little's law in future if we ever want to try to tune the size of the pools. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
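To illustrate why the proposed counters miscount, here is a toy instrumented pool (plain Python, nothing to do with Cassandra's SEPExecutor): it separates queue wait from execution time, but the measured execution time still includes any blocking the task does, so an RPC-style stage that mostly waits would be reported as busy.

```python
import time
from concurrent.futures import ThreadPoolExecutor

class InstrumentedPool:
    """Toy executor wrapper that records queue wait and execution time
    separately. Execution time includes blocking (sleep, IO, waiting on a
    downstream stage) -- exactly the miscounting described above."""

    def __init__(self, workers=1):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self.queue_wait = 0.0   # time tasks sat queued before a thread ran them
        self.exec_time = 0.0    # wall time inside tasks, blocking included

    def submit(self, fn, *args):
        enqueued = time.monotonic()
        def wrapped():
            started = time.monotonic()
            self.queue_wait += started - enqueued
            try:
                return fn(*args)
            finally:
                self.exec_time += time.monotonic() - started
        return self._pool.submit(wrapped)
```

A task that only sleeps shows up entirely as "execution" time, even though it consumed no CPU, which is the objection raised in the comment.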
[jira] [Commented] (CASSANDRA-6976) Determining replicas to query is very slow with large numbers of nodes or vnodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14230181#comment-14230181 ] Benedict commented on CASSANDRA-6976: - bq. I recall someone on the Mechanical Sympathy group pointing out that you can warm an entire last level cache in some small amount of time, I think it was 30ish milliseconds. I can't find the post and I could be very wrong, but it was definitely milliseconds. My guess is that in the big picture cache effects aren't changing the narrative that this takes 10s to 100s of milliseconds. Sure it does - if an action that is likely memory bound (like this one - after all, it does very little in the way of computation and doesn't touch any disk) takes time X with a warmed cache, and only touches data that can fit in cache, it will take X*K with a cold cache for some K significantly greater than 1 - and in real operation, especially with many tokens, there is a quite reasonable likelihood of a cold cache given the lack of locality and amount of data as the cluster grows. This is actually one possibility for improving this behaviour, if we cared at all - ensuring the number of cache lines touched is kept low, working with primitives for the token ranges and inet addresses to reduce the constant factors. This would also improve the normal code paths, not just range slices. bq. If it is slow, what is the solution? Even if we lazily materialize the ranges the run time of fetching batches of results dominates the in-memory compute of getRestrictedRanges. When we talked use cases it seems like people would be using paging programmatically, so only console users would see this poor performance outside of the lookup table use case you mentioned. For a lookup (i.e. small) table query, or a range query that can be serviced entirely by the local node, it is quite unlikely that the fetching would dominate when talking about timescales >= 1ms. bq. I didn't quite follow this. 
Are you talking about getLiveSortedEndpoints called from getRangeSlice? I haven't dug deep enough into getRangeSlice to tell you where the time in that goes exactly. I would have to do it again and insert some probes. I assumed it was dominated by sending remote requests. Yes - for your benchmark it would not have spent much time here, since the sort would be a no-op and the list a single entry, but as the number of data centres and replication factor grows, and with use of NetworkTopologyStrategy, this could be a significant time expenditure. It will also, in the aggregate, consume a certain percentage of cpu time across all queries. However, since the sort order is actually pretty consistent, sorting only when the sort order changes would be a way to eliminate this cost. bq. Benchmarking in what scope? This microbenchmark, defaults for workloads in cstar, tribal knowledge when doing performance work? Like I said, please do feel free to drop this particular line of enquiry for the moment, since even with all of the above I doubt this is a pressing matter. But I don't think this is the end of the topic entirely - at some point this cost will be a more measurable percentage of work done. But these kinds of costs are simply not a part of any of our current benchmarking methodology, since our default configs avoid the code paths entirely (either by having no DCs, low RF, low node count, no tokens, and SimpleStrategy), and that is something we should address. In the meantime it might be worth having a simple short-circuit path for queries that may be answered by the local node only, though. 
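The "sort only when the sort order changes" idea could look something like this sketch (hypothetical names throughout; a real version would key off ring/snitch state rather than an explicit version number):

```python
class EndpointSorter:
    """Sketch: sort replicas by proximity only when the topology 'version'
    changes, instead of re-sorting on every query."""

    def __init__(self, proximity_key):
        self._proximity_key = proximity_key
        self._cache = {}   # (version, endpoint set) -> cached sorted list
        self.sorts = 0     # instrumentation: how many real sorts happened

    def sorted_endpoints(self, version, endpoints):
        key = (version, frozenset(endpoints))
        cached = self._cache.get(key)
        if cached is None:
            self.sorts += 1
            cached = sorted(endpoints, key=self._proximity_key)
            self._cache[key] = cached
        return cached
```

With a stable sort order this turns a per-query sort into a dictionary lookup, which is the cost elimination suggested above.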
Determining replicas to query is very slow with large numbers of nodes or vnodes Key: CASSANDRA-6976 URL: https://issues.apache.org/jira/browse/CASSANDRA-6976 Project: Cassandra Issue Type: Bug Components: Core Reporter: Benedict Assignee: Ariel Weisberg Labels: performance Attachments: GetRestrictedRanges.java, jmh_output.txt, jmh_output_murmur3.txt, make_jmh_work.patch As described in CASSANDRA-6906, this can be ~100ms for a relatively small cluster with vnodes, which is longer than it will spend in transit on the network. This should be much faster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-6976) Determining replicas to query is very slow with large numbers of nodes or vnodes
[ https://issues.apache.org/jira/browse/CASSANDRA-6976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14230280#comment-14230280 ] Benedict commented on CASSANDRA-6976: - bq. I don't see a reason to drop it just because the ticket got caught up in implementation details and not the user facing issue we want to address. Well, given the test case that originally produced this concern almost certainly had the same methodology you had, I suspect you did indeed track down the problem to a non-warm JVM bq. The entire thing runs in 60 milliseconds with 2000 tokens. That is 2x the time to warm up the cache (assuming a correct number for warmup). You're assuming that (1) the cache stays warm in normal operation and (2) that the warmup figures you have are for similar data distributions and (3) the warmup is simply a matter of presence in cache, rather than likelihood of eviction (4) all this behaviour has no negative impact outside of the method itself. But, like I said, I agree it won't likely make an order of magnitude difference by itself. Especially not with current state of C*. bq. Range queries are slow because they produce a lot of ranges. Did we determine that if the _result_ is a narrow range the performance is significantly faster? Because this stemmed from a situation where the entire contents were known to be node-local (because the data was local only, it wasn't actually distributed). I wouldn't be at all surprised if it was fine, given the likely cause you tracked down, but I don't think we actually demonstrated that? bq. What queries could identify that this shortcut is possible? I am referring here to the more general case of getLiveSortedEndpoints, which is used much more widely. But, like I said, I raised this largely because of a general bugging that this whole area of code has many inefficiencies, not because it is likely they really matter. 
The only thing actionable is that we *should* take steps to ensure our default (and common) test and benchmark configs more accurately represent real cluster configs because we simply do not exercise these codepaths right now from a performance perspective. Determining replicas to query is very slow with large numbers of nodes or vnodes Key: CASSANDRA-6976 URL: https://issues.apache.org/jira/browse/CASSANDRA-6976 Project: Cassandra Issue Type: Bug Components: Core Reporter: Benedict Assignee: Ariel Weisberg Labels: performance Attachments: GetRestrictedRanges.java, jmh_output.txt, jmh_output_murmur3.txt, make_jmh_work.patch As described in CASSANDRA-6906, this can be ~100ms for a relatively small cluster with vnodes, which is longer than it will spend in transit on the network. This should be much faster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7203) Flush (and Compact) High Traffic Partitions Separately
[ https://issues.apache.org/jira/browse/CASSANDRA-7203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14231203#comment-14231203 ] Benedict commented on CASSANDRA-7203: - I was _mostly_ hoping to get your and [~kohlisankalp]'s views on _if those workload skews occur_. Then we could at some point later get into the nitty gritty of whether it would be worth it :-) The idea wouldn't really be to special case anything except flush, and to depend on (and implement after) improvements we have either envisaged or could later envisage to avoid compacting sstables with low predicted overlap of partitions. i.e. it would have the potential to improve the benefit of such schemes, by increasing the number of sstable pairings they can rule out. Flush (and Compact) High Traffic Partitions Separately -- Key: CASSANDRA-7203 URL: https://issues.apache.org/jira/browse/CASSANDRA-7203 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Labels: compaction, performance An idea possibly worth exploring is the use of streaming count-min sketches to collect data over the up-time of a server to estimate the velocity of different partitions, so that high-volume partitions can be flushed separately on the assumption that they will be much smaller in number, thus reducing write amplification by permitting compaction independently of any low-velocity data. Whilst the idea is reasonably straightforward, it seems that the biggest problem here will be defining any success metric. Obviously any workload following an exponential/zipf/extreme distribution is likely to benefit from such an approach, but whether or not that would translate in real terms is another matter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
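For reference, the count-min sketch the ticket description mentions is small to implement; a minimal Python version (illustrative only, not proposed Cassandra code) for counting per-partition writes:

```python
import hashlib

class CountMinSketch:
    """Minimal count-min sketch for estimating per-partition write counts,
    as the ticket proposes for spotting high-velocity partitions."""

    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _buckets(self, key):
        # One independent-ish hash per row, via blake2b personalization.
        for i in range(self.depth):
            h = hashlib.blake2b(key.encode(), person=bytes([i] * 16)).digest()
            yield i, int.from_bytes(h[:8], "big") % self.width

    def add(self, key, count=1):
        for i, b in self._buckets(key):
            self.rows[i][b] += count

    def estimate(self, key):
        # Never undercounts; may overcount due to hash collisions.
        return min(self.rows[i][b] for i, b in self._buckets(key))
```

Memory is fixed (width x depth counters) regardless of the number of partitions, which is what makes it attractive for tracking velocity over a server's uptime.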
[jira] [Commented] (CASSANDRA-8018) Cassandra seems to insert twice in custom PerColumnSecondaryIndex
[ https://issues.apache.org/jira/browse/CASSANDRA-8018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14231249#comment-14231249 ] Benedict commented on CASSANDRA-8018: - Good catch. A few nits on the patch, but I'll make them and commit: * iff is never a typo, it means if and only if * we should remove the call from inside addNewKey, rather than outside it, as that is the call that was originally meant to be removed. this way all of the calls happen in the same logical unit of code Cassandra seems to insert twice in custom PerColumnSecondaryIndex - Key: CASSANDRA-8018 URL: https://issues.apache.org/jira/browse/CASSANDRA-8018 Project: Cassandra Issue Type: Bug Components: Core Reporter: Pavel Chlupacek Assignee: Benjamin Lerer Fix For: 2.1.3 Attachments: CASSANDRA-8018.txt When inserting data into Cassandra 2.1.0 into table with custom secondary index, the Cell is inserted twice, if inserting new entry into row with same rowId, but different cluster index columns. 
CREATE KEYSPACE fulltext WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 1}; CREATE TABLE fulltext.test ( id uuid, name text, name2 text, json varchar, lucene text, primary key ( id , name)); CREATE CUSTOM INDEX lucene_idx on fulltext.test(lucene) using 'com.spinoco.fulltext.cassandra.TestIndex'; // this causes only one insert insertInto(fulltext,test) .value(id, id1.uuid) .value(name, goosh1) .value(json, TestContent.message1.asJson) // this causes 2 inserts to be done insertInto(fulltext,test) .value(id, id1.uuid) .value(name, goosh2) .value(json, TestContent.message2.asJson) /// stacktraces for inserts (always same, for 1st and 2nd insert) custom indexer stacktraces and then at org.apache.cassandra.db.index.SecondaryIndexManager$StandardUpdater.insert(SecondaryIndexManager.java:707) at org.apache.cassandra.db.AtomicBTreeColumns$ColumnUpdater.apply(AtomicBTreeColumns.java:344) at org.apache.cassandra.db.AtomicBTreeColumns$ColumnUpdater.apply(AtomicBTreeColumns.java:319) at org.apache.cassandra.utils.btree.NodeBuilder.addNewKey(NodeBuilder.java:323) at org.apache.cassandra.utils.btree.NodeBuilder.update(NodeBuilder.java:191) at org.apache.cassandra.utils.btree.Builder.update(Builder.java:74) at org.apache.cassandra.utils.btree.BTree.update(BTree.java:186) at org.apache.cassandra.db.AtomicBTreeColumns.addAllWithSizeDelta(AtomicBTreeColumns.java:189) at org.apache.cassandra.db.Memtable.put(Memtable.java:194) at org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1142) at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:394) at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:351) at org.apache.cassandra.db.Mutation.apply(Mutation.java:214) at org.apache.cassandra.service.StorageProxy$7.runMayThrow(StorageProxy.java:970) at org.apache.cassandra.service.StorageProxy$LocalMutationRunnable.run(StorageProxy.java:2080) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at 
org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:163) at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:103) at java.lang.Thread.run(Thread.java:744) Note that the cell, rowKey and Group in public abstract void insert(ByteBuffer rowKey, Cell col, OpOrder.Group opGroup); have the same identity for both successive calls -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7882) Memtable slab allocation should scale logarithmically to improve occupancy rate
[ https://issues.apache.org/jira/browse/CASSANDRA-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14231374#comment-14231374 ] Benedict commented on CASSANDRA-7882: - Hi Jay, I've been away for the past two months, so sorry this got left by the wayside in the meantime. I'll get around to reviewing it shortly. Memtable slab allocation should scale logarithmically to improve occupancy rate --- Key: CASSANDRA-7882 URL: https://issues.apache.org/jira/browse/CASSANDRA-7882 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jay Patel Assignee: Jay Patel Labels: performance Fix For: 2.1.3 Attachments: trunk-7882.txt CASSANDRA-5935 allows option to disable region-based allocation for on-heap memtables but there is no option to disable it for off-heap memtables (memtable_allocation_type: offheap_objects). Disabling region-based allocation will allow us to pack more tables in the schema since minimum of 1MB region won't be allocated per table. Downside can be more fragmentation which should be controllable by using better allocator like JEMalloc. How about below option in yaml?: memtable_allocation_type: unslabbed_offheap_objects Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
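A hedged sketch of the logarithmic-scaling idea in the ticket title (the constants and policy here are assumptions for illustration, not the patch's actual behaviour): start each table's memtable with a small region and double the region size as the table allocates more, capping at the old fixed 1MB, so tiny tables no longer each pin a full megabyte.

```python
def next_region_size(allocated_so_far, initial=8 * 1024, maximum=1024 * 1024):
    """Return the size of the next slab region for a table that has already
    allocated `allocated_so_far` bytes: doubles from `initial` up to the
    legacy 1MB cap, so region count grows logarithmically with usage."""
    size = initial
    while size < maximum and size <= allocated_so_far:
        size *= 2
    return size
```

Under this policy a table's worst-case slack is proportional to its last (largest) region rather than always 1MB, which is what improves the occupancy rate for schemas with many small tables.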
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14231425#comment-14231425 ] Benedict commented on CASSANDRA-7032: - It's plain old statistics. Have a look at the java code I attached that simulates and reports the level of imbalance. Currently we randomly assign the tokens, and this results in some nodes happening to end up with all of their token ranges narrow relative to the other existing tokens, and others wide. Consistent hashing is what Riak uses to achieve balance, which is one approach. Rendezvous hashing is another. But these would likely involve changing the tokens of every node in the cluster on adding a new node. This would be acceptable, but I expect with the amount of state space to work with we can design an algorithm that guarantees low bounds of imbalance without having to change the tokens assigned to any existing nodes. Improve vnode allocation Key: CASSANDRA-7032 URL: https://issues.apache.org/jira/browse/CASSANDRA-7032 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Branimir Lambov Labels: performance, vnodes Fix For: 3.0 Attachments: TestVNodeAllocation.java, TestVNodeAllocation.java It's been known for a little while that random vnode allocation causes hotspots of ownership. It should be possible to improve dramatically on this with deterministic allocation. I have quickly thrown together a simple greedy algorithm that allocates vnodes efficiently, and will repair hotspots in a randomly allocated cluster gradually as more nodes are added, and also ensures that token ranges are fairly evenly spread between nodes (somewhat tunably so). The allocation still permits slight discrepancies in ownership, but it is bound by the inverse of the size of the cluster (as opposed to random allocation, which strangely gets worse as the cluster size increases). 
I'm sure there is a decent dynamic programming solution to this that would be even better. If on joining the ring a new node were to CAS a shared table where a canonical allocation of token ranges lives after running this (or a similar) algorithm, we could then get guaranteed bounds on the ownership distribution in a cluster. This will also help for CASSANDRA-6696. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
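The imbalance from random allocation is easy to reproduce; this short simulation (a rough Python stand-in for the attached TestVNodeAllocation.java, not a port of it) reports the hottest node's ownership relative to a perfect share:

```python
import random

def ownership_imbalance(nodes, vnodes_per_node, ring=2**64):
    """Randomly assign vnode tokens on a ring and return max-node ownership
    divided by the perfect per-node share (1.0 would be ideal balance)."""
    tokens = sorted((random.randrange(ring), node)
                    for node in range(nodes)
                    for _ in range(vnodes_per_node))
    owned = [0] * nodes
    prev = tokens[-1][0] - ring   # wrap the ring so intervals sum to `ring`
    for tok, node in tokens:
        owned[node] += tok - prev # each token owns the range preceding it
        prev = tok
    return max(owned) / (ring / nodes)
```

Each node's ownership is a sum of random interval widths, so with few vnodes per node the variance (and hence the hotspot) is large; more vnodes smooth it, but only as roughly the inverse square root of the count.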
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14231433#comment-14231433 ] Benedict commented on CASSANDRA-7032: - I should note that the dovetailing with CASSANDRA-6696 is very important. Acceptable imbalance _per node_ is actually not _too_ tricky to deliver. But ensuring each disk on each node will have a fair share of the pie is a little harder Improve vnode allocation Key: CASSANDRA-7032 URL: https://issues.apache.org/jira/browse/CASSANDRA-7032 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Branimir Lambov Labels: performance, vnodes Fix For: 3.0 Attachments: TestVNodeAllocation.java, TestVNodeAllocation.java It's been known for a little while that random vnode allocation causes hotspots of ownership. It should be possible to improve dramatically on this with deterministic allocation. I have quickly thrown together a simple greedy algorithm that allocates vnodes efficiently, and will repair hotspots in a randomly allocated cluster gradually as more nodes are added, and also ensures that token ranges are fairly evenly spread between nodes (somewhat tunably so). The allocation still permits slight discrepancies in ownership, but it is bound by the inverse of the size of the cluster (as opposed to random allocation, which strangely gets worse as the cluster size increases). I'm sure there is a decent dynamic programming solution to this that would be even better. If on joining the ring a new node were to CAS a shared table where a canonical allocation of token ranges lives after running this (or a similar) algorithm, we could then get guaranteed bounds on the ownership distribution in a cluster. This will also help for CASSANDRA-6696. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7203) Flush (and Compact) High Traffic Partitions Separately
[ https://issues.apache.org/jira/browse/CASSANDRA-7203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232045#comment-14232045 ] Benedict commented on CASSANDRA-7203: - It wasn't intended to be an immediate focus, I just wanted an idea if such data distributions occurred to see if it might _ever_ be worth investigating. But I can see I'm fighting a losing battle! Flush (and Compact) High Traffic Partitions Separately -- Key: CASSANDRA-7203 URL: https://issues.apache.org/jira/browse/CASSANDRA-7203 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Labels: compaction, performance An idea possibly worth exploring is the use of streaming count-min sketches to collect data over the up-time of a server to estimating the velocity of different partitions, so that high-volume partitions can be flushed separately on the assumption that they will be much smaller in number, thus reducing write amplification by permitting compaction independently of any low-velocity data. Whilst the idea is reasonably straight forward, it seems that the biggest problem here will be defining any success metric. Obviously any workload following an exponential/zipf/extreme distribution is likely to benefit from such an approach, but whether or not that would translate in real terms is another matter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8411) Cassandra stress tool fails with NotStrictlyPositiveException on example profiles
[ https://issues.apache.org/jira/browse/CASSANDRA-8411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232818#comment-14232818 ] Benedict commented on CASSANDRA-8411: - Looks likely to be a trivial bug when providing n=1 with all other parameters default - since this is a stress tool, I don't think anybody has tried running it with only 1 insert before! Cassandra stress tool fails with NotStrictlyPositiveException on example profiles - Key: CASSANDRA-8411 URL: https://issues.apache.org/jira/browse/CASSANDRA-8411 Project: Cassandra Issue Type: Bug Components: Tools Environment: Linux Centos Reporter: Igor Meltser Priority: Critical Trying to run stress tool with provided profile fails: dsc-cassandra-2.1.2/tools $ ./bin/cassandra-stress user n=1 profile=cqlstress-example.yaml ops\(insert=1\) -node INFO 06:21:35 Using data-center name 'datacenter1' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct datacenter name with DCAwareRoundRobinPolicy constructor) Connected to cluster: Benchmark Cluster INFO 06:21:35 New Cassandra host /:9042 added Datatacenter: datacenter1; Host: /.; Rack: rack1 Datatacenter: datacenter1; Host: /; Rack: rack1 Datatacenter: datacenter1; Host: marcus14-p/; Rack: rack1 INFO 06:21:35 New Cassandra host marcus14-p/:9042 added INFO 06:21:35 New Cassandra host /:9042 added Created schema. Sleeping 3s for propagation. 
Exception in thread main org.apache.commons.math3.exception.NotStrictlyPositiveException: standard deviation (0) at org.apache.commons.math3.distribution.NormalDistribution.init(NormalDistribution.java:108) at org.apache.cassandra.stress.settings.OptionDistribution$GaussianFactory.get(OptionDistribution.java:418) at org.apache.cassandra.stress.generate.SeedManager.init(SeedManager.java:59) at org.apache.cassandra.stress.settings.SettingsCommandUser.getFactory(SettingsCommandUser.java:78) at org.apache.cassandra.stress.StressAction.run(StressAction.java:61) at org.apache.cassandra.stress.Stress.main(Stress.java:109) The tool is 2.1.2 version, but the tested Cassandra is 2.0.8 version -- This message was sent by Atlassian JIRA (v6.3.4#6332)
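The failure is consistent with deriving a Gaussian from a degenerate range: with n=1 the default seed range collapses to a single value, its standard deviation becomes 0, and commons-math's NormalDistribution constructor rejects non-positive standard deviations. A Python sketch of the shape of the bug and one possible guard (the parameter derivation below is an assumption for illustration, not the stress tool's exact code):

```python
def gaussian_params(min_v, max_v, stdevs=3):
    """Derive (mean, stddev) for a gaussian(min..max) option; a strict
    constructor (like commons-math's NormalDistribution) rejects stddev <= 0,
    which is what happens when min_v == max_v."""
    mean = (min_v + max_v) / 2
    stddev = (max_v - min_v) / (2 * stdevs)
    if stddev <= 0:
        raise ValueError("standard deviation (%g)" % stddev)  # the reported failure
    return mean, stddev

def gaussian_params_fixed(min_v, max_v, stdevs=3):
    """One possible trivial fix: degenerate ranges fall back to a constant."""
    if min_v == max_v:
        return min_v, None   # constant value, no Gaussian needed
    return gaussian_params(min_v, max_v, stdevs)
```

This matches the comment above: the bug only surfaces with n=1, because any larger range yields a strictly positive standard deviation.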
[jira] [Updated] (CASSANDRA-8411) Cassandra stress tool fails with NotStrictlyPositiveException on example profiles
[ https://issues.apache.org/jira/browse/CASSANDRA-8411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-8411: Priority: Trivial (was: Critical) Cassandra stress tool fails with NotStrictlyPositiveException on example profiles - Key: CASSANDRA-8411 URL: https://issues.apache.org/jira/browse/CASSANDRA-8411 Project: Cassandra Issue Type: Bug Components: Tools Environment: Linux Centos Reporter: Igor Meltser Priority: Trivial Trying to run stress tool with provided profile fails: dsc-cassandra-2.1.2/tools $ ./bin/cassandra-stress user n=1 profile=cqlstress-example.yaml ops\(insert=1\) -node INFO 06:21:35 Using data-center name 'datacenter1' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct datacenter name with DCAwareRoundRobinPolicy constructor) Connected to cluster: Benchmark Cluster INFO 06:21:35 New Cassandra host /:9042 added Datatacenter: datacenter1; Host: /.; Rack: rack1 Datatacenter: datacenter1; Host: /; Rack: rack1 Datatacenter: datacenter1; Host: marcus14-p/; Rack: rack1 INFO 06:21:35 New Cassandra host marcus14-p/:9042 added INFO 06:21:35 New Cassandra host /:9042 added Created schema. Sleeping 3s for propagation. Exception in thread main org.apache.commons.math3.exception.NotStrictlyPositiveException: standard deviation (0) at org.apache.commons.math3.distribution.NormalDistribution.init(NormalDistribution.java:108) at org.apache.cassandra.stress.settings.OptionDistribution$GaussianFactory.get(OptionDistribution.java:418) at org.apache.cassandra.stress.generate.SeedManager.init(SeedManager.java:59) at org.apache.cassandra.stress.settings.SettingsCommandUser.getFactory(SettingsCommandUser.java:78) at org.apache.cassandra.stress.StressAction.run(StressAction.java:61) at org.apache.cassandra.stress.Stress.main(Stress.java:109) The tool is 2.1.2 version, but the tested Cassandra is 2.0.8 version -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)
[ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232833#comment-14232833 ] Benedict commented on CASSANDRA-7438: - re: hash bits: there's not really a dramatic benefit to using more than 32-bits. We will always use the upper bits for the segment and the lower bits for the bucket, for which 4B items is plenty, although we don't have proper entropy for all the bits; we may have only 28-bits of good collision free-ness; we will want to rehash the murmur hash to ensure this is spread evenly to avoid a grow boundary consistently failing to reduce collisions. The one advantage of having some spare hash bits is that we can use these to avoid running a potentially expensive comparison on a large key until high confidence we've found the correct item - and as the number of unused hash bits for indexing dwindle, the value of this goes up. But the number of instances where this helps will be vanishingly small, since the head of the key will be on the same cache line and a hash collision and key prefix collision is pretty unlikely. It might be more significant if we were to use open-address hashing, as we would have excellent locality and reduce the number of expected cache misses for a lookup. But this won't be measurable above the cache serialization costs. We do already have these hash bits calculated in c*, typically. We also are unlikely to notice the overhead - allocations are likely to have ~16 bytes of overhead, be padded to the nearest 8 or 16 bytes, and a row has a lot of bumpf to encode. I doubt there will be any variation in storage costs from using all 64 bits. 
i.e., whatever floats your boat Serializing Row cache alternative (Fully off heap) -- Key: CASSANDRA-7438 URL: https://issues.apache.org/jira/browse/CASSANDRA-7438 Project: Cassandra Issue Type: Improvement Components: Core Environment: Linux Reporter: Vijay Assignee: Vijay Labels: performance Fix For: 3.0 Attachments: 0001-CASSANDRA-7438.patch, tests.zip Currently SerializingCache is partially off heap, keys are still stored in JVM heap as BB, * There is a higher GC costs for a reasonably big cache. * Some users have used the row cache efficiently in production for better results, but this requires careful tunning. * Overhead in Memory for the cache entries are relatively high. So the proposal for this ticket is to move the LRU cache logic completely off heap and use JNI to interact with cache. We might want to ensure that the new implementation match the existing API's (ICache), and the implementation needs to have safe memory access, low overhead in memory and less memcpy's (As much as possible). We might also want to make this cache configurable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
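The "spare hash bits" trick discussed in the comment can be shown in a few lines (an illustrative sketch, not Cassandra or any off-heap cache's actual code): store the full 64-bit hash alongside each entry, and fall through to the potentially expensive key comparison only on a complete hash match.

```python
class HashEntry:
    """Sketch: keep the full 64-bit hash next to each entry so a cheap
    integer compare rejects almost all non-matching keys before the
    (potentially large, off-heap) key comparison runs."""

    def __init__(self, key, hash64):
        self.key = key
        self.hash64 = hash64
        self.comparisons = 0   # instrumentation: full key compares performed

    def matches(self, key, hash64):
        if self.hash64 != hash64:   # cheap 64-bit compare rejects most misses
            return False
        self.comparisons += 1       # rare: full key comparison needed
        return self.key == key
```

As the comment notes, the practical benefit is small when the key prefix shares a cache line with the hash, but the pattern shows why spare hash bits gain value as the bits consumed for indexing grow.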
[jira] [Commented] (CASSANDRA-8383) Memtable flush may expire records from the commit log that are in a later memtable
[ https://issues.apache.org/jira/browse/CASSANDRA-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232824#comment-14232824 ] Benedict commented on CASSANDRA-8383: - bq. Does this deserve a regression test? bq. We should also introduce a commit log correctness stress test, so we can reproduce this, be certain it is fixed, and so we can be sure to avoid this or similar scenarios in future. Yes, absolutely. However I have been tasked with other pressing things - I only took time out to file and address this because it is an obvious and dangerous potential failure of correctness. We should file a follow-up ticket for introducing rigorous randomized testing to tease out any potential correctness issues from this codepath, which either can be looked at immediately by somebody else, or I can take a look at once my current workload is dealt with. But doing this well requires a bit of time and focus, which I didn't want holding up a fix. Memtable flush may expire records from the commit log that are in a later memtable -- Key: CASSANDRA-8383 URL: https://issues.apache.org/jira/browse/CASSANDRA-8383 Project: Cassandra Issue Type: Bug Components: Core Reporter: Benedict Assignee: Benedict Priority: Critical Labels: commitlog Fix For: 2.1.3 This is a pretty obvious bug with any care or thought, so not sure how I managed to introduce it. We use OpOrder to ensure all writes to a memtable have finished before flushing, however we also use this OpOrder to direct writes to the correct memtable. However this is insufficient, since the OpOrder is only a partial order; an operation from the future (i.e. for the next memtable) could still interleave with the past operations in such a way that they grab a CL entry in between the past operations. Since we simply take the max ReplayPosition of those in the past, this would mean any interleaved future operations would be expired even though they haven't been persisted to disk. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (CASSANDRA-8412) Cassandra stress tool fails with NotStrictlyPositiveException on example profiles
[ https://issues.apache.org/jira/browse/CASSANDRA-8412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict resolved CASSANDRA-8412. - Resolution: Duplicate Cassandra stress tool fails with NotStrictlyPositiveException on example profiles - Key: CASSANDRA-8412 URL: https://issues.apache.org/jira/browse/CASSANDRA-8412 Project: Cassandra Issue Type: Bug Components: Tools Environment: Linux Centos Reporter: Igor Meltser Priority: Critical Labels: stress, tools Trying to run stress tool with provided profile fails: dsc-cassandra-2.1.2/tools $ ./bin/cassandra-stress user n=1 profile=cqlstress-example.yaml ops\(insert=1\) -node INFO 06:21:35 Using data-center name 'datacenter1' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct datacenter name with DCAwareRoundRobinPolicy constructor) Connected to cluster: Benchmark Cluster INFO 06:21:35 New Cassandra host /:9042 added Datatacenter: datacenter1; Host: /.; Rack: rack1 Datatacenter: datacenter1; Host: /; Rack: rack1 Datatacenter: datacenter1; Host: ./; Rack: rack1 INFO 06:21:35 New Cassandra host ./:9042 added INFO 06:21:35 New Cassandra host /:9042 added Created schema. Sleeping 3s for propagation. Exception in thread main org.apache.commons.math3.exception.NotStrictlyPositiveException: standard deviation (0) at org.apache.commons.math3.distribution.NormalDistribution.init(NormalDistribution.java:108) at org.apache.cassandra.stress.settings.OptionDistribution$GaussianFactory.get(OptionDistribution.java:418) at org.apache.cassandra.stress.generate.SeedManager.init(SeedManager.java:59) at org.apache.cassandra.stress.settings.SettingsCommandUser.getFactory(SettingsCommandUser.java:78) at org.apache.cassandra.stress.StressAction.run(StressAction.java:61) at org.apache.cassandra.stress.Stress.main(Stress.java:109) The tool is 2.1.2 version, but the tested Cassandra is 2.0.8 version -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232857#comment-14232857 ] Benedict commented on CASSANDRA-7032: - Well, NetworkTopologyStrategy already enforces some degree of balance across racks, and absolutely guarantees balance across DCs as far as replication ownership is concerned. It _would_ be nice to migrate this behaviour to the token selection so that we could reason about ownership a bit more clearly (NTS might enforce our general ownership constraints, but having a predictably cheap generation strategy for end points would be great, as the amount of state necessary to route queries could shrink dramatically. If we could rely on a sequence of adjacent tokens ensuring these properties, for instance), but a simpler goal of simply ensuring that for any given arbitrary slice of the global token range, all nodes have a share of the range that is within epsilon of perfect, should be more than sufficient. TL;DR: our goal should probably be: for any given arbitrary slice of the global token range, all nodes have a share of the range that is within epsilon* of perfect * with epsilon probably inversely proportional to the size of the slice Improve vnode allocation Key: CASSANDRA-7032 URL: https://issues.apache.org/jira/browse/CASSANDRA-7032 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Branimir Lambov Labels: performance, vnodes Fix For: 3.0 Attachments: TestVNodeAllocation.java, TestVNodeAllocation.java It's been known for a little while that random vnode allocation causes hotspots of ownership. It should be possible to improve dramatically on this with deterministic allocation. 
I have quickly thrown together a simple greedy algorithm that allocates vnodes efficiently, and will repair hotspots in a randomly allocated cluster gradually as more nodes are added, and also ensures that token ranges are fairly evenly spread between nodes (somewhat tunably so). The allocation still permits slight discrepancies in ownership, but it is bound by the inverse of the size of the cluster (as opposed to random allocation, which strangely gets worse as the cluster size increases). I'm sure there is a decent dynamic programming solution to this that would be even better. If on joining the ring a new node were to CAS a shared table where a canonical allocation of token ranges lives after running this (or a similar) algorithm, we could then get guaranteed bounds on the ownership distribution in a cluster. This will also help for CASSANDRA-6696. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
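The greedy scheme described above can be illustrated with a toy allocator: each new token is placed at the midpoint of the currently widest ownership gap, so hotspots in a randomly allocated ring are repaired gradually as tokens are added. This is a hypothetical sketch, not the attached TestVNodeAllocation.java; the class name, ring size, and imbalance measure are all invented for illustration.

```java
import java.util.Random;
import java.util.TreeSet;

// Toy greedy token allocator (hypothetical; not the attached TestVNodeAllocation.java).
// Each new token is placed at the midpoint of the widest ownership gap on the
// ring, so a lopsided (randomly allocated) cluster is repaired gradually as
// more tokens are added.
public class GreedyTokenSketch {
    static final long RING = 1L << 32; // toy ring size, far smaller than Murmur3's

    // next token = midpoint of the widest gap between existing tokens
    static long nextToken(TreeSet<Long> tokens) {
        long bestStart = 0, bestWidth = -1;
        long prev = tokens.last() - RING; // wrap-around predecessor
        for (long t : tokens) {
            long width = t - prev;
            if (width > bestWidth) { bestWidth = width; bestStart = prev; }
            prev = t;
        }
        return Math.floorMod(bestStart + bestWidth / 2, RING);
    }

    // widest gap divided by the ideal (perfectly even) gap; 1.0 is perfect
    static double imbalance(TreeSet<Long> tokens) {
        long prev = tokens.last() - RING, max = 0;
        for (long t : tokens) { max = Math.max(max, t - prev); prev = t; }
        return max / ((double) RING / tokens.size());
    }

    public static void main(String[] args) {
        // seed with a deliberately lopsided allocation: all tokens in one quarter
        TreeSet<Long> tokens = new TreeSet<>();
        Random rnd = new Random(42);
        for (int i = 0; i < 16; i++) tokens.add(Math.floorMod(rnd.nextLong(), RING / 4));
        System.out.printf("imbalance before greedy additions: %.2f%n", imbalance(tokens));
        for (int i = 0; i < 64; i++) tokens.add(nextToken(tokens));
        System.out.printf("imbalance after greedy additions: %.2f%n", imbalance(tokens));
    }
}
```

As the description notes, the residual discrepancy in such a greedy scheme is bounded by the inverse of the cluster size, whereas random allocation worsens as the cluster grows.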
[jira] [Comment Edited] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232857#comment-14232857 ] Benedict edited comment on CASSANDRA-7032 at 12/3/14 10:38 AM: --- Well, NetworkTopologyStrategy already enforces some degree of balance across racks, and absolutely guarantees balance across DCs as far as replication ownership is concerned. It _would_ be nice to migrate this behaviour to the token selection so that we could reason about ownership a bit more clearly (NTS might enforce our general ownership constraints, but having a predictably cheap generation strategy for end points would be great, as the amount of state necessary to route queries could shrink dramatically. if we could rely on a sequence of adjacent tokens ensuring these properties, for instance), but a simpler goal of simply ensuring that for any given arbitrary slice of the global token range, all nodes have a share of the range that is within epsilon of perfect, should be more than sufficient. TL;DR; our goal should probably be: for any given arbitrary slice of the global token range, all nodes have a share of the range that is within epsilon* of perfect \* with epsilon probably inversely proportional to the size of the slice was (Author: benedict): Well, NetworkTopologyStrategy already enforces some degree of balance across racks, and absolutely guarantees balance across DCs as far as replication ownership is concerned. It _would_ be nice to migrate this behaviour to the token selection so that we could reason about ownership a bit more clearly (NTS might enforce our general ownership constraints, but having a predictably cheap generation strategy for end points would be great, as the amount of state necessary to route queries could shrink dramatically. 
if we could rely on a sequence of adjacent tokens ensuring these properties, for instance), but a simpler goal of simply ensuring that for any given arbitrary slice of the global token range, all nodes have a share of the range that is within epsilon of perfect, should be more than sufficient. TL;DR; our goal should probably be: for any given arbitrary slice of the global token range, all nodes have a share of the range that is within epsilon* of perfect * with epsilon probably inversely proportional to the size of the slice Improve vnode allocation Key: CASSANDRA-7032 URL: https://issues.apache.org/jira/browse/CASSANDRA-7032 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Branimir Lambov Labels: performance, vnodes Fix For: 3.0 Attachments: TestVNodeAllocation.java, TestVNodeAllocation.java It's been known for a little while that random vnode allocation causes hotspots of ownership. It should be possible to improve dramatically on this with deterministic allocation. I have quickly thrown together a simple greedy algorithm that allocates vnodes efficiently, and will repair hotspots in a randomly allocated cluster gradually as more nodes are added, and also ensures that token ranges are fairly evenly spread between nodes (somewhat tunably so). The allocation still permits slight discrepancies in ownership, but it is bound by the inverse of the size of the cluster (as opposed to random allocation, which strangely gets worse as the cluster size increases). I'm sure there is a decent dynamic programming solution to this that would be even better. If on joining the ring a new node were to CAS a shared table where a canonical allocation of token ranges lives after running this (or a similar) algorithm, we could then get guaranteed bounds on the ownership distribution in a cluster. This will also help for CASSANDRA-6696. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-8411) Cassandra stress tool fails with NotStrictlyPositiveException on example profiles
[ https://issues.apache.org/jira/browse/CASSANDRA-8411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232818#comment-14232818 ] Benedict edited comment on CASSANDRA-8411 at 12/3/14 10:45 AM: --- Looks likely to be a trivial bug when providing n=1 with all other parameters default - since this is a stress tool, I don't think anybody has tried running it with only 1 insert before! If you want it to work in the meantime, try providing n=1000, say was (Author: benedict): Looks likely to be a trivial bug when providing n=1 with all other parameters default - since this is a stress tool, I don't think anybody has tried running it with only 1 insert before! Cassandra stress tool fails with NotStrictlyPositiveException on example profiles - Key: CASSANDRA-8411 URL: https://issues.apache.org/jira/browse/CASSANDRA-8411 Project: Cassandra Issue Type: Bug Components: Tools Environment: Linux Centos Reporter: Igor Meltser Priority: Trivial Trying to run stress tool with provided profile fails: dsc-cassandra-2.1.2/tools $ ./bin/cassandra-stress user n=1 profile=cqlstress-example.yaml ops\(insert=1\) -node INFO 06:21:35 Using data-center name 'datacenter1' for DCAwareRoundRobinPolicy (if this is incorrect, please provide the correct datacenter name with DCAwareRoundRobinPolicy constructor) Connected to cluster: Benchmark Cluster INFO 06:21:35 New Cassandra host /:9042 added Datatacenter: datacenter1; Host: /.; Rack: rack1 Datatacenter: datacenter1; Host: /; Rack: rack1 Datatacenter: datacenter1; Host: /; Rack: rack1 INFO 06:21:35 New Cassandra host /:9042 added INFO 06:21:35 New Cassandra host /:9042 added Created schema. Sleeping 3s for propagation. 
Exception in thread main org.apache.commons.math3.exception.NotStrictlyPositiveException: standard deviation (0) at org.apache.commons.math3.distribution.NormalDistribution.init(NormalDistribution.java:108) at org.apache.cassandra.stress.settings.OptionDistribution$GaussianFactory.get(OptionDistribution.java:418) at org.apache.cassandra.stress.generate.SeedManager.init(SeedManager.java:59) at org.apache.cassandra.stress.settings.SettingsCommandUser.getFactory(SettingsCommandUser.java:78) at org.apache.cassandra.stress.StressAction.run(StressAction.java:61) at org.apache.cassandra.stress.Stress.main(Stress.java:109) The tool is 2.1.2 version, but the tested Cassandra is 2.0.8 version -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-8413) Bloom filter false positive ratio is not honoured
Benedict created CASSANDRA-8413: --- Summary: Bloom filter false positive ratio is not honoured Key: CASSANDRA-8413 URL: https://issues.apache.org/jira/browse/CASSANDRA-8413 Project: Cassandra Issue Type: Bug Components: Core Reporter: Benedict Fix For: 2.0.12, 2.1.3 Whilst thinking about CASSANDRA-7438 and hash bits, I realised we have a problem with sabotaging our bloom filters when using the murmur3 partitioner. I have performed a very quick test to confirm this risk is real. Since a typical cluster uses the same murmur3 hash for partitioning as we do for bloom filter lookups, and we own a contiguous range, we can guarantee that the top X bits collide for all keys on the node. This translates into poor bloom filter distribution. I quickly hacked LongBloomFilterTest to simulate the problem, and the result in these tests is _up to_ a doubling of the actual false positive ratio. The actual change will depend on the key distribution, the number of keys, the false positive ratio, the number of nodes, the token distribution, etc. But seems to be a real problem for non-vnode clusters of at least ~128 nodes in size. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
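The effect described, where a node's contiguous token range pins the top hash bits because the same murmur3 hash drives both partitioning and the bloom filter, can be simulated quickly. The following is a hypothetical toy, not the hacked LongBloomFilterTest: k filter indices are sliced from a single 64-bit hash, and forcing every key to share its top 16 bits measurably inflates the false positive ratio, in this toy by roughly a factor of two, consistent with the "up to a doubling" observed above.

```java
import java.util.BitSet;
import java.util.Random;

// Quick simulation of the reported effect (hypothetical; not the hacked
// LongBloomFilterTest). k bloom filter indices are sliced from a single
// 64-bit hash. Forcing every key to share its top hash bits, as a contiguous
// token range does when the same murmur3 hash drives both partitioning and
// the filter, removes entropy from the high slices and inflates the false
// positive ratio.
public class BloomBiasSketch {
    static final int K = 4, IDX_BITS = 14, M = 1 << IDX_BITS; // 4 probes into 16384 bits

    // i-th index: the i-th 14-bit slice taken from the top of the hash
    static int index(long h, int i) {
        return (int) ((h >>> (64 - (i + 1) * IDX_BITS)) & (M - 1));
    }

    // Insert nKeys hashes whose bits under topMask are forced to topValue, then
    // probe nProbes fresh hashes under the same constraint; return the fraction
    // that falsely appear present.
    static double fpRate(long topMask, long topValue, int nKeys, int nProbes, long seed) {
        Random rnd = new Random(seed);
        BitSet bits = new BitSet(M);
        for (int n = 0; n < nKeys; n++) {
            long h = (rnd.nextLong() & ~topMask) | topValue;
            for (int i = 0; i < K; i++) bits.set(index(h, i));
        }
        int hits = 0;
        for (int n = 0; n < nProbes; n++) {
            long h = (rnd.nextLong() & ~topMask) | topValue;
            boolean positive = true;
            for (int i = 0; i < K; i++) positive &= bits.get(index(h, i));
            if (positive) hits++;
        }
        return (double) hits / nProbes;
    }

    public static void main(String[] args) {
        long mask = 0xFFFF000000000000L; // fix the top 16 bits, like a node-range prefix
        double uniform = fpRate(0L, 0L, 2000, 100_000, 1);
        double biased = fpRate(mask, 0x1234L << 48, 2000, 100_000, 1);
        System.out.printf("fp uniform=%.4f, top-bits-fixed=%.4f%n", uniform, biased);
    }
}
```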
[jira] [Updated] (CASSANDRA-8413) Bloom filter false positive ratio is not honoured
[ https://issues.apache.org/jira/browse/CASSANDRA-8413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-8413: Attachment: 8413.hack.txt Bloom filter false positive ratio is not honoured - Key: CASSANDRA-8413 URL: https://issues.apache.org/jira/browse/CASSANDRA-8413 Project: Cassandra Issue Type: Bug Components: Core Reporter: Benedict Fix For: 2.0.12, 2.1.3 Attachments: 8413.hack.txt Whilst thinking about CASSANDRA-7438 and hash bits, I realised we have a problem with sabotaging our bloom filters when using the murmur3 partitioner. I have performed a very quick test to confirm this risk is real. Since a typical cluster uses the same murmur3 hash for partitioning as we do for bloom filter lookups, and we own a contiguous range, we can guarantee that the top X bits collide for all keys on the node. This translates into poor bloom filter distribution. I quickly hacked LongBloomFilterTest to simulate the problem, and the result in these tests is _up to_ a doubling of the actual false positive ratio. The actual change will depend on the key distribution, the number of keys, the false positive ratio, the number of nodes, the token distribution, etc. But seems to be a real problem for non-vnode clusters of at least ~128 nodes in size. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (CASSANDRA-7882) Memtable slab allocation should scale logarithmically to improve occupancy rate
[ https://issues.apache.org/jira/browse/CASSANDRA-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict reassigned CASSANDRA-7882: --- Assignee: Benedict (was: Jay Patel) Memtable slab allocation should scale logarithmically to improve occupancy rate --- Key: CASSANDRA-7882 URL: https://issues.apache.org/jira/browse/CASSANDRA-7882 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jay Patel Assignee: Benedict Labels: performance Fix For: 2.1.3 Attachments: trunk-7882.txt CASSANDRA-5935 allows option to disable region-based allocation for on-heap memtables but there is no option to disable it for off-heap memtables (memtable_allocation_type: offheap_objects). Disabling region-based allocation will allow us to pack more tables in the schema since minimum of 1MB region won't be allocated per table. Downside can be more fragmentation which should be controllable by using better allocator like JEMalloc. How about below option in yaml?: memtable_allocation_type: unslabbed_offheap_objects Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7882) Memtable slab allocation should scale logarithmically to improve occupancy rate
[ https://issues.apache.org/jira/browse/CASSANDRA-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233170#comment-14233170 ] Benedict commented on CASSANDRA-7882: - I've posted a variant of the patch [here|https://github.com/belliottsmith/cassandra/tree/7882-nativeallocator] There are a few changes, a couple unrelated just cleaning up the class: # removed the unslabbed and regionCount variables, as they weren't used for anything important # removed the nextRegionSize variable: it wasn't being maintained atomically, but just as importantly it's messy to do it separately: #* instead of setting a full region to null, we swap it straight to a new region, using the prior region to determine the size of the new region #* we ensure the new region size is at least large enough to hold the allocation we're inserting # we cap the size of each race allocated queue to 8 entries, as this should permit plenty of leeway for avoiding heavy competition thrashing the allocator, but not so much that we have a lot of primarily unused memory There is one issue, though, which is if this should make it into 2.1, or wait until 3.0. I'm pretty comfortable either way, but my gut feeling is others will prefer it wait until 3.0. [~jbellis], what's your view? Memtable slab allocation should scale logarithmically to improve occupancy rate --- Key: CASSANDRA-7882 URL: https://issues.apache.org/jira/browse/CASSANDRA-7882 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jay Patel Assignee: Jay Patel Labels: performance Fix For: 2.1.3 Attachments: trunk-7882.txt CASSANDRA-5935 allows option to disable region-based allocation for on-heap memtables but there is no option to disable it for off-heap memtables (memtable_allocation_type: offheap_objects). Disabling region-based allocation will allow us to pack more tables in the schema since minimum of 1MB region won't be allocated per table. 
Downside can be more fragmentation which should be controllable by using better allocator like JEMalloc. How about below option in yaml?: memtable_allocation_type: unslabbed_offheap_objects Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
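The region-swap sizing rules listed in the comment above (replace a full region with a new one sized from the prior region, and ensure it is at least large enough for the triggering allocation) can be sketched as follows. The constants and the `nextRegionSize` helper are hypothetical, chosen for illustration rather than taken from the linked branch.

```java
// Sketch of logarithmically scaling slab region sizes (hypothetical constants
// and helper; not the code in the linked 7882-nativeallocator branch). A
// replacement region doubles the prior one up to a 1MB ceiling, and is always
// grown far enough to hold the allocation that triggered the swap.
public class SlabSizing {
    static final int MIN_REGION = 8 * 1024;    // assumed floor for illustration
    static final int MAX_REGION = 1024 * 1024; // the 1MB region from the ticket

    static int nextRegionSize(int priorSize, int pendingAllocation) {
        // double the prior region, clamped between the floor and the ceiling
        int size = Math.min(MAX_REGION, Math.max(MIN_REGION, priorSize * 2));
        // the new region must at least fit the allocation being inserted
        while (size < pendingAllocation) size <<= 1;
        return size;
    }

    public static void main(String[] args) {
        System.out.println(nextRegionSize(8 * 1024, 100));         // 16384
        System.out.println(nextRegionSize(1024 * 1024, 100));      // 1048576: capped
        System.out.println(nextRegionSize(16 * 1024, 200 * 1024)); // 262144: fits allocation
    }
}
```

Starting small and doubling toward the cap is what improves occupancy: tables with little data no longer pin a full 1MB region each, while hot tables quickly reach full-size regions.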
[jira] [Updated] (CASSANDRA-7882) Memtable slab allocation should scale logarithmically to improve occupancy rate
[ https://issues.apache.org/jira/browse/CASSANDRA-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-7882: Reviewer: Jay Patel (was: Benedict) Memtable slab allocation should scale logarithmically to improve occupancy rate --- Key: CASSANDRA-7882 URL: https://issues.apache.org/jira/browse/CASSANDRA-7882 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jay Patel Assignee: Benedict Labels: performance Fix For: 2.1.3 Attachments: trunk-7882.txt CASSANDRA-5935 allows option to disable region-based allocation for on-heap memtables but there is no option to disable it for off-heap memtables (memtable_allocation_type: offheap_objects). Disabling region-based allocation will allow us to pack more tables in the schema since minimum of 1MB region won't be allocated per table. Downside can be more fragmentation which should be controllable by using better allocator like JEMalloc. How about below option in yaml?: memtable_allocation_type: unslabbed_offheap_objects Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8414) Avoid loops over array backed iterators that call iter.remove()
[ https://issues.apache.org/jira/browse/CASSANDRA-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-8414: Summary: Avoid loops over array backed iterators that call iter.remove() (was: Compaction is O(n^2) when deleting lots of tombstones) Avoid loops over array backed iterators that call iter.remove() --- Key: CASSANDRA-8414 URL: https://issues.apache.org/jira/browse/CASSANDRA-8414 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low I noticed from sampling that sometimes compaction spends almost all of its time in iter.remove() in ColumnFamilyStore.removeDeletedStandard. It turns out that the cf object is using ArrayBackedSortedColumns, so deletes are from an ArrayList. If the majority of your columns are GCable tombstones then this is O(n^2). The data structure should be changed or a copy made to avoid this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8414) Avoid loops over array backed iterators that call iter.remove()
[ https://issues.apache.org/jira/browse/CASSANDRA-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14233485#comment-14233485 ] Benedict commented on CASSANDRA-8414: - I've edited the title because it's not quite that compaction is O(n^2), but that certain operations within a partition are. It's also not limited to just that specific method. The best solution is probably to introduce a special deletion iterator on which a call to remove() simply sets a corresponding bit to 1; once we exhaust the iterator we commit the deletes in one pass. Avoid loops over array backed iterators that call iter.remove() --- Key: CASSANDRA-8414 URL: https://issues.apache.org/jira/browse/CASSANDRA-8414 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low I noticed from sampling that sometimes compaction spends almost all of its time in iter.remove() in ColumnFamilyStore.removeDeletedStandard. It turns out that the cf object is using ArrayBackedSortedColumns, so deletes are from an ArrayList. If the majority of your columns are GCable tombstones then this is O(n^2). The data structure should be changed or a copy made to avoid this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
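The proposed deletion iterator can be sketched as follows; `MarkingIterator` and its `commit()` method are hypothetical names, not the eventual patch. `remove()` only sets a bit, and one pass after the iterator is exhausted compacts the backing ArrayList, turning O(n) work per removal (O(n^2) overall) into a single O(n) sweep.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Sketch of the proposed deletion iterator (hypothetical names): remove()
// only marks a bit, and commit() compacts the backing list in one O(n) pass
// instead of shifting the ArrayList tail on every removal.
public class MarkingIterator<T> implements Iterator<T> {
    private final List<T> list;
    private final BitSet deleted = new BitSet();
    private int next = 0;

    public MarkingIterator(List<T> list) { this.list = list; }

    public boolean hasNext() { return next < list.size(); }

    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        return list.get(next++);
    }

    public void remove() { deleted.set(next - 1); } // O(1): just mark the slot

    // single pass: shift survivors left, then trim the tail
    public void commit() {
        int w = 0;
        for (int r = 0; r < list.size(); r++)
            if (!deleted.get(r)) list.set(w++, list.get(r));
        list.subList(w, list.size()).clear();
    }

    public static void main(String[] args) {
        List<Integer> cells = new ArrayList<>();
        for (int i = 0; i < 10; i++) cells.add(i);
        MarkingIterator<Integer> it = new MarkingIterator<>(cells);
        while (it.hasNext()) if (it.next() % 2 == 0) it.remove(); // drop "tombstones"
        it.commit();
        System.out.println(cells); // prints [1, 3, 5, 7, 9]
    }
}
```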
[jira] [Updated] (CASSANDRA-8414) Avoid loops over array backed iterators that call iter.remove()
[ https://issues.apache.org/jira/browse/CASSANDRA-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-8414: Labels: performance (was: ) Avoid loops over array backed iterators that call iter.remove() --- Key: CASSANDRA-8414 URL: https://issues.apache.org/jira/browse/CASSANDRA-8414 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low Labels: performance I noticed from sampling that sometimes compaction spends almost all of its time in iter.remove() in ColumnFamilyStore.removeDeletedStandard. It turns out that the cf object is using ArrayBackedSortedColumns, so deletes are from an ArrayList. If the majority of your columns are GCable tombstones then this is O(n^2). The data structure should be changed or a copy made to avoid this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8414) Avoid loops over array backed iterators that call iter.remove()
[ https://issues.apache.org/jira/browse/CASSANDRA-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-8414: Fix Version/s: 2.1.3 Avoid loops over array backed iterators that call iter.remove() --- Key: CASSANDRA-8414 URL: https://issues.apache.org/jira/browse/CASSANDRA-8414 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low Labels: performance Fix For: 2.1.3 I noticed from sampling that sometimes compaction spends almost all of its time in iter.remove() in ColumnFamilyStore.removeDeletedStandard. It turns out that the cf object is using ArrayBackedSortedColumns, so deletes are from an ArrayList. If the majority of your columns are GCable tombstones then this is O(n^2). The data structure should be changed or a copy made to avoid this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-8431) Stress should validate the results of queries in user profile mode
Benedict created CASSANDRA-8431: --- Summary: Stress should validate the results of queries in user profile mode Key: CASSANDRA-8431 URL: https://issues.apache.org/jira/browse/CASSANDRA-8431 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Benedict CASSANDRA-8429 was exhibited by the validation logic in stress. However at the moment the new-fangled profile driven user mode doesn't perform any validation. So as we default more and more to the new approach we will be less and less likely to spot correctness issues. Introducing validation logic here could be tricky, since we can support arbitrary user queries. However we could support a query mode where only the columns and number of cql rows to fetch are defined, for which we could calculate the exact result set we expect. There would be complications with insertions that proceed out-of-order, but we could either not support this mode, or have a validation mode that just ensures a superset of the data we know to be inserted has been. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-8434) L0 should have a separate configurable bloom filter false positive ratio
Benedict created CASSANDRA-8434: --- Summary: L0 should have a separate configurable bloom filter false positive ratio Key: CASSANDRA-8434 URL: https://issues.apache.org/jira/browse/CASSANDRA-8434 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Fix For: 2.0.12, 2.1.3 In follow-up to CASSANDRA-5371. We now perform size-tiered file selection for compaction if L0 gets too far behind, however as far as I can tell we stick with the CF configured false positive ratio, likely substantially inflating the number of files we visit on average until L0 is cleaned up. Having a different bloom filter false positive ratio for L0 would solve this problem without introducing any significant burden when L0 is not overloaded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8409) Node generating a huge number of tiny sstable_activity flushes
[ https://issues.apache.org/jira/browse/CASSANDRA-8409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237708#comment-14237708 ] Benedict commented on CASSANDRA-8409: - The full system log would be helpful for diagnosis. However if there are a lot of competing updates to a single partition (e.g. lots of non-batch inserts into a single partition key) then it's possible CASSANDRA-8018 could have triggered this. By applying the update function twice, we would screw up our memory count if the update fails to apply due to competition. If this happened often enough we could get to a situation where the cleaner is incapable of generating a task that will clean enough memory, so it tries to flush on every allocation. Node generating a huge number of tiny sstable_activity flushes -- Key: CASSANDRA-8409 URL: https://issues.apache.org/jira/browse/CASSANDRA-8409 Project: Cassandra Issue Type: Bug Components: Core Environment: Cassandra 2.1.0, Oracle JDK 1.8.0_25, Ubuntu 12.04 Reporter: Fred Wulff Fix For: 2.1.3 Attachments: system-sstable_activity-ka-67802-Data.db On one of my nodes, I’m seeing hundreds per second of “INFO 21:28:05 Enqueuing flush of sstable_activity: 0 (0%) on-heap, 33 (0%) off-heap”. tpstats shows a steadily climbing # of pending MemtableFlushWriter/MemtablePostFlush until the node OOMs. When the flushes actually happen the sstable written is invariably 121 bytes. I’m writing pretty aggressively to one of my user tables (sev.mdb_group_pit), but that table's flushing behavior seems reasonable. 
tpstats:
{quote}
frew@hostname:~/s_dist/apache-cassandra-2.1.0$ bin/nodetool -h hostname tpstats
Pool Name                     Active  Pending  Completed  Blocked  All time blocked
MutationStage                    128     4429      36810        0                 0
ReadStage                          0        0       1205        0                 0
RequestResponseStage               0        0      24910        0                 0
ReadRepairStage                    0        0         26        0                 0
CounterMutationStage               0        0          0        0                 0
MiscStage                          0        0          0        0                 0
HintedHandoff                      2        2          9        0                 0
GossipStage                        0        0       5157        0                 0
CacheCleanupExecutor               0        0          0        0                 0
InternalResponseStage              0        0          0        0                 0
CommitLogArchiver                  0        0          0        0                 0
CompactionExecutor428429 0 0
ValidationExecutor                 0        0          0        0                 0
MigrationStage                     0        0          0        0                 0
AntiEntropyStage                   0        0          0        0                 0
PendingRangeCalculator             0        0         11        0                 0
MemtableFlushWriter                8    38644       8987        0                 0
MemtablePostFlush                  1    38940       8735        0                 0
MemtableReclaimMemory              0        0       8987        0                 0

Message type           Dropped
READ                         0
RANGE_SLICE                  0
_TRACE                       0
MUTATION                 10457
COUNTER_MUTATION             0
BINARY                       0
REQUEST_RESPONSE             0
PAGED_RANGE                  0
READ_REPAIR                208
{quote}
I've attached one of the produced sstables. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7873) Replace AbstractRowResolver.replies with collection with tailored properties
[ https://issues.apache.org/jira/browse/CASSANDRA-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237740#comment-14237740 ] Benedict commented on CASSANDRA-7873: - mea culpa. thanks Replace AbstractRowResolver.replies with collection with tailored properties Key: CASSANDRA-7873 URL: https://issues.apache.org/jira/browse/CASSANDRA-7873 Project: Cassandra Issue Type: Bug Environment: OSX and Ubuntu 14.04 Reporter: Philip Thompson Assignee: Benedict Fix For: 3.0 Attachments: 7873.21.txt, 7873.trunk.txt, 7873.txt, 7873_fixup.txt The dtest auth_test.py:TestAuth.system_auth_ks_is_alterable_test is failing on trunk only with the following stack trace: {code} Unexpected error in node1 node log: ERROR [Thrift:1] 2014-09-03 15:48:08,389 CustomTThreadPoolServer.java:219 - Error occurred during processing of message. java.util.ConcurrentModificationException: null at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:859) ~[na:1.7.0_65] at java.util.ArrayList$Itr.next(ArrayList.java:831) ~[na:1.7.0_65] at org.apache.cassandra.service.RowDigestResolver.resolve(RowDigestResolver.java:71) ~[main/:na] at org.apache.cassandra.service.RowDigestResolver.resolve(RowDigestResolver.java:28) ~[main/:na] at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:110) ~[main/:na] at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:144) ~[main/:na] at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1228) ~[main/:na] at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1154) ~[main/:na] at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:256) ~[main/:na] at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:212) ~[main/:na] at org.apache.cassandra.auth.Auth.selectUser(Auth.java:257) ~[main/:na] at org.apache.cassandra.auth.Auth.isExistingUser(Auth.java:76) ~[main/:na] at 
org.apache.cassandra.service.ClientState.login(ClientState.java:178) ~[main/:na] at org.apache.cassandra.thrift.CassandraServer.login(CassandraServer.java:1486) ~[main/:na] at org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3579) ~[thrift/:na] at org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3563) ~[thrift/:na] at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[libthrift-0.9.1.jar:0.9.1] at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[libthrift-0.9.1.jar:0.9.1] at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:201) ~[main/:na] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_65] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_65] at java.lang.Thread.run(Thread.java:745) [na:1.7.0_65] {code} That exception is thrown when the following query is sent: {code} SELECT strategy_options FROM system.schema_keyspaces WHERE keyspace_name = 'system_auth' {code} The test alters the RF of the system_auth keyspace, then shuts down and restarts the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8308) Windows: Commitlog access violations on unit tests
[ https://issues.apache.org/jira/browse/CASSANDRA-8308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238128#comment-14238128 ] Benedict commented on CASSANDRA-8308: - Sure. Will review tomorrow. Windows: Commitlog access violations on unit tests -- Key: CASSANDRA-8308 URL: https://issues.apache.org/jira/browse/CASSANDRA-8308 Project: Cassandra Issue Type: Bug Reporter: Joshua McKenzie Assignee: Joshua McKenzie Priority: Minor Labels: Windows Fix For: 3.0 Attachments: 8308_v1.txt We have four unit tests failing on trunk on Windows, all with FileSystemException's related to the SchemaLoader: {noformat} [junit] Test org.apache.cassandra.db.compaction.DateTieredCompactionStrategyTest FAILED [junit] Test org.apache.cassandra.cql3.ThriftCompatibilityTest FAILED [junit] Test org.apache.cassandra.io.sstable.SSTableRewriterTest FAILED [junit] Test org.apache.cassandra.repair.LocalSyncTaskTest FAILED {noformat} Example error: {noformat} [junit] Caused by: java.nio.file.FileSystemException: build\test\cassandra\commitlog;0\CommitLog-5-1415908745965.log: The process cannot access the file because it is being used by another process. [junit] [junit] at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86) [junit] at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97) [junit] at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102) [junit] at sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269) [junit] at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103) [junit] at java.nio.file.Files.delete(Files.java:1079) [junit] at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:125) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8308) Windows: Commitlog access violations on unit tests
[ https://issues.apache.org/jira/browse/CASSANDRA-8308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-8308: Reviewer: Benedict Windows: Commitlog access violations on unit tests -- Key: CASSANDRA-8308 URL: https://issues.apache.org/jira/browse/CASSANDRA-8308 Project: Cassandra Issue Type: Bug Reporter: Joshua McKenzie Assignee: Joshua McKenzie Priority: Minor Labels: Windows Fix For: 3.0 Attachments: 8308_v1.txt We have four unit tests failing on trunk on Windows, all with FileSystemException's related to the SchemaLoader: {noformat} [junit] Test org.apache.cassandra.db.compaction.DateTieredCompactionStrategyTest FAILED [junit] Test org.apache.cassandra.cql3.ThriftCompatibilityTest FAILED [junit] Test org.apache.cassandra.io.sstable.SSTableRewriterTest FAILED [junit] Test org.apache.cassandra.repair.LocalSyncTaskTest FAILED {noformat} Example error: {noformat} [junit] Caused by: java.nio.file.FileSystemException: build\test\cassandra\commitlog;0\CommitLog-5-1415908745965.log: The process cannot access the file because it is being used by another process. [junit] [junit] at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86) [junit] at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97) [junit] at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102) [junit] at sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269) [junit] at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103) [junit] at java.nio.file.Files.delete(Files.java:1079) [junit] at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:125) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8308) Windows: Commitlog access violations on unit tests
[ https://issues.apache.org/jira/browse/CASSANDRA-8308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239291#comment-14239291 ] Benedict commented on CASSANDRA-8308: - * channel.truncate() is not equivalent to raf.setLength(), and we want the length to be set upfront to somewhat ensure contiguity * it would be nice to extract the "is linux" decision into an enum and embed it in FBUtilities, where we already have an isUnix() method (and an OPERATING_SYSTEM property that could be converted to the enum) Windows: Commitlog access violations on unit tests -- This message was sent by Atlassian JIRA (v6.3.4#6332)
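The distinction the review comment draws, that raf.setLength() can grow a file while FileChannel.truncate() can only shrink one, can be illustrated with a minimal sketch (the demo class is hypothetical, not Cassandra's commit log code):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class PreallocateDemo {
    // Preallocate a segment file to its full size up front: setLength()
    // extends the file to `size`, whereas FileChannel.truncate(size) is a
    // no-op when the file is already shorter than `size`.
    static long preallocate(File f, long size) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            raf.setLength(size); // grows (or shrinks) to exactly `size`
            return raf.length();
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("segment", ".log");
        f.deleteOnExit();
        System.out.println(preallocate(f, 1 << 20)); // 1048576
    }
}
```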
[jira] [Commented] (CASSANDRA-6993) Windows: remove mmap'ed I/O for index files and force standard file access
[ https://issues.apache.org/jira/browse/CASSANDRA-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239307#comment-14239307 ] Benedict commented on CASSANDRA-6993: - Replacing isUnix() with !isWindows() is not functionally equivalent; this will capture Mac, Solaris, OpenBSD, FreeBSD and others as well. Although in many situations this actually adequately captures what we want (such as for your specific change), it likely won't in all cases. As with CASSANDRA-8038 this would benefit from sanitising our OS detection. Perhaps we could split this out into a minor ticket these both depend upon, as we have a bit of a mess right now that permits these sorts of logical mismatches to crop up. We should probably group POSIX-compliant OSes together, and POSIX-compliant file systems together, one of which is probably what we generally mean when we say isUnix(). Windows: remove mmap'ed I/O for index files and force standard file access -- Key: CASSANDRA-6993 URL: https://issues.apache.org/jira/browse/CASSANDRA-6993 Project: Cassandra Issue Type: Improvement Reporter: Joshua McKenzie Assignee: Joshua McKenzie Priority: Minor Labels: Windows Fix For: 3.0, 2.1.3 Attachments: 6993_2.1_v1.txt, 6993_v1.txt, 6993_v2.txt Memory-mapped I/O on Windows causes issues with hard-links; we're unable to delete hard-links to open files with memory-mapped segments even using nio. We'll need to push for close to performance parity between mmap'ed I/O and buffered going forward as the buffered / compressed path offers other benefits. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
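The sanitised OS detection proposed in the comment might look something like the following sketch; the enum and helper names here are hypothetical, not Cassandra's actual FBUtilities API:

```java
// A minimal sketch of an OS-detection enum that distinguishes Linux from
// other POSIX systems, so that !isWindows() and isUnix() stop being conflated.
public class OsDetection {
    enum OperatingSystem {
        LINUX, MAC, WINDOWS, OTHER_POSIX;

        // the grouping of POSIX-compliant OSes the comment suggests
        boolean isPosix() { return this != WINDOWS; }
    }

    static OperatingSystem fromName(String osName) {
        String os = osName.toLowerCase();
        if (os.contains("windows")) return OperatingSystem.WINDOWS;
        if (os.contains("linux"))   return OperatingSystem.LINUX;
        if (os.contains("mac"))     return OperatingSystem.MAC;
        return OperatingSystem.OTHER_POSIX; // Solaris, FreeBSD, OpenBSD, ...
    }

    public static void main(String[] args) {
        System.out.println(fromName(System.getProperty("os.name")));
    }
}
```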
[jira] [Commented] (CASSANDRA-8414) Avoid loops over array backed iterators that call iter.remove()
[ https://issues.apache.org/jira/browse/CASSANDRA-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14239312#comment-14239312 ] Benedict commented on CASSANDRA-8414: - We should integrate this for 2.1 also, since this behaviour is exhibited still, just not in compaction. In 2.1 we should use System.arraycopy and removed.nextSetBit though, as the performance will be improved, particularly for sparse removes. Avoid loops over array backed iterators that call iter.remove() --- Key: CASSANDRA-8414 URL: https://issues.apache.org/jira/browse/CASSANDRA-8414 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low Assignee: Jimmy Mårdell Labels: performance Fix For: 2.1.3 Attachments: cassandra-2.0-8414-1.txt I noticed from sampling that sometimes compaction spends almost all of its time in iter.remove() in ColumnFamilyStore.removeDeletedStandard. It turns out that the cf object is using ArrayBackedSortedColumns, so deletes are from an ArrayList. If the majority of your columns are GCable tombstones then this is O(n^2). The data structure should be changed or a copy made to avoid this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-7873) Replace AbstractRowResolver.replies with collection with tailored properties
[ https://issues.apache.org/jira/browse/CASSANDRA-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237740#comment-14237740 ] Benedict edited comment on CASSANDRA-7873 at 12/9/14 11:43 AM: --- mea culpa. thanks +1 was (Author: benedict): mea culpa. thanks Replace AbstractRowResolver.replies with collection with tailored properties Key: CASSANDRA-7873 URL: https://issues.apache.org/jira/browse/CASSANDRA-7873 Project: Cassandra Issue Type: Bug Environment: OSX and Ubuntu 14.04 Reporter: Philip Thompson Assignee: Benedict Fix For: 3.0 Attachments: 7873.21.txt, 7873.trunk.txt, 7873.txt, 7873_fixup.txt The dtest auth_test.py:TestAuth.system_auth_ks_is_alterable_test is failing on trunk only with the following stack trace: {code} Unexpected error in node1 node log: ERROR [Thrift:1] 2014-09-03 15:48:08,389 CustomTThreadPoolServer.java:219 - Error occurred during processing of message. java.util.ConcurrentModificationException: null at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:859) ~[na:1.7.0_65] at java.util.ArrayList$Itr.next(ArrayList.java:831) ~[na:1.7.0_65] at org.apache.cassandra.service.RowDigestResolver.resolve(RowDigestResolver.java:71) ~[main/:na] at org.apache.cassandra.service.RowDigestResolver.resolve(RowDigestResolver.java:28) ~[main/:na] at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:110) ~[main/:na] at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:144) ~[main/:na] at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1228) ~[main/:na] at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1154) ~[main/:na] at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:256) ~[main/:na] at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:212) ~[main/:na] at org.apache.cassandra.auth.Auth.selectUser(Auth.java:257) ~[main/:na] at 
org.apache.cassandra.auth.Auth.isExistingUser(Auth.java:76) ~[main/:na] at org.apache.cassandra.service.ClientState.login(ClientState.java:178) ~[main/:na] at org.apache.cassandra.thrift.CassandraServer.login(CassandraServer.java:1486) ~[main/:na] at org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3579) ~[thrift/:na] at org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3563) ~[thrift/:na] at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[libthrift-0.9.1.jar:0.9.1] at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[libthrift-0.9.1.jar:0.9.1] at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:201) ~[main/:na] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_65] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_65] at java.lang.Thread.run(Thread.java:745) [na:1.7.0_65] {code} That exception is thrown when the following query is sent: {code} SELECT strategy_options FROM system.schema_keyspaces WHERE keyspace_name = 'system_auth' {code} The test alters the RF of the system_auth keyspace, then shuts down and restarts the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (CASSANDRA-8312) Use live sstables in snapshot repair if possible
[ https://issues.apache.org/jira/browse/CASSANDRA-8312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict reopened CASSANDRA-8312: - It looks to me like this doesn't tidy up after itself properly, at least on trunk. It opens an sstable from the snapshot if necessary, references it, and then releases only the reference it acquired - not the extra reference that would permit its BF etc. to be reclaimed. So this will likely leak significant amounts of memory. Use live sstables in snapshot repair if possible Key: CASSANDRA-8312 URL: https://issues.apache.org/jira/browse/CASSANDRA-8312 Project: Cassandra Issue Type: Improvement Reporter: Jimmy Mårdell Assignee: Jimmy Mårdell Priority: Minor Fix For: 2.0.12, 3.0, 2.1.3 Attachments: cassandra-2.0-8312-1.txt Snapshot repair can be very much slower than parallel repairs because of the overhead of opening the SSTables in the snapshot. This is particularly true when using LCS, as you typically have many smaller SSTables then. I compared parallel and sequential repair on a small range on one of our clusters (2*3 replicas). With parallel repair, this took 22 seconds. With sequential repair (default in 2.0), the same range took 330 seconds! This is an overhead of 330-22*6 = 198 seconds, just opening SSTables (there were 1000+ sstables). Also, opening 1000 sstables for many smaller ranges surely causes lots of memory churning. The idea would be to list the sstables in the snapshot, but use the corresponding sstables in the live set if they're still available. For almost all sstables, the original one should still exist. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7705) Safer Resource Management
[ https://issues.apache.org/jira/browse/CASSANDRA-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239408#comment-14239408 ] Benedict commented on CASSANDRA-7705: - I have updated the repository with a rebased version, with some improved comments and a debug mode. This is essentially free given Java's object alignment behaviour and run-time optimisation (the field doesn't occupy any memory we wouldn't otherwise be occupying, and the relevant statements will be optimised away). Safer Resource Management - Key: CASSANDRA-7705 URL: https://issues.apache.org/jira/browse/CASSANDRA-7705 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Benedict Fix For: 3.0 We've had a spate of bugs recently with bad reference counting. These can have potentially dire consequences, generally either randomly deleting data or giving us infinite loops. Since in 2.1 we only reference count resources that are relatively expensive and infrequently managed (or in places where this safety is probably not as necessary, e.g. SerializingCache), we could without any negative consequences (and only slight code complexity) introduce a safer resource management scheme for these more expensive/infrequent actions. Basically, I propose when we want to acquire a resource we allocate an object that manages the reference. This can only be released once; if it is released twice, we fail immediately at the second release, reporting where the bug is (rather than letting it continue fine until the next correct release corrupts the count). The reference counter remains the same, but we obtain guarantees that the reference count itself is never badly maintained, although code using it could mistakenly release its own handle early (typically this is only an issue when cleaning up after a failure, in which case under the new scheme this would be an innocuous error) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
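The release-once scheme the ticket proposes can be sketched as follows; the class names are hypothetical, and the actual patch lives in the linked repository:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class RefDemo {
    static class Resource {
        final AtomicInteger refCount = new AtomicInteger(1);
    }

    // A handle that can be released exactly once: a double release fails
    // immediately at the buggy call site, instead of silently corrupting
    // the shared count so that a later *correct* release frees live data.
    static class Ref {
        private final Resource resource;
        private final AtomicBoolean released = new AtomicBoolean(false);

        Ref(Resource resource) {
            this.resource = resource;
            resource.refCount.incrementAndGet();
        }

        void release() {
            if (!released.compareAndSet(false, true))
                throw new IllegalStateException("reference released twice; this release is the bug");
            resource.refCount.decrementAndGet();
        }
    }

    public static void main(String[] args) {
        Resource r = new Resource();
        Ref ref = new Ref(r);
        ref.release();
        try { ref.release(); }
        catch (IllegalStateException e) { System.out.println("caught double release"); }
        System.out.println(r.refCount.get()); // 1: the count never went negative
    }
}
```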
[jira] [Updated] (CASSANDRA-8414) Avoid loops over array backed iterators that call iter.remove()
[ https://issues.apache.org/jira/browse/CASSANDRA-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-8414: Attachment: cassandra-2.0-8414-3.txt Nice backporting of the better approach. I've uploaded a tweaked version, the goal of which was just to clean up the variable names (and switch to a while loop) so it's more obvious what's happening. But while at it I also added use of nextClearBit in tandem with nextSetBit, as it's a minor tweak but gives better behaviour with runs of adjacent removes. I haven't properly reviewed otherwise, but it might be worth introducing this to CFS.removeDroppedColumns() and SliceQueryFilter.trim(). Avoid loops over array backed iterators that call iter.remove() --- Key: CASSANDRA-8414 URL: https://issues.apache.org/jira/browse/CASSANDRA-8414 Project: Cassandra Issue Type: Bug Components: Core Reporter: Richard Low Assignee: Jimmy Mårdell Labels: performance Fix For: 2.1.3 Attachments: cassandra-2.0-8414-1.txt, cassandra-2.0-8414-2.txt, cassandra-2.0-8414-3.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
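A rough sketch of the nextSetBit/nextClearBit approach discussed above: surviving runs between removed indices are compacted in a single O(n) pass, instead of the O(n^2) of repeated iter.remove() on an ArrayList. The real patch works on the backing array with System.arraycopy; List.set is used here only to keep the sketch self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

public class CompactList {
    // Compact `list` in place, dropping every index set in `removed`.
    // Walking nextSetBit/nextClearBit visits each surviving run once,
    // which behaves well for both sparse removes and runs of adjacent ones.
    static <T> void removeAll(List<T> list, BitSet removed) {
        int write = removed.nextSetBit(0);       // first hole to fill
        if (write < 0 || write >= list.size())
            return;                              // nothing to remove
        int read = removed.nextClearBit(write);  // first survivor after it
        while (read < list.size()) {
            int runEnd = removed.nextSetBit(read);
            if (runEnd < 0 || runEnd > list.size())
                runEnd = list.size();
            for (int i = read; i < runEnd; i++)  // shift this survivor run down
                list.set(write++, list.get(i));
            read = removed.nextClearBit(runEnd);
        }
        list.subList(write, list.size()).clear(); // trim the tail in one call
    }

    public static void main(String[] args) {
        List<Integer> xs = new ArrayList<>(Arrays.asList(0, 1, 2, 3, 4, 5));
        BitSet dead = new BitSet();
        dead.set(1); dead.set(2); dead.set(4);
        removeAll(xs, dead);
        System.out.println(xs); // [0, 3, 5]
    }
}
```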
[jira] [Commented] (CASSANDRA-8312) Use live sstables in snapshot repair if possible
[ https://issues.apache.org/jira/browse/CASSANDRA-8312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239638#comment-14239638 ] Benedict commented on CASSANDRA-8312: - bq. it should be enough to just remove the row sstable.acquireReference() Yes, agreed. But I'll let Yuki review and make that change since he's more familiar with this area of the codebase. Use live sstables in snapshot repair if possible -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7882) Memtable slab allocation should scale logarithmically to improve occupancy rate
[ https://issues.apache.org/jira/browse/CASSANDRA-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14239891#comment-14239891 ] Benedict commented on CASSANDRA-7882: - Yes, I don't think that's a problem. Memtable slab allocation should scale logarithmically to improve occupancy rate --- Key: CASSANDRA-7882 URL: https://issues.apache.org/jira/browse/CASSANDRA-7882 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jay Patel Assignee: Benedict Labels: performance Fix For: 2.1.3 Attachments: trunk-7882.txt CASSANDRA-5935 allows option to disable region-based allocation for on-heap memtables but there is no option to disable it for off-heap memtables (memtable_allocation_type: offheap_objects). Disabling region-based allocation will allow us to pack more tables in the schema since minimum of 1MB region won't be allocated per table. Downside can be more fragmentation which should be controllable by using better allocator like JEMalloc. How about below option in yaml?: memtable_allocation_type: unslabbed_offheap_objects Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8449) Allow zero-copy reads again
[ https://issues.apache.org/jira/browse/CASSANDRA-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14240975#comment-14240975 ] Benedict commented on CASSANDRA-8449: - Unless we explicitly force all queries to yield a timeout response even if they have successfully terminated after the timeout, and we enforce this constraint _after_ copying the data to the output buffers (netty and thrift), this is guaranteed to return junk data to a user somewhere, sometime. So I am -1 on this approach. Allow zero-copy reads again --- Key: CASSANDRA-8449 URL: https://issues.apache.org/jira/browse/CASSANDRA-8449 Project: Cassandra Issue Type: Improvement Reporter: T Jake Luciani Assignee: T Jake Luciani Priority: Minor Labels: performance Fix For: 3.0 We disabled zero-copy reads in CASSANDRA-3179 due to in flight reads accessing a ByteBuffer when the data was unmapped by compaction. Currently this code path is only used for uncompressed reads. The actual bytes are in fact copied to the client output buffers for both netty and thrift before being sent over the wire, so the only issue really is the time it takes to process the read internally. This patch adds a slow network read test and changes the tidy() method to actually delete a sstable once the readTimeout has elapsed giving plenty of time to serialize the read. Removing this copy causes significantly less GC on the read path and improves the tail latencies: http://cstar.datastax.com/graph?stats=c0c8ce16-7fea-11e4-959d-42010af0688fmetric=gc_countoperation=2_readsmoothing=1show_aggregates=truexmin=0xmax=109.34ymin=0ymax=5.5 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8449) Allow zero-copy reads again
[ https://issues.apache.org/jira/browse/CASSANDRA-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241004#comment-14241004 ] Benedict commented on CASSANDRA-8449: - Depending on how that is implemented. I will go out on a limb and predict it will offer no such guarantee, as there will always be a potential race condition (easily triggered by e.g. lengthy GC pauses) without enforcing the constraint _after_ performing the copy to the transport buffers, which is a very specific condition that I don't think is being considered for CASSANDRA-7392. Allow zero-copy reads again -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8449) Allow zero-copy reads again
[ https://issues.apache.org/jira/browse/CASSANDRA-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241215#comment-14241215 ] Benedict commented on CASSANDRA-8449: - CASSANDRA-7705 is really designed for situations where we know there won't be loads in-flight; I'd prefer not to reintroduce excessive long-lifetime reference counting onto the read critical path (we don't ref count sstable readers anymore, since CASSANDRA-6919). All we're doing here is delaying when we unmap the file until a time it is known to be unused, so we could create a global OpOrder that guards against this; all requests that hit the node are guarded by the OpOrder for their entire duration, and only once _all_ requests that started before we _thought_ the data was free have completed do we actually free it. Typically I would not want to use this approach for guarding operations that could take arbitrarily long, but really all we're sacrificing is virtual address space, so being delayed more than you expect (even excessively) should not noticeably impact system performance, as the OS can choose to drop those pages on the floor, keeping only the mapping overhead. Allow zero-copy reads again -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241349#comment-14241349 ] Benedict commented on CASSANDRA-7032: - If you mean for V vnode tokens in ascending order [0..V), and e.g. D disks, the disks would own one of the token lists in the set { [dV/D..(d+1)V/D) : 0 <= d < D }, and you guarantee that the owned range of each list is balanced with the other lists, this seems pretty analogous to the approach I was describing and perfectly reasonable. The main goal is only that once a range or set of vnode tokens has been assigned to a given resource (disk, cpu, node, rack, whatever) that resource never needs to reassign its tokens. Improve vnode allocation Key: CASSANDRA-7032 URL: https://issues.apache.org/jira/browse/CASSANDRA-7032 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Branimir Lambov Labels: performance, vnodes Fix For: 3.0 Attachments: TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java It's been known for a little while that random vnode allocation causes hotspots of ownership. It should be possible to improve dramatically on this with deterministic allocation. I have quickly thrown together a simple greedy algorithm that allocates vnodes efficiently, and will repair hotspots in a randomly allocated cluster gradually as more nodes are added, and also ensures that token ranges are fairly evenly spread between nodes (somewhat tunably so). The allocation still permits slight discrepancies in ownership, but it is bound by the inverse of the size of the cluster (as opposed to random allocation, which strangely gets worse as the cluster size increases). I'm sure there is a decent dynamic programming solution to this that would be even better.
If on joining the ring a new node were to CAS a shared table where a canonical allocation of token ranges lives after running this (or a similar) algorithm, we could then get guaranteed bounds on the ownership distribution in a cluster. This will also help for CASSANDRA-6696. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
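The contiguous-run assignment described in the comment, where V sorted vnode tokens are split across D disks so that disk d owns token indices [dV/D, (d+1)V/D), can be sketched as follows (helper names are hypothetical, not from the attached TestVNodeAllocation.java):

```java
public class VnodeSplit {
    // First token index owned by disk d: floor(d*V/D).
    // Computed with longs to avoid intermediate overflow.
    static int rangeStart(int d, int V, int D) {
        return (int) (((long) d * V) / D);
    }

    // Inverse mapping: the disk that owns token index t, i.e. the largest d
    // with rangeStart(d) <= t, which works out to floor((t*D + D - 1) / V).
    static int ownerOf(int t, int V, int D) {
        return (int) (((long) t * D + D - 1) / V);
    }

    public static void main(String[] args) {
        int V = 256, D = 3;
        // ownership is contiguous: owners never decrease as the index grows
        for (int t = 1; t < V; t++)
            if (ownerOf(t, V, D) < ownerOf(t - 1, V, D))
                throw new AssertionError("owners must be non-decreasing");
        System.out.println(rangeStart(1, V, D)); // 85
    }
}
```

Because each disk's run is a fixed function of (d, V, D), adding capacity elsewhere never forces a disk to reassign the tokens it already owns, which is the stated goal.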
[jira] [Commented] (CASSANDRA-4139) Add varint encoding to Messaging service
[ https://issues.apache.org/jira/browse/CASSANDRA-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241456#comment-14241456 ] Benedict commented on CASSANDRA-4139: - We aren't bandwidth constrained for any workloads I'm aware of, so what are we hoping to achieve here? We already apply compression to the stream, so this will likely only help bandwidth consumption for individual small payloads where compression cannot be expected to yield much. In such scenarios bandwidth is especially unlikely to be a constraint. Add varint encoding to Messaging service Key: CASSANDRA-4139 URL: https://issues.apache.org/jira/browse/CASSANDRA-4139 Project: Cassandra Issue Type: Sub-task Components: Core Reporter: Vijay Assignee: Ariel Weisberg Fix For: 3.0 Attachments: 0001-CASSANDRA-4139-v1.patch, 0001-CASSANDRA-4139-v2.patch, 0001-CASSANDRA-4139-v4.patch, 0002-add-bytes-written-metric.patch, 4139-Test.rtf, ASF.LICENSE.NOT.GRANTED--0001-CASSANDRA-4139-v3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8449) Allow zero-copy reads again
[ https://issues.apache.org/jira/browse/CASSANDRA-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242439#comment-14242439 ] Benedict commented on CASSANDRA-8449: - bq. Isn't the existing use of OpOrder technically arbitrarily long due to GC for instance Any delay caused by GC to the termination of an OpOrder.Group is instantaneous from the point of view of the waiter, since it is also delayed by GC. Either way, GC is not as arbitrarily long as I was referring to. Mostly I'm thinking about network consumers that haven't died but are, perhaps, in the process of doing so (GC death spiral), or where the network socket has frozen due to some other problem, i.e. where the problem is isolated from the rest of the host's functionality, but by being guarded by an OpOrder could conceivably cause the problem to infect the whole host's functionality. In reality we can probably guard against most of the risk, but I would still be reticent to use this scheme with that risk even minimally present without the ramifications being constrained as they are here. Allow zero-copy reads again -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8447) Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242462#comment-14242462 ] Benedict commented on CASSANDRA-8447: - [~yangzhe1991]: I don't think your problem is related, since it looks to me like you're running 2.1? If so, if you could file another ticket and upload a heap dump from one of your smaller nodes, its config yaml, and a full system log from startup until the problem was encountered I'll see if I can help pinpoint the problem. Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled --- Key: CASSANDRA-8447 URL: https://issues.apache.org/jira/browse/CASSANDRA-8447 Project: Cassandra Issue Type: Bug Components: Core Environment: Cluster size - 4 nodes Node size - 12 CPU (hyper threaded to 24 cores), 192 GB RAM, 2 Raid 0 arrays (Data - 10 disk, spinning 10k drives | CL 2 disk, spinning 10k drives) OS - RHEL 6.5 jvm - oracle 1.7.0_71 Cassandra version 2.0.11 Reporter: jonathan lacefield Attachments: Node_with_compaction.png, Node_without_compaction.png, cassandra.yaml, gc.logs.tar.gz, gcinspector_messages.txt, memtable_debug, results.tar.gz, visualvm_screenshot Behavior - If autocompaction is enabled, nodes will become unresponsive due to a full Old Gen heap which is not cleared during CMS GC. Test methodology - disabled autocompaction on 3 nodes, left autocompaction enabled on 1 node. Executed different Cassandra stress loads, using write only operations. Monitored visualvm and jconsole for heap pressure. Captured iostat and dstat for most tests. Captured heap dump from 50 thread load. Hints were disabled for testing on all nodes to alleviate GC noise due to hints backing up. 
Data load test through Cassandra stress - /usr/bin/cassandra-stress write n=19 -rate threads=different threads tested -schema replication\(factor=3\) keyspace=Keyspace1 -node all nodes listed Data load thread count and results: * 1 thread - Still running but looks like the node can sustain this load (approx 500 writes per second per node) * 5 threads - Nodes become unresponsive due to full Old Gen Heap. CMS measured in the 60 second range (approx 2k writes per second per node) * 10 threads - Nodes become unresponsive due to full Old Gen Heap. CMS measured in the 60 second range * 50 threads - Nodes become unresponsive due to full Old Gen Heap. CMS measured in the 60 second range (approx 10k writes per second per node) * 100 threads - Nodes become unresponsive due to full Old Gen Heap. CMS measured in the 60 second range (approx 20k writes per second per node) * 200 threads - Nodes become unresponsive due to full Old Gen Heap. CMS measured in the 60 second range (approx 25k writes per second per node) Note - the observed behavior was the same for all tests except for the single threaded test. The single threaded test does not appear to show this behavior. Tested different GC and Linux OS settings with a focus on the 50 and 200 thread loads. 
JVM settings tested: # default, out of the box, env-sh settings # 10 G Max | 1 G New - default env-sh settings # 10 G Max | 1 G New - default env-sh settings #* JVM_OPTS=$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=50 # 20 G Max | 10 G New JVM_OPTS=$JVM_OPTS -XX:+UseParNewGC JVM_OPTS=$JVM_OPTS -XX:+UseConcMarkSweepGC JVM_OPTS=$JVM_OPTS -XX:+CMSParallelRemarkEnabled JVM_OPTS=$JVM_OPTS -XX:SurvivorRatio=8 JVM_OPTS=$JVM_OPTS -XX:MaxTenuringThreshold=8 JVM_OPTS=$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75 JVM_OPTS=$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly JVM_OPTS=$JVM_OPTS -XX:+UseTLAB JVM_OPTS=$JVM_OPTS -XX:+CMSScavengeBeforeRemark JVM_OPTS=$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=6 JVM_OPTS=$JVM_OPTS -XX:CMSWaitDuration=3 JVM_OPTS=$JVM_OPTS -XX:ParallelGCThreads=12 JVM_OPTS=$JVM_OPTS -XX:ConcGCThreads=12 JVM_OPTS=$JVM_OPTS -XX:+UnlockDiagnosticVMOptions JVM_OPTS=$JVM_OPTS -XX:+UseGCTaskAffinity JVM_OPTS=$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs JVM_OPTS=$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768 JVM_OPTS=$JVM_OPTS -XX:-UseBiasedLocking # 20 G Max | 1 G New JVM_OPTS=$JVM_OPTS -XX:+UseParNewGC JVM_OPTS=$JVM_OPTS -XX:+UseConcMarkSweepGC JVM_OPTS=$JVM_OPTS -XX:+CMSParallelRemarkEnabled JVM_OPTS=$JVM_OPTS -XX:SurvivorRatio=8 JVM_OPTS=$JVM_OPTS -XX:MaxTenuringThreshold=8 JVM_OPTS=$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75 JVM_OPTS=$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly JVM_OPTS=$JVM_OPTS -XX:+UseTLAB JVM_OPTS=$JVM_OPTS -XX:+CMSScavengeBeforeRemark JVM_OPTS=$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=6
[jira] [Created] (CASSANDRA-8459) autocompaction on reads can prevent memtable space reclaimation
Benedict created CASSANDRA-8459: --- Summary: autocompaction on reads can prevent memtable space reclaimation Key: CASSANDRA-8459 URL: https://issues.apache.org/jira/browse/CASSANDRA-8459 Project: Cassandra Issue Type: Bug Components: Core Reporter: Benedict Assignee: Benedict Fix For: 2.1.3 Memtable memory reclamation depends on reads always making progress; however, on the collectTimeOrderedData critical path it is possible for the read to perform a _write_ inline, and for this write to block waiting for memtable space to be reclaimed, while the reclamation is itself blocked waiting for this read to complete. There are a number of solutions to this, but the simplest is to make the defragmentation happen asynchronously, so the read terminates normally. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
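The fix direction described above — moving the defragmenting write off the read path so the read can terminate normally — can be sketched roughly as below. This is an illustrative simplification, not Cassandra's actual code; all class and method names here are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: instead of performing the defragmenting write inline on the read
// path (where it can deadlock against memtable reclamation waiting on the
// read), hand it to a separate executor so the read returns immediately.
public class AsyncDefragSketch {
    private static final ExecutorService DEFRAG = Executors.newSingleThreadExecutor();
    private static final AtomicInteger defragged = new AtomicInteger();

    // Called from the read path; must never block on memtable space.
    static String read(String key) {
        String value = "value-for-" + key;           // the normal read result
        DEFRAG.submit(() -> defragment(key, value)); // write happens off the read path
        return value;                                // read terminates normally
    }

    // The deferred write; real code would re-insert the row in compacted form.
    static void defragment(String key, String value) {
        defragged.incrementAndGet();
    }

    public static void main(String[] args) throws Exception {
        String v = read("k1");
        DEFRAG.shutdown();
        DEFRAG.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(v + " defragged=" + defragged.get());
    }
}
```

The key property is that `read` never waits on the executor, so memtable reclamation can proceed even if the deferred write must itself wait for memtable space.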
[jira] [Updated] (CASSANDRA-8459) autocompaction on reads can prevent memtable space reclaimation
[ https://issues.apache.org/jira/browse/CASSANDRA-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benedict updated CASSANDRA-8459: Attachment: 8459.txt Attaching simple fix.
[jira] [Commented] (CASSANDRA-8447) Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242502#comment-14242502 ] Benedict commented on CASSANDRA-8447: - [~yangzhe1991]: Your thread dump allowed me to trace the problem to CASSANDRA-8459. Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled --- Key: CASSANDRA-8447 URL: https://issues.apache.org/jira/browse/CASSANDRA-8447 Project: Cassandra Issue Type: Bug Components: Core Environment: Cluster size - 4 nodes Node size - 12 CPU (hyper threaded to 24 cores), 192 GB RAM, 2 Raid 0 arrays (Data - 10 disk, spinning 10k drives | CL 2 disk, spinning 10k drives) OS - RHEL 6.5 jvm - oracle 1.7.0_71 Cassandra version 2.0.11 Reporter: jonathan lacefield Attachments: Node_with_compaction.png, Node_without_compaction.png, cassandra.yaml, gc.logs.tar.gz, gcinspector_messages.txt, memtable_debug, output.svg, results.tar.gz, visualvm_screenshot Behavior - If autocompaction is enabled, nodes will become unresponsive due to a full Old Gen heap which is not cleared during CMS GC. Test methodology - disabled autocompaction on 3 nodes, left autocompaction enabled on 1 node. Executed different Cassandra stress loads, using write only operations. Monitored visualvm and jconsole for heap pressure. Captured iostat and dstat for most tests. Captured heap dump from 50 thread load. Hints were disabled for testing on all nodes to alleviate GC noise due to hints backing up.
[jira] [Comment Edited] (CASSANDRA-8447) Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242502#comment-14242502 ] Benedict edited comment on CASSANDRA-8447 at 12/11/14 1:20 PM: --- [~yangzhe1991]: Your thread dump allowed me to trace the (your) problem to CASSANDRA-8459. This is a 2.1-specific issue, and not related to this ticket. was (Author: benedict): [~yangzhe1991]: Your thread dump allowed me to trace the problem to CASSANDRA-8459.
[jira] [Commented] (CASSANDRA-8459) autocompaction on reads can prevent memtable space reclaimation
[ https://issues.apache.org/jira/browse/CASSANDRA-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242513#comment-14242513 ] Benedict commented on CASSANDRA-8459: - No need, already sussed the problem and attached the fix.
[jira] [Commented] (CASSANDRA-8459) autocompaction on reads can prevent memtable space reclaimation
[ https://issues.apache.org/jira/browse/CASSANDRA-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242738#comment-14242738 ] Benedict commented on CASSANDRA-8459: - It's probably not a *bad idea* for 2.0 as it stops a read touching the write path, but it isn't necessary for correctness.
[jira] [Commented] (CASSANDRA-8447) Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242779#comment-14242779 ] Benedict commented on CASSANDRA-8447: - The problem is pretty simple: MeteredFlusher runs on StorageService.optionalTasks, and there are other events that can happen on here that can take a long time. In particular hint delivery scheduling, which is preceded by a blocking compaction of the hints table, during which no progress for any other optional tasks may proceed. MeteredFlusher should have its own dedicated thread, as responding promptly is essential; under this workload running every couple of seconds is pretty much necessary to avoid rapid catastrophic build up of state in memtables.
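The remedy suggested in the comment above — a dedicated thread for the flusher so a long-running optional task cannot starve it — can be sketched generically as below. This is an illustrative pattern, not Cassandra's actual MeteredFlusher code; the names and timings are invented for the demonstration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch: the memory-pressure flusher gets its own scheduled thread rather
// than sharing one with long-running optional tasks (e.g. a blocking hint
// compaction), so a slow task can't delay flushing.
public class DedicatedFlusherSketch {
    public static void main(String[] args) throws Exception {
        CountDownLatch flushed = new CountDownLatch(3);

        // Shared single-threaded pool: one long task blocks everything behind it.
        ScheduledExecutorService optionalTasks = Executors.newSingleThreadScheduledExecutor();
        optionalTasks.submit(() -> sleep(10_000)); // simulates a blocking hint compaction

        // Dedicated flusher thread: unaffected by the blocked shared pool.
        ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();
        flusher.scheduleWithFixedDelay(flushed::countDown, 0, 50, TimeUnit.MILLISECONDS);

        // The flusher keeps firing every ~50ms even while optionalTasks is stuck.
        boolean ok = flushed.await(5, TimeUnit.SECONDS);
        System.out.println("flusher kept up: " + ok);
        flusher.shutdownNow();
        optionalTasks.shutdownNow();
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```

Had the flusher been scheduled on `optionalTasks` instead, the simulated compaction would have delayed it for the full ten seconds, mirroring the starvation described in the comment.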
[jira] [Comment Edited] (CASSANDRA-8447) Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242779#comment-14242779 ] Benedict edited comment on CASSANDRA-8447 at 12/11/14 4:59 PM: --- The problem is pretty simple: MeteredFlusher runs on StorageService.optionalTasks, and there are other events that can happen on here that can take a long time. In particular hint delivery scheduling, which is preceded by a blocking compaction of the hints table, during which no progress for any other optional tasks may proceed. MeteredFlusher should have its own dedicated thread, as responding promptly is essential; under this workload running every couple of seconds is pretty much necessary to avoid rapid catastrophic build up of state in memtables. (edit: in case there's any ambiguity, this isn't a hypothesis. the heap dump clearly shows optionalTasks blocked waiting on the result of a FutureTask executing a runnable defined in CompactionManager (as far as I can tell in submitUserDefined); the current live memtable is retaining 6M records at 6Gb of retained heap, so MeteredFlusher hasn't had its turn in a long time) was (Author: benedict): The problem is pretty simple: MeteredFlusher runs on StorageService.optionalTasks, and there are other events that can happen on here that can take a long time. In particular hint delivery scheduling, which is preceded by a blocking compaction of the hints table, during which no progress for any other optional tasks may proceed. MeteredFlusher should have its own dedicated thread, as responding promptly is essential; under this workload running every couple of seconds is pretty much necessary to avoid rapid catastrophic build up of state in memtables.
[jira] [Commented] (CASSANDRA-8458) Avoid streaming from tmplink files
[ https://issues.apache.org/jira/browse/CASSANDRA-8458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242832#comment-14242832 ] Benedict commented on CASSANDRA-8458: - We could also try and figure out how/why this happens, as it should be able to stream safely. Does it only happen if streaming a range that wraps zero (i.e. from +X, to -Y)? Avoid streaming from tmplink files -- Key: CASSANDRA-8458 URL: https://issues.apache.org/jira/browse/CASSANDRA-8458 Project: Cassandra Issue Type: Bug Reporter: Marcus Eriksson Assignee: Marcus Eriksson Fix For: 2.1.3 Looks like we include tmplink sstables in streams in 2.1+, and when we do, sometimes we get this error message on the receiving side: {{java.io.IOException: Corrupt input data, block did not start with 2 byte signature ('ZV') followed by type byte, 2-byte length)}}. I've only seen this happen when a tmplink sstable is included in the stream. We can not just exclude the tmplink files when starting the stream - we need to include the original file, which we might miss since we check if the requested stream range intersects the sstable range. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-8458) Avoid streaming from tmplink files
[ https://issues.apache.org/jira/browse/CASSANDRA-8458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242832#comment-14242832 ] Benedict edited comment on CASSANDRA-8458 at 12/11/14 5:45 PM: --- We could also try and figure out how/why this happens, as it should be able to stream safely. Does it only happen if streaming a range that wraps zero (i.e. from +X, to -Y)? edit: To elaborate, I suspect the broken bit is that our dfile/ifile objects don't actually truncate the readable range - only our indexed decoratedkey range is truncated. In sstable.getPositionsForRanges we just return the end of the file if the range goes past the range of the file; in this case we could stream partially written data. If so, we could fix by simply making sstable.getPositionsForRanges() lookup the start position of the last key in the file, and always ensure we leave a key's overlap between the dropped sstables and the replacement. was (Author: benedict): We could also try and figure out how/why this happens, as it should be able to stream safely. Does it only happen if streaming a range that wraps zero (i.e. from +X, to -Y)?
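The suspected fix in the edited comment above — clamping the stream's end position to the start of the last fully indexed key rather than the raw end of a still-growing file — reduces to a one-line bound. The sketch below is a hypothetical simplification; the real `getPositionsForRanges` operates on token ranges and index summaries.

```java
// Sketch: never stream past the start of the last indexed key, so a partially
// written tail of a tmplink sstable can never be included in the stream.
public class ClampStreamRangeSketch {
    static long clampEnd(long requestedEnd, long lastIndexedKeyStart) {
        return Math.min(requestedEnd, lastIndexedKeyStart);
    }

    public static void main(String[] args) {
        long fileLength = 1000;          // raw length, may include a partial tail
        long lastIndexedKeyStart = 900;  // start of the last fully indexed key
        // A range running past the file previously returned fileLength (1000);
        // clamped, it stops at the last indexed key.
        System.out.println(clampEnd(fileLength, lastIndexedKeyStart));
        // Ranges ending inside the indexed region are unaffected.
        System.out.println(clampEnd(500, lastIndexedKeyStart));
    }
}
```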
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242859#comment-14242859 ] Benedict commented on CASSANDRA-8457: - FTR, I strongly doubt _context switching_ is actually as much of a problem as we think, although constraining it is never a bad thing. The big hit we have is _thread signalling_ costs, which is a different but related beast. Certainly the talking point that raised this was discussing system time spent serving context switches, which would definitely be referring to signalling, not the switching itself. Now, we do use a BlockingQueue for OutboundTcpConnection which will incur these costs, however I strongly suspect the impact will be much lower than predicted - especially as the testing done to flag this up was on small clusters with RF=1, where these threads would not be being exercised at all. The costs of going to the network itself are likely to exceed the context switching costs, and naturally permit messages to accumulate in the queue, reducing the number of signals actually needed. There are then the negative performance implications we have found from small numbers of connections under NIO to consider, so that this change could have significant downsides for the majority of deployed clusters (although if we get batching in the client driver we may see these penalties disappear). To establish if there's likely a benefit to exploit, we could most likely refactor this code comparatively minimally (rather than rewriting to NIO/Netty) to make use of the SharedExecutorPool to establish if such a positive effect is indeed to be had, as this would reduce the number of threads in flight to those actually serving work on the OTCs. This wouldn't affect the ITC, but I am dubious of their contribution. We should probably also actually test if this is indeed a problem for clusters at scale performing in-memory CL1 reads.
nio MessagingService Key: CASSANDRA-8457 URL: https://issues.apache.org/jira/browse/CASSANDRA-8457 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Jonathan Ellis Assignee: Ariel Weisberg Labels: performance Fix For: 3.0 Thread-per-peer (actually two each incoming and outbound) is a big contributor to context switching, especially for larger clusters. Let's look at switching to nio, possibly via Netty. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
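The point in the comment above about the network naturally letting messages accumulate — so fewer signals are needed per message — is the standard batch-drain pattern on a BlockingQueue: block (and pay a wakeup) for the first message only, then drain whatever has piled up without further signalling. This is a generic sketch, not Cassandra's OutboundTcpConnection code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: one potentially-blocking take() per batch, amortizing the thread
// signalling cost across all messages that accumulated while sending.
public class BatchDrainSketch {
    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < 5; i++) queue.put("msg" + i); // messages pile up

        List<String> batch = new ArrayList<>();
        batch.add(queue.take());   // the only call that can block (one signal)
        queue.drainTo(batch);      // grab the rest without blocking or signals
        System.out.println("sent batch of " + batch.size());
    }
}
```

Under load, each pass through this loop sends many messages for a single wakeup; only an idle connection pays one signal per message.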
[jira] [Created] (CASSANDRA-8466) Stress support for treating clients as truly independent entities (separate driver instance)
Benedict created CASSANDRA-8466: --- Summary: Stress support for treating clients as truly independent entities (separate driver instance) Key: CASSANDRA-8466 URL: https://issues.apache.org/jira/browse/CASSANDRA-8466 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Benedict Fix For: 2.1.3 For performance testing purposes, it would be helpful to be able to mimic truly independent clients. The easiest way to do this is to use a unique classloader for instantiating the driver for each client, which should be a reasonably straightforward change. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
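The unique-classloader-per-client idea in the ticket above can be sketched as follows. Loading the driver through a fresh `URLClassLoader` per simulated client gives each its own copy of driver-level static state (connection pools, event loops). The jar array here is a placeholder; in real use it would point at the driver jar(s).

```java
import java.net.URL;
import java.net.URLClassLoader;

// Sketch: one classloader per simulated client, so statics are not shared.
public class IsolatedClientSketch {
    static ClassLoader newClientLoader(URL[] driverJars) {
        // parent = null (bootstrap delegation only) so driver classes would be
        // loaded freshly per client rather than resolved by the shared
        // application classloader.
        return new URLClassLoader(driverJars, null);
    }

    public static void main(String[] args) throws Exception {
        URL[] jars = new URL[0]; // placeholder for the driver jar(s)
        ClassLoader a = newClientLoader(jars);
        ClassLoader b = newClientLoader(jars);
        // Distinct loaders => distinct Class objects => distinct static state.
        System.out.println(a != b);
    }
}
```

The design point is that classes loaded by different loaders are different classes to the JVM, which is exactly what makes the clients "truly independent".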
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243924#comment-14243924 ] Benedict commented on CASSANDRA-8457: - bq. cstar doesn't support multiple stress clients Stress could be modified to support simulating true multiple client access; I've filed CASSANDRA-8466. What we really need is to be able to fire up a (much) larger cluster, though. With our current hardware this would probably necessitate multiple VMs per node - say, 4, giving a viable cluster of 24, which is probably about the bare minimum for these kinds of tests. This necessarily pollutes the results somewhat since each VM will have only half a CPU, and incur extra thread signalling penalties, but it's better than nothing. Either that or we get a bunch of cheapo nodes, or we add EC2 integration. [~enigmacurry] any plans in the works to introduce support for large clusters?
[jira] [Commented] (CASSANDRA-8466) Stress support for treating clients as truly independent entities (separate driver instance)
[ https://issues.apache.org/jira/browse/CASSANDRA-8466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243925#comment-14243925 ] Benedict commented on CASSANDRA-8466: - [~mfiguiere] are there any plans to support tuning the number of IO threads spawned by the driver? For this ticket it would be extremely sane to limit it to just 1.
[jira] [Commented] (CASSANDRA-8308) Windows: Commitlog access violations on unit tests
[ https://issues.apache.org/jira/browse/CASSANDRA-8308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14244033#comment-14244033 ] Benedict commented on CASSANDRA-8308: - bq. I'm not sure how that's related to this patch My bad - I misread the patch boundary. Since we're opening/closing an extra file, it might be worth only performing the action if the channel isn't the correct size, since it typically will be (so: open the channel; if the size is incorrect, close it, open a RandomAccessFile, set the length, and reopen the channel). I haven't tested the change to introduce strerror - are you confident of it, and have you tested it? It might be sensible to split that into its own ticket. Otherwise LGTM Windows: Commitlog access violations on unit tests -- Key: CASSANDRA-8308 URL: https://issues.apache.org/jira/browse/CASSANDRA-8308 Project: Cassandra Issue Type: Bug Reporter: Joshua McKenzie Assignee: Joshua McKenzie Priority: Minor Labels: Windows Fix For: 3.0 Attachments: 8308_v1.txt, 8308_v2.txt We have four unit tests failing on trunk on Windows, all with FileSystemException's related to the SchemaLoader:
{noformat}
[junit] Test org.apache.cassandra.db.compaction.DateTieredCompactionStrategyTest FAILED
[junit] Test org.apache.cassandra.cql3.ThriftCompatibilityTest FAILED
[junit] Test org.apache.cassandra.io.sstable.SSTableRewriterTest FAILED
[junit] Test org.apache.cassandra.repair.LocalSyncTaskTest FAILED
{noformat}
Example error:
{noformat}
[junit] Caused by: java.nio.file.FileSystemException: build\test\cassandra\commitlog;0\CommitLog-5-1415908745965.log: The process cannot access the file because it is being used by another process.
[junit]
[junit] 	at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
[junit] 	at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
[junit] 	at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
[junit] 	at sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269)
[junit] 	at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
[junit] 	at java.nio.file.Files.delete(Files.java:1079)
[junit] 	at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:125)
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
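The open-channel-first sequence suggested in the comment above (open the channel; reopen only when the size is wrong) could look roughly like this. The method name and expected-size parameter are illustrative, not the actual patch.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.file.StandardOpenOption;

public class CommitLogOpenSketch {
    // Fast path: open the channel once; only when the size is wrong do we pay
    // for the extra close + RandomAccessFile.setLength + reopen.
    static FileChannel openWithSize(File f, long expectedSize) throws IOException {
        FileChannel ch = FileChannel.open(f.toPath(),
                StandardOpenOption.READ, StandardOpenOption.WRITE,
                StandardOpenOption.CREATE);
        if (ch.size() == expectedSize)
            return ch;                       // typical case: no extra file open/close
        ch.close();
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            raf.setLength(expectedSize);     // pre-size the segment
        }
        return FileChannel.open(f.toPath(),
                StandardOpenOption.READ, StandardOpenOption.WRITE);
    }
}
```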
[jira] [Commented] (CASSANDRA-6993) Windows: remove mmap'ed I/O for index files and force standard file access
[ https://issues.apache.org/jira/browse/CASSANDRA-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14244036#comment-14244036 ] Benedict commented on CASSANDRA-6993: - This wouldn't be sufficient for the procfs check, as Mac (and by default FreeBSD) don't have it. Windows: remove mmap'ed I/O for index files and force standard file access -- Key: CASSANDRA-6993 URL: https://issues.apache.org/jira/browse/CASSANDRA-6993 Project: Cassandra Issue Type: Improvement Reporter: Joshua McKenzie Assignee: Joshua McKenzie Priority: Minor Labels: Windows Fix For: 3.0, 2.1.3 Attachments: 6993_2.1_v1.txt, 6993_v1.txt, 6993_v2.txt Memory-mapped I/O on Windows causes issues with hard-links; we're unable to delete hard-links to open files with memory-mapped segments even using nio. We'll need to push for close to performance parity between mmap'ed I/O and buffered going forward as the buffered / compressed path offers other benefits. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8248) Possible memory leak
[ https://issues.apache.org/jira/browse/CASSANDRA-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14244039#comment-14244039 ] Benedict commented on CASSANDRA-8248: - +1 Possible memory leak - Key: CASSANDRA-8248 URL: https://issues.apache.org/jira/browse/CASSANDRA-8248 Project: Cassandra Issue Type: Bug Reporter: Alexander Sterligov Assignee: Joshua McKenzie Attachments: 8248_v1.txt, thread_dump Sometimes during repair cassandra starts to consume more memory than expected. Total amount of data on node is about 20GB. Size of the data directory is 66GB because of snapshots. Top reports: {noformat} PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 15724 loadbase 20 0 493g 55g 44g S 28 44.2 4043:24 java {noformat} At /proc/15724/maps there are a lot of maps of deleted files {quote} 7f63a6102000-7f63a6332000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted) 7f63a6332000-7f63a6562000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted) 7f63a6562000-7f63a6792000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted) 7f63a6792000-7f63a69c2000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted) 7f63a69c2000-7f63a6bf2000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted) 7f63a6bf2000-7f63a6e22000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted) 7f63a6e22000-7f63a7052000 r--s 08:21 9442763 
/ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted) 7f63a7052000-7f63a7282000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted) 7f63a7282000-7f63a74b2000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted) 7f63a74b2000-7f63a76e2000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted) 7f63a76e2000-7f63a7912000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted) 7f63a7912000-7f63a7b42000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted) 7f63a7b42000-7f63a7d72000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted) 7f63a7d72000-7f63a7fa2000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted) 7f63a7fa2000-7f63a81d2000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted) 7f63a81d2000-7f63a8402000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted) 7f63a8402000-7f63a8622000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted) 7f63a8622000-7f63a8842000 r--s 08:21 9442763 
/ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted) 7f63a8842000-7f63a8a62000 r--s 08:21 9442763
[jira] [Commented] (CASSANDRA-8466) Stress support for treating clients as truly independent entities (separate driver instance)
[ https://issues.apache.org/jira/browse/CASSANDRA-8466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14244042#comment-14244042 ] Benedict commented on CASSANDRA-8466: - An even easier approach, suggested by [~omichallat], is simply to open a session for each simulated client. This would be a really trivial change. Stress support for treating clients as truly independent entities (separate driver instance) Key: CASSANDRA-8466 URL: https://issues.apache.org/jira/browse/CASSANDRA-8466 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Benedict Fix For: 2.1.3 For performance testing purposes, it would be helpful to be able to mimic truly independent clients. The easiest way to do this is to use a unique classloader for instantiating the driver for each client, which should be a reasonably straightforward change. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
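The session-per-client idea above can be sketched generically: instead of all simulated clients sharing one driver session, each client opens and owns its own. The `Supplier` abstraction below stands in for whatever call opens a real driver session; the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class PerClientSessions {
    // One session per simulated client, so connection pools and in-flight
    // request tracking are not shared between clients. <S> stands in for the
    // driver's session type.
    static <S> List<S> openOnePerClient(int clientCount, Supplier<S> connect) {
        List<S> sessions = new ArrayList<>(clientCount);
        for (int i = 0; i < clientCount; i++)
            sessions.add(connect.get());   // each call opens an independent session
        return sessions;
    }
}
```

This is weaker isolation than the classloader approach in the ticket description (driver-level static state is still shared), which is presumably why it is described as the easier, more trivial change.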
[jira] [Created] (CASSANDRA-8468) Stress support for multiple asynchronous operations per client
Benedict created CASSANDRA-8468: --- Summary: Stress support for multiple asynchronous operations per client Key: CASSANDRA-8468 URL: https://issues.apache.org/jira/browse/CASSANDRA-8468 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict In conjunction with CASSANDRA-8466, this would permit more tunable variation in network load generation characteristics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-8469) Stress support for distributed operation, coordinated by a single stress process
Benedict created CASSANDRA-8469: --- Summary: Stress support for distributed operation, coordinated by a single stress process Key: CASSANDRA-8469 URL: https://issues.apache.org/jira/browse/CASSANDRA-8469 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Benedict As we test larger clusters, we need to run multiple stress clients (this is already the case for many users trialling c*). Baking in (initially simple) support for controlling and reporting multiple stress daemons from one command line would be extremely helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8447) Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14244089#comment-14244089 ] Benedict commented on CASSANDRA-8447: - In this case the OptionalTasks thread was not blocked at the point of taking the heap dump, but it appears it had been blocked for several minutes, when it needs to run every few seconds. So whilst I cannot guarantee hints were the cause of the delay, we can be fairly certain the delay is the problem, and we should move the metered flusher to its own dedicated thread. Approximately 200x as much data accumulated before a flush was triggered as under normal operation.
{noformat}
INFO [OptionalTasks:1] 2014-12-11 12:29:18,154 MeteredFlusher.java (line 58) flushing high-traffic column family CFS(Keyspace='Keyspace1', ColumnFamily='Standard1') (estimated 175643600 bytes)
INFO [OptionalTasks:1] 2014-12-11 12:29:18,155 ColumnFamilyStore.java (line 794) Enqueuing flush of Memtable-Standard1@1155435229(17589220/175892200 serialized/live bytes, 399755 ops)
INFO [OptionalTasks:1] 2014-12-11 12:36:24,642 MeteredFlusher.java (line 69) estimated 33071928850 live and 33071449400 flushing bytes used by all memtables
INFO [OptionalTasks:1] 2014-12-11 12:36:24,642 MeteredFlusher.java (line 92) flushing CFS(Keyspace='Keyspace1', ColumnFamily='Standard1') to free up 33071687000 bytes
INFO [OptionalTasks:1] 2014-12-11 12:36:24,643 ColumnFamilyStore.java (line 794) Enqueuing flush of Memtable-Standard1@401833564(3307178160/33071781600 serialized/live bytes, 75163140 ops)
{noformat}
Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled --- Key: CASSANDRA-8447 URL: https://issues.apache.org/jira/browse/CASSANDRA-8447 Project: Cassandra Issue Type: Bug Components: Core Environment: Cluster size - 4 nodes Node size - 12 CPU (hyper threaded to 24 cores), 192 GB RAM, 2 Raid 0 arrays (Data - 10 disk, spinning 10k drives | CL 2 disk, spinning 10k drives) OS - RHEL
6.5 jvm - oracle 1.7.0_71 Cassandra version 2.0.11 Reporter: jonathan lacefield Attachments: Node_with_compaction.png, Node_without_compaction.png, cassandra.yaml, gc.logs.tar.gz, gcinspector_messages.txt, memtable_debug, output.1.svg, output.2.svg, output.svg, results.tar.gz, visualvm_screenshot Behavior - If autocompaction is enabled, nodes will become unresponsive due to a full Old Gen heap which is not cleared during CMS GC. Test methodology - disabled autocompaction on 3 nodes, left autocompaction enabled on 1 node. Executed different Cassandra stress loads, using write only operations. Monitored visualvm and jconsole for heap pressure. Captured iostat and dstat for most tests. Captured heap dump from 50 thread load. Hints were disabled for testing on all nodes to alleviate GC noise due to hints backing up. Data load test through Cassandra stress - /usr/bin/cassandra-stress write n=19 -rate threads=different threads tested -schema replication\(factor=3\) keyspace=Keyspace1 -node all nodes listed Data load thread count and results:
* 1 thread - Still running but looks like the node can sustain this load (approx 500 writes per second per node)
* 5 threads - Nodes become unresponsive due to full Old Gen Heap. CMS measured in the 60 second range (approx 2k writes per second per node)
* 10 threads - Nodes become unresponsive due to full Old Gen Heap. CMS measured in the 60 second range
* 50 threads - Nodes become unresponsive due to full Old Gen Heap. CMS measured in the 60 second range (approx 10k writes per second per node)
* 100 threads - Nodes become unresponsive due to full Old Gen Heap. CMS measured in the 60 second range (approx 20k writes per second per node)
* 200 threads - Nodes become unresponsive due to full Old Gen Heap. CMS measured in the 60 second range (approx 25k writes per second per node)
Note - the observed behavior was the same for all tests except for the single threaded test. The single threaded test does not appear to show this behavior.
Tested different GC and Linux OS settings with a focus on the 50 and 200 thread loads. JVM settings tested:
# default, out of the box, env-sh settings
# 10 G Max | 1 G New - default env-sh settings
# 10 G Max | 1 G New - default env-sh settings
#* JVM_OPTS=$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=50
# 20 G Max | 10 G New
JVM_OPTS=$JVM_OPTS -XX:+UseParNewGC
JVM_OPTS=$JVM_OPTS -XX:+UseConcMarkSweepGC
JVM_OPTS=$JVM_OPTS -XX:+CMSParallelRemarkEnabled
JVM_OPTS=$JVM_OPTS -XX:SurvivorRatio=8
JVM_OPTS=$JVM_OPTS -XX:MaxTenuringThreshold=8
JVM_OPTS=$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75
JVM_OPTS=$JVM_OPTS
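The fix proposed in the comment above - running the metered flusher on its own dedicated thread so other long-running tasks cannot delay it - could be sketched like this. Class, thread name, and period are illustrative, not Cassandra's actual implementation.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DedicatedFlusherSketch {
    // A single-threaded scheduler owned exclusively by the flusher: unlike a
    // shared "optional tasks" executor, nothing else (e.g. hint replay) can
    // occupy this thread and delay the periodic memtable-size check.
    static ScheduledExecutorService startFlusher(Runnable flushCheck, long periodMillis) {
        ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "MeteredFlusher"); // hypothetical thread name
            t.setDaemon(true);
            return t;
        });
        // fixed delay: if one check runs long, the next still follows promptly after it
        flusher.scheduleWithFixedDelay(flushCheck, periodMillis, periodMillis,
                                       TimeUnit.MILLISECONDS);
        return flusher;
    }
}
```

With the shared executor, a multi-minute task ahead of the flusher in the queue produces exactly the pathology in the logs above: roughly 200x the normal data accumulating before a flush fires.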
[jira] [Commented] (CASSANDRA-8466) Stress support for treating clients as truly independent entities (separate driver instance)
[ https://issues.apache.org/jira/browse/CASSANDRA-8466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14244095#comment-14244095 ] Benedict commented on CASSANDRA-8466: - Neither of those settings is honoured for the v3 protocol, nor is either honoured in the way we would most likely want for the v2 protocol. Stress support for treating clients as truly independent entities (separate driver instance) Key: CASSANDRA-8466 URL: https://issues.apache.org/jira/browse/CASSANDRA-8466 Project: Cassandra Issue Type: Improvement Components: Tools Reporter: Benedict Fix For: 2.1.3 For performance testing purposes, it would be helpful to be able to mimic truly independent clients. The easiest way to do this is to use a unique classloader for instantiating the driver for each client, which should be a reasonably straightforward change. -- This message was sent by Atlassian JIRA (v6.3.4#6332)