[jira] [Commented] (CASSANDRA-3228) Add new range scan with clock
[ https://issues.apache.org/jira/browse/CASSANDRA-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109100#comment-13109100 ]

Ryan King commented on CASSANDRA-3228:
--------------------------------------

bq. the ruby gem, which is actively maintained as these things go, STILL does not have 2ary index query support

Not true: https://github.com/fauna/cassandra/pull/59

Add new range scan with clock
-----------------------------

Key: CASSANDRA-3228
URL: https://issues.apache.org/jira/browse/CASSANDRA-3228
Project: Cassandra
Issue Type: New Feature
Components: Core
Affects Versions: 0.8.5
Reporter: Todd Nine
Priority: Minor

Currently, it is not possible to specify a minimum clock time on columns when performing range scans. In some situations, such as custom migration or batch processing, it would be helpful to allow the client to specify a minimum clock time; only columns with a clock value >= the specified one would be returned. I.e. range scan(rowKey, startVal, endVal, reversed, min clock).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
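The proposed call amounts to a timestamp post-filter on the normal slice. A minimal sketch only, with hypothetical names (`Column`, `filterByMinClock`); this is not Cassandra's actual read path:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: after the usual startVal/endVal slice is resolved,
// drop any column whose client-supplied clock is older than the requested
// minimum, so only columns with clock >= minClock are returned.
public class MinClockFilter {
    public static final class Column {
        final String name;
        final long clock; // client-supplied timestamp ("clock")
        public Column(String name, long clock) { this.name = name; this.clock = clock; }
    }

    public static List<Column> filterByMinClock(List<Column> slice, long minClock) {
        List<Column> out = new ArrayList<>();
        for (Column c : slice) {
            if (c.clock >= minClock) out.add(c);
        }
        return out;
    }
}
```

For the migration use case above, the client would pass the clock of its last batch run as `minClock` and receive only columns written since then.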
[jira] [Commented] (CASSANDRA-3199) Counter write protocol: have the coordinator (instead of first replica) wait for replica responses directly
[ https://issues.apache.org/jira/browse/CASSANDRA-3199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13103760#comment-13103760 ]

Ryan King commented on CASSANDRA-3199:
--------------------------------------

This may be out of scope for this ticket, but can we differentiate between exceptions in writing to the first replica and in replicating to the others? That might help us do some limited forms of retries with counters.

Counter write protocol: have the coordinator (instead of first replica) wait for replica responses directly
-----------------------------------------------------------------------------------------------------------

Key: CASSANDRA-3199
URL: https://issues.apache.org/jira/browse/CASSANDRA-3199
Project: Cassandra
Issue Type: Improvement
Components: Core
Reporter: Sylvain Lebresne
Priority: Minor
Labels: counters

The current counter write protocol is this (taking the case where the write coordinator != first replica):
# coordinator forwards the write request to the first replica
# first replica writes locally and replicates to the other replicas
# first replica waits for enough answers from the other replicas to satisfy the consistency level
# first replica acks the coordinator, which completes the write to the client

This ticket proposes to modify the protocol to:
# coordinator forwards the write request to the first replica
# first replica writes locally, acks the coordinator for its own write, and replicates to the other replicas
# other replicas respond directly to the coordinator
# once the coordinator has enough responses, it completes the write

I see two advantages to this new protocol:
* it should be a tad faster, since it parallelizes wire transfer better
* it would make TimeoutException a bit less likely and, more importantly, a TimeoutException would much more likely mean that the write hasn't been persisted. Indeed, in the current protocol, once the first replica has sent the write to the other replicas, it has to wait for their answers and then answer the coordinator. If it dies during that time, we will return a TimeoutException, even though the first replica died after having done its main job.

The con is that this adds a bit of complexity. In particular, the other replicas have to answer the coordinator for a query that was issued by the first replica.
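The change in the last step comes down to the coordinator counting acks itself. A toy sketch of that bookkeeping under assumed names; this is not Cassandra's actual response-handler code:

```java
// Toy sketch: the coordinator tallies the first replica's own ack plus each
// other replica's direct ack, and completes the write as soon as the
// consistency level's required count is reached -- the first replica no
// longer sits in the middle waiting on the others.
public class CoordinatorAckCounter {
    private final int required; // responses needed for the consistency level
    private int received = 0;

    public CoordinatorAckCounter(int required) { this.required = required; }

    // Invoked once for the first replica's own write ack, and once per
    // direct response from each other replica.
    public synchronized boolean onResponse() {
        received++;
        return received >= required; // true: the write can be completed
    }
}
```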
[jira] [Created] (CASSANDRA-3151) CLI documentation should explain how to create column families with CompositeType's
CLI documentation should explain how to create column families with CompositeType's
-----------------------------------------------------------------------------------

Key: CASSANDRA-3151
URL: https://issues.apache.org/jira/browse/CASSANDRA-3151
Project: Cassandra
Issue Type: Improvement
Reporter: Ryan King
Priority: Minor
[jira] [Commented] (CASSANDRA-3096) Test RoundRobinScheduler timeouts
[ https://issues.apache.org/jira/browse/CASSANDRA-3096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094728#comment-13094728 ]

Ryan King commented on CASSANDRA-3096:
--------------------------------------

looks good, +1

Test RoundRobinScheduler timeouts
---------------------------------

Key: CASSANDRA-3096
URL: https://issues.apache.org/jira/browse/CASSANDRA-3096
Project: Cassandra
Issue Type: Bug
Components: API
Reporter: Stu Hood
Assignee: Stu Hood
Fix For: 1.0
Attachments: 0001-Properly-throw-timeouts-decrement-the-count-of-waiters.txt

CASSANDRA-3079 was very hasty, and introduced two bugs that would: 1) cause the scheduler to busywait after a timeout, 2) never actually throw timeouts. This calls for a test.
[jira] [Commented] (CASSANDRA-2319) Promote row index
[ https://issues.apache.org/jira/browse/CASSANDRA-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088736#comment-13088736 ]

Ryan King commented on CASSANDRA-2319:
--------------------------------------

I haven't followed that ticket closely, but I think the answer is yes. For wide row use cases this patch lets you eliminate SSTables with only the info in the index (because we know what range(s) of columns for a row are in that file).

Promote row index
-----------------

Key: CASSANDRA-2319
URL: https://issues.apache.org/jira/browse/CASSANDRA-2319
Project: Cassandra
Issue Type: Improvement
Components: Core
Reporter: Stu Hood
Assignee: Stu Hood
Labels: compression, index, timeseries
Fix For: 1.0
Attachments: 2319-v1.tgz, 2319-v2.tgz, promotion.pdf, version-f.txt, version-g-lzf.txt, version-g.txt

The row index contains entries for configurably sized blocks of a wide row. For a row of appreciable size, the row index ends up directing the third seek (1. index, 2. row index, 3. content) to nearby the first column of a scan.

Since the row index is always used for wide rows, and since it contains information that tells us whether or not the 3rd seek is necessary (the column range or name we are trying to slice may not exist in a given sstable), promoting the row index into the sstable index would allow us to drop the maximum number of seeks for wide rows back to 2, and, more importantly, would allow sstables to be eliminated using only the index.

An example use case that benefits greatly from this change is time series data in wide rows, where data is appended to the beginning or end of the row. Our existing compaction strategy gets lucky and clusters the oldest data in the oldest sstables: for queries to recently appended data, we would be able to eliminate wide rows using only the sstable index, rather than needing to seek into the data file to determine that it isn't interesting. For narrow rows, this change would have no effect, as they will not reach the threshold for indexing anyway.

A first cut design for this change would look very similar to the file format design proposed on #674 (http://wiki.apache.org/cassandra/FileFormatDesignDoc): row keys clustered, column names clustered, and offsets clustered and delta encoded.
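The elimination check described in the comment reduces to an interval-overlap test against per-row metadata in the promoted index. A sketch under assumed names; `String` stands in for comparator-ordered column names:

```java
// Sketch: if the promoted index records, per row and per sstable, the range
// of column names present, a slice query can skip any sstable whose range
// cannot intersect the requested one -- eliminating the data-file seek.
public class SSTableElimination {
    // True when [sstableMinCol, sstableMaxCol] and [queryStart, queryEnd]
    // overlap in comparator order; false means the sstable can be skipped.
    public static boolean mayContain(String sstableMinCol, String sstableMaxCol,
                                     String queryStart, String queryEnd) {
        return queryStart.compareTo(sstableMaxCol) <= 0
            && queryEnd.compareTo(sstableMinCol) >= 0;
    }
}
```

In the time-series case above, a query for recent columns falls outside the column range of old sstables, so `mayContain` returns false for them without touching their data files.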
[jira] [Commented] (CASSANDRA-2478) Custom protocol/transport
[ https://issues.apache.org/jira/browse/CASSANDRA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088918#comment-13088918 ]

Ryan King commented on CASSANDRA-2478:
--------------------------------------

If you want to use Netty, I'd suggest considering using finagle on top of it: http://github.com/twitter/finagle. It's written in Scala, but it's very easy to use from Java.

Custom protocol/transport
-------------------------

Key: CASSANDRA-2478
URL: https://issues.apache.org/jira/browse/CASSANDRA-2478
Project: Cassandra
Issue Type: New Feature
Components: API, Core
Reporter: Eric Evans
Priority: Minor

A custom wire protocol would give us the flexibility to optimize for our specific use-cases, and eliminate a troublesome dependency (I'm referring to Thrift, but none of the others would be significantly better). Additionally, RPC is a bad fit here, and we'd do better to move in the direction of something that natively supports streaming. I don't think this is as daunting as it might seem initially. Utilizing an existing server framework like Netty, combined with some copy-and-paste of bits from other FLOSS projects, would probably get us 80% of the way there.
[jira] [Commented] (CASSANDRA-2478) Custom protocol/transport
[ https://issues.apache.org/jira/browse/CASSANDRA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089254#comment-13089254 ]

Ryan King commented on CASSANDRA-2478:
--------------------------------------

Finagle is a library for building protocols that *happens* to come with a few built-in implementations (http, memcached, thrift, etc). It solves a lot of problems that you'd have to re-build on top of Netty.

Custom protocol/transport
-------------------------

Key: CASSANDRA-2478
URL: https://issues.apache.org/jira/browse/CASSANDRA-2478
Project: Cassandra
Issue Type: New Feature
Components: API, Core
Reporter: Eric Evans
Priority: Minor

A custom wire protocol would give us the flexibility to optimize for our specific use-cases, and eliminate a troublesome dependency (I'm referring to Thrift, but none of the others would be significantly better). Additionally, RPC is a bad fit here, and we'd do better to move in the direction of something that natively supports streaming. I don't think this is as daunting as it might seem initially. Utilizing an existing server framework like Netty, combined with some copy-and-paste of bits from other FLOSS projects, would probably get us 80% of the way there.
[jira] [Commented] (CASSANDRA-2500) Ruby dbi client (for CQL) that conforms to AR:ConnectionAdapter
[ https://issues.apache.org/jira/browse/CASSANDRA-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085176#comment-13085176 ]

Ryan King commented on CASSANDRA-2500:
--------------------------------------

What would we need to change in fauna/cassandra?

Ruby dbi client (for CQL) that conforms to AR:ConnectionAdapter
---------------------------------------------------------------

Key: CASSANDRA-2500
URL: https://issues.apache.org/jira/browse/CASSANDRA-2500
Project: Cassandra
Issue Type: Task
Components: API
Reporter: Jon Hermes
Assignee: Kelley Reynolds
Labels: cql
Fix For: 0.8.5
Attachments: 2500.txt, genthriftrb.txt, rbcql-0.0.0.tgz

Create a ruby driver for CQL. Lacking something standard (such as py-dbapi), going with something common instead -- RoR ActiveRecord Connection Adapter (http://api.rubyonrails.org/classes/ActiveRecord/ConnectionAdapters/AbstractAdapter.html).
[jira] [Created] (CASSANDRA-3019) log the keyspace and CF of large rows being compacted
log the keyspace and CF of large rows being compacted
-----------------------------------------------------

Key: CASSANDRA-3019
URL: https://issues.apache.org/jira/browse/CASSANDRA-3019
Project: Cassandra
Issue Type: Improvement
Reporter: Ryan King
Assignee: Ryan King
Priority: Minor

If you want to find the large rows it'd help to know the Keyspace and CF to look in.
[jira] [Updated] (CASSANDRA-3019) log the keyspace and CF of large rows being compacted
[ https://issues.apache.org/jira/browse/CASSANDRA-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan King updated CASSANDRA-3019:
---------------------------------

Attachment: 0001-add-keyspace-and-cf-to-large-row-compaction-logging.patch

log the keyspace and CF of large rows being compacted
-----------------------------------------------------

Key: CASSANDRA-3019
URL: https://issues.apache.org/jira/browse/CASSANDRA-3019
Project: Cassandra
Issue Type: Improvement
Reporter: Ryan King
Assignee: Ryan King
Priority: Minor
Attachments: 0001-add-keyspace-and-cf-to-large-row-compaction-logging.patch

If you want to find the large rows it'd help to know the Keyspace and CF to look in.
[jira] [Commented] (CASSANDRA-3017) add a Message size limit
[ https://issues.apache.org/jira/browse/CASSANDRA-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083881#comment-13083881 ]

Ryan King commented on CASSANDRA-3017:
--------------------------------------

I think fatal errors are what we're trying to avoid here. The biggest threat is probably malicious, not accidental (since you need to get the MAGIC and headers in before this length).

add a Message size limit
------------------------

Key: CASSANDRA-3017
URL: https://issues.apache.org/jira/browse/CASSANDRA-3017
Project: Cassandra
Issue Type: Improvement
Components: Core
Reporter: Jonathan Ellis
Assignee: Ryan King
Priority: Minor
Labels: lhf
Attachments: 0001-use-the-thrift-max-message-size-for-inter-node-messa.patch

We protect the server from allocating huge buffers for malformed messages with the Thrift frame size (CASSANDRA-475). But we don't have similar protection for the inter-node Message objects. Adding this would be good to deal with malicious adversaries as well as a malfunctioning cluster participant.
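The guard itself is a length sanity check before buffer allocation. A sketch with assumed names; per the attachment title, the actual patch reuses the Thrift max message size as the limit:

```java
import java.io.IOException;

// Sketch: after the MAGIC and header have been read, validate the declared
// payload length against a configured maximum before allocating, so a
// malicious or corrupt length field cannot trigger a huge allocation.
public class MessageSizeGuard {
    public static byte[] allocateChecked(int declaredLength, int maxLength) throws IOException {
        if (declaredLength < 0 || declaredLength > maxLength) {
            throw new IOException("Invalid inter-node message length " + declaredLength
                                  + " (limit " + maxLength + ")");
        }
        return new byte[declaredLength];
    }
}
```

Throwing rather than allocating lets the connection be dropped cleanly, which is the point of the comment: a bad peer should cause a rejected message, not a fatal OutOfMemoryError.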
[jira] [Commented] (CASSANDRA-2915) Lucene based Secondary Indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080008#comment-13080008 ]

Ryan King commented on CASSANDRA-2915:
--------------------------------------

Regarding realtime search, hasn't our (twitter's) realtime search branch been merged into lucene trunk? Whenever that's available we should get real realtime results.

Lucene based Secondary Indexes
------------------------------

Key: CASSANDRA-2915
URL: https://issues.apache.org/jira/browse/CASSANDRA-2915
Project: Cassandra
Issue Type: New Feature
Components: Core
Reporter: T Jake Luciani
Labels: secondary_index
Fix For: 1.0

Secondary indexes (of type KEYS) suffer from a number of limitations in their current form:
- Multiple IndexClauses only work when there is a subset of rows under the highest clause
- One new column family is created per index; this means 10 new CFs for 10 secondary indexes

This ticket will use the Lucene library to implement secondary indexes as one index per CF, and utilize the Lucene query engine to handle multiple index clauses. Also, by using Lucene we get a highly optimized file format.

There are a few parallels we can draw between Cassandra and Lucene. Lucene indexes segments in memory, then flushes them to disk, so we can sync our memtable flushes to Lucene flushes. Lucene also has optimize(), which correlates to our compaction process, so these can be sync'd as well. We will also need to correlate column validators to Lucene tokenizers so the data can be stored properly; the big win is that once this is done, we can perform complex queries within a column, like wildcard searches.

The downside of this approach is that we will need to read before write, since documents in Lucene are written as complete documents. For random workloads with lots of indexed columns, this means we need to read the document from the index, update it, and write it back.
[jira] [Commented] (CASSANDRA-1717) Cassandra cannot detect corrupt-but-readable column data
[ https://issues.apache.org/jira/browse/CASSANDRA-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079449#comment-13079449 ]

Ryan King commented on CASSANDRA-1717:
--------------------------------------

I think checksums per column would be way too much overhead. We already add a lot of overhead to all data stored in Cassandra, we should be careful about adding more.

Cassandra cannot detect corrupt-but-readable column data
--------------------------------------------------------

Key: CASSANDRA-1717
URL: https://issues.apache.org/jira/browse/CASSANDRA-1717
Project: Cassandra
Issue Type: New Feature
Components: Core
Reporter: Jonathan Ellis
Assignee: Pavel Yaskevich
Fix For: 1.0
Attachments: checksums.txt

Most corruptions of on-disk data due to bitrot render the column (or row) unreadable, so the data can be replaced by read repair or anti-entropy. But if the corruption keeps column data readable we do not detect it, and if it corrupts to a higher timestamp value can even resist being overwritten by newer values.
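One way to keep the per-column overhead the comment objects to in check is to checksum at a coarser granularity, such as fixed-size blocks of the data file. A sketch of that trade-off only; the names, method shapes, and block size are assumptions, not the design of the attached patch:

```java
import java.util.zip.CRC32;

// Sketch: checksumming fixed-size blocks amortizes the space overhead over
// many columns, while still catching corrupt-but-readable bytes on read.
public class BlockChecksum {
    // On write: one CRC32 per block of the data file.
    public static long[] checksumBlocks(byte[] data, int blockSize) {
        int nBlocks = (data.length + blockSize - 1) / blockSize;
        long[] sums = new long[nBlocks];
        for (int i = 0; i < nBlocks; i++) {
            CRC32 crc = new CRC32();
            int off = i * blockSize;
            crc.update(data, off, Math.min(blockSize, data.length - off));
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // On read: recompute one block's CRC and compare; a mismatch flags
    // bitrot even when the bytes still parse as a valid column.
    public static boolean verifyBlock(byte[] data, int blockSize, int block, long expected) {
        CRC32 crc = new CRC32();
        int off = block * blockSize;
        crc.update(data, off, Math.min(blockSize, data.length - off));
        return crc.getValue() == expected;
    }
}
```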
[jira] [Commented] (CASSANDRA-2506) Push read repair setting down to the DC-level
[ https://issues.apache.org/jira/browse/CASSANDRA-2506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073634#comment-13073634 ]

Ryan King commented on CASSANDRA-2506:
--------------------------------------

It would also be nice if you could specify a different repair rate for intra-DC and inter-DC repairs.

Push read repair setting down to the DC-level
---------------------------------------------

Key: CASSANDRA-2506
URL: https://issues.apache.org/jira/browse/CASSANDRA-2506
Project: Cassandra
Issue Type: New Feature
Components: Core
Reporter: Brandon Williams
Assignee: Patricio Echague
Fix For: 1.0

Currently, read repair is a global setting. However, when you have two DCs and use one for analytics, it would be nice to turn it off only for that DC so the live DC serving the application can still benefit from it.
[jira] [Commented] (CASSANDRA-2498) Improve read performance in update-intensive workload
[ https://issues.apache.org/jira/browse/CASSANDRA-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073633#comment-13073633 ]

Ryan King commented on CASSANDRA-2498:
--------------------------------------

Not sure what we can do about that unless we make counters idempotent, which may not be feasible.

Improve read performance in update-intensive workload
-----------------------------------------------------

Key: CASSANDRA-2498
URL: https://issues.apache.org/jira/browse/CASSANDRA-2498
Project: Cassandra
Issue Type: Improvement
Components: Core
Reporter: Jonathan Ellis
Assignee: Sylvain Lebresne
Priority: Minor
Labels: ponies
Fix For: 1.0
Attachments: 2498-v2.txt, supersede-name-filter-collations.patch

Read performance in an update-heavy environment relies heavily on compaction to maintain good throughput. (This is not the case for workloads where rows are only inserted once, because the bloom filter keeps us from having to check sstables unnecessarily.)

Very early versions of Cassandra attempted to mitigate this by checking sstables in descending generation order (mostly equivalent to descending mtime): once all the requested columns were found, it would not check any older sstables. This was incorrect, because data timestamp will not correspond to sstable timestamp, both because compaction has the side effect of refreshing data to a newer sstable, and because hinted handoff may send us data older than what we already have.

Instead, we could create a per-sstable piece of metadata containing the most recent (client-specified) timestamp for any column in the sstable. We could then sort sstables by this timestamp instead, and perform a similar optimization (if the remaining sstable client-timestamps are older than the oldest column found in the desired result set so far, we don't need to look further). Since under almost every workload, client timestamps of data in a given sstable will tend to be similar, we expect this to cut the number of sstables down proportionally to how frequently each column in the row is updated. (If each column is updated with each write, we only have to check a single sstable.)

This may also be useful information when deciding which SSTables to compact. (Note that this optimization is only appropriate for named-column queries, not slice queries, since we don't know what non-overlapping columns may exist in older sstables.)
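The proposed read path can be sketched as follows: order sstables by their per-sstable max client timestamp, newest first, and stop once every requested column has been found with a timestamp newer than anything the remaining sstables could hold. Names and data shapes here are assumptions for illustration:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the timestamp-ordered named-column read described above.
public class TimestampOrderedRead {
    public static final class SSTable {
        final long maxTimestamp;             // max client timestamp in this sstable
        final Map<String, Long> columns;     // column name -> its timestamp here
        public SSTable(long maxTimestamp, Map<String, Long> columns) {
            this.maxTimestamp = maxTimestamp;
            this.columns = columns;
        }
    }

    // Fills `result` with the newest timestamp found per requested column;
    // returns how many sstables were actually consulted.
    public static int read(List<SSTable> sstables, Set<String> requested, Map<String, Long> result) {
        sstables.sort((a, b) -> Long.compare(b.maxTimestamp, a.maxTimestamp));
        int consulted = 0;
        for (SSTable s : sstables) {
            // If every column is found and even the oldest winner is newer
            // than anything left to scan, no remaining sstable can matter.
            if (result.keySet().containsAll(requested)
                && Collections.min(result.values()) > s.maxTimestamp) break;
            consulted++;
            for (String name : requested) {
                Long ts = s.columns.get(name);
                if (ts != null && ts > result.getOrDefault(name, Long.MIN_VALUE))
                    result.put(name, ts);
            }
        }
        return consulted;
    }
}
```

With every column updated on each write, the newest sstable satisfies the query and the scan stops after one file, matching the best case the ticket describes.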
[jira] [Commented] (CASSANDRA-1379) Uncached row reads may block cached reads
[ https://issues.apache.org/jira/browse/CASSANDRA-1379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073653#comment-13073653 ]

Ryan King commented on CASSANDRA-1379:
--------------------------------------

We have use cases along these lines too. We've had to resort to bumping the read threads up much higher (128 or 256) for highly cached workloads.

Uncached row reads may block cached reads
-----------------------------------------

Key: CASSANDRA-1379
URL: https://issues.apache.org/jira/browse/CASSANDRA-1379
Project: Cassandra
Issue Type: New Feature
Components: Core
Reporter: David King
Assignee: Javier Canillas
Priority: Minor
Attachments: CASSANDRA-1379.patch

The cap on the number of concurrent reads appears to cap the *total* number of concurrent reads instead of just capping the reads that are bound for disk. That is, given N concurrent readers, if all of them are busy waiting on disk, even reads that can be served from the row cache will block waiting for them.
[jira] [Commented] (CASSANDRA-1608) Redesigned Compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13071857#comment-13071857 ]

Ryan King commented on CASSANDRA-1608:
--------------------------------------

bq. bq. Is it even worth keeping bloom filters around with such a drastic reduction in worst-case number of sstables to check (for read path too)?
bq. I think they are absolutely worth keeping around for unleveled sstables, but for leveled sstables the value is certainly questionable. Perhaps having some kind of LRU cache where we have an upper bound on the number of bloom filters we keep in memory would be wise.

Is it possible that we could move these off-heap? I admit that I probably don't fully understand this change, but we have at least one workload where keeping BFs would probably be necessary: the vast majority of the traffic on that workload is for keys that don't exist anywhere. Even small bumps in BF false positive rates greatly affect the read performance.

Redesigned Compaction
---------------------

Key: CASSANDRA-1608
URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
Project: Cassandra
Issue Type: Improvement
Components: Core
Reporter: Chris Goffinet
Assignee: Benjamin Coverston
Attachments: 1608-v2.txt, 1608-v8.txt, 1609-v10.txt

After seeing the I/O issues in CASSANDRA-1470, I've been doing some more thinking on this subject that I wanted to lay out. I propose we redo the concept of how compaction works in Cassandra. At the moment, compaction is kicked off based on a write access pattern, not read access pattern. In most cases, you want the opposite. You want to be able to track how well each SSTable is performing in the system. If we were to keep statistics in-memory for each SSTable, prioritize them based on most accessed, and bloom filter hit/miss ratios, we could intelligently group sstables that are being read most often and schedule them for compaction. We could also schedule lower priority maintenance on SSTables not often accessed.

I also propose we limit the size of each SSTable to a fixed size, which gives us the ability to better utilize our bloom filters in a predictable manner. At the moment, after a certain size, the bloom filters become less reliable. This would also allow us to group data most accessed. Currently the size of an SSTable can grow to a point where large portions of the data might not actually be accessed as often.
[jira] [Commented] (CASSANDRA-2897) Secondary indexes without read-before-write
[ https://issues.apache.org/jira/browse/CASSANDRA-2897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064991#comment-13064991 ]

Ryan King commented on CASSANDRA-2897:
--------------------------------------

Can't we deal with the races by properly using timestamps?

Secondary indexes without read-before-write
-------------------------------------------

Key: CASSANDRA-2897
URL: https://issues.apache.org/jira/browse/CASSANDRA-2897
Project: Cassandra
Issue Type: Improvement
Components: Core
Affects Versions: 0.7.0
Reporter: Sylvain Lebresne
Priority: Minor
Labels: secondary_index

Currently, secondary index updates require a read-before-write to maintain the index consistency. Keeping the index consistent at all times is not necessary, however. We could let the (secondary) index get inconsistent on writes and repair it on reads. This would be easy because on reads, we make sure to request the indexed columns anyway, so we can just skip the rows that are not needed and repair the index at the same time.

This does trade work on writes for work on reads. However, read-before-write is sufficiently costly that it will likely be a win overall. There are (at least) two small technical difficulties here though:
# If we repair on read, this will be racy with writes, so we'll probably have to synchronize there.
# We probably shouldn't only rely on reads to repair, and we should also have a task to repair the index for things that are rarely read. It's unclear how to make that low impact though.
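The timestamp-based rule the comment asks about can be sketched as a staleness test (hypothetical names; this is the decision predicate only, not the full repair path):

```java
// Sketch: an index entry is stale when the base row now carries a different
// value for the indexed column under a strictly newer timestamp. Repair can
// then delete the entry using the stale entry's own timestamp, so a
// concurrent legitimate write with a newer timestamp is not clobbered --
// which is one way timestamps could address the race in point 1 above.
public class IndexRepairOnRead {
    public static boolean entryIsStale(String indexedValue, long indexTimestamp,
                                       String baseValue, long baseTimestamp) {
        return baseTimestamp > indexTimestamp && !baseValue.equals(indexedValue);
    }
}
```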
[jira] [Commented] (CASSANDRA-2819) Split rpc timeout for read and write ops
[ https://issues.apache.org/jira/browse/CASSANDRA-2819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060119#comment-13060119 ]

Ryan King commented on CASSANDRA-2819:
--------------------------------------

Opened https://issues.apache.org/jira/browse/CASSANDRA-2858 for the followup.

Split rpc timeout for read and write ops
----------------------------------------

Key: CASSANDRA-2819
URL: https://issues.apache.org/jira/browse/CASSANDRA-2819
Project: Cassandra
Issue Type: New Feature
Components: Core
Reporter: Stu Hood
Assignee: Melvin Wang
Fix For: 1.0
Attachments: twttr-cassandra-0.8-counts-resync-rpc-rw-timeouts.diff

Given the vastly different latency characteristics of reads and writes, it makes sense for them to have independent rpc timeouts internally.
[jira] [Created] (CASSANDRA-2858) make request dropping more accurate
make request dropping more accurate
-----------------------------------

Key: CASSANDRA-2858
URL: https://issues.apache.org/jira/browse/CASSANDRA-2858
Project: Cassandra
Issue Type: Improvement
Reporter: Ryan King
Assignee: Melvin Wang
Priority: Minor

Based on the discussion in https://issues.apache.org/jira/browse/CASSANDRA-2819, we can make the bookkeeping for request times more accurate.
[jira] [Commented] (CASSANDRA-2817) Expose number of threads blocked on submitting a memtable for flush
[ https://issues.apache.org/jira/browse/CASSANDRA-2817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053946#comment-13053946 ]

Ryan King commented on CASSANDRA-2817:
--------------------------------------

+1

Expose number of threads blocked on submitting a memtable for flush
-------------------------------------------------------------------

Key: CASSANDRA-2817
URL: https://issues.apache.org/jira/browse/CASSANDRA-2817
Project: Cassandra
Issue Type: Bug
Components: Core
Affects Versions: 0.7.0
Reporter: Sylvain Lebresne
Assignee: Sylvain Lebresne
Priority: Minor
Fix For: 0.7.7
Attachments: 0001-Expose-threads-blocked-on-submission-to-executor.patch

Writes can be blocked by a thread trying to submit a memtable while the flush queue is full. While this is the expected behavior (the goal being to prevent OOMing), it is worth exposing when that happens so that people can monitor it and modify settings accordingly if it happens too often.
[jira] [Issue Comment Edited] (CASSANDRA-2045) Simplify HH to decrease read load when nodes come back
[ https://issues.apache.org/jira/browse/CASSANDRA-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986185#comment-12986185 ]

Ryan King edited comment on CASSANDRA-2045 at 6/23/11 5:18 PM:
---------------------------------------------------------------

I think the two approaches are suitable for different kinds of data models. The pointer approach is almost certainly better for narrow rows, while worse for large, dynamic rows.

was (Author: kingryan):
I think the two approaches are suitable for different kinds of data models. The point approach is almost certainly better for narrow rows, while worse for large, dynamic rows.

Simplify HH to decrease read load when nodes come back
------------------------------------------------------

Key: CASSANDRA-2045
URL: https://issues.apache.org/jira/browse/CASSANDRA-2045
Project: Cassandra
Issue Type: Improvement
Reporter: Chris Goffinet
Assignee: Nicholas Telford
Fix For: 1.0
Attachments: 0001-Changed-storage-of-Hints-to-store-a-serialized-RowMu.patch, 0002-Refactored-HintedHandoffManager.sendRow-to-reduce-co.patch, 0003-Fixed-some-coding-style-issues.patch, 0004-Fixed-direct-usage-of-Gossiper.getEndpointStateForEn.patch, 0005-Removed-duplicate-failure-detection-conditionals.-It.patch, 0006-Removed-handling-of-old-style-hints.patch, CASSANDRA-2045-simplify-hinted-handoff-001.diff, CASSANDRA-2045-simplify-hinted-handoff-002.diff

Currently when HH is enabled, hints are stored, and when a node comes back, we begin sending that node data. We do a lookup on the local node for the row to send. To help reduce read load (if a node is offline for a long period of time) we should store the data we want to forward to the node locally instead. We wouldn't have to do any lookups, just take the byte[] and send it to the destination.
[jira] [Updated] (CASSANDRA-2804) expose dropped messages, exceptions over JMX
[ https://issues.apache.org/jira/browse/CASSANDRA-2804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan King updated CASSANDRA-2804:
---------------------------------

Attachment: twttr-cassandra-0.8-counts-resync-droppedmsg-metric.diff

Funny: we have a patch we've been working on for similar things. The attached patch only does dropped messages, but it also includes a recent variant in JMX, which we need for our monitoring.

expose dropped messages, exceptions over JMX
--------------------------------------------

Key: CASSANDRA-2804
URL: https://issues.apache.org/jira/browse/CASSANDRA-2804
Project: Cassandra
Issue Type: Improvement
Components: Tools
Reporter: Jonathan Ellis
Assignee: Jonathan Ellis
Priority: Minor
Fix For: 0.7.7, 0.8.2
Attachments: 2804.txt, twttr-cassandra-0.8-counts-resync-droppedmsg-metric.diff

Patch against 0.7.
[jira] [Commented] (CASSANDRA-47) SSTable compression
[ https://issues.apache.org/jira/browse/CASSANDRA-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052734#comment-13052734 ]

Ryan King commented on CASSANDRA-47:
------------------------------------

I think this is going to be obsoleted by CASSANDRA-674.

SSTable compression
-------------------

Key: CASSANDRA-47
URL: https://issues.apache.org/jira/browse/CASSANDRA-47
Project: Cassandra
Issue Type: New Feature
Components: Core
Reporter: Jonathan Ellis
Assignee: Pavel Yaskevich
Labels: compression
Fix For: 1.0

We should be able to do SSTable compression, which would trade CPU for I/O (almost always a good trade).
[jira] [Commented] (CASSANDRA-1717) Cassandra cannot detect corrupt-but-readable column data
[ https://issues.apache.org/jira/browse/CASSANDRA-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13052765#comment-13052765 ] Ryan King commented on CASSANDRA-1717: -- I know I'm starting to sound like a broken record, but CASSANDRA-674 is going to include checksums. And it's almost ready for review. Cassandra cannot detect corrupt-but-readable column data Key: CASSANDRA-1717 URL: https://issues.apache.org/jira/browse/CASSANDRA-1717 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Jonathan Ellis Assignee: Pavel Yaskevich Fix For: 1.0 Attachments: checksums.txt Most corruptions of on-disk data due to bitrot render the column (or row) unreadable, so the data can be replaced by read repair or anti-entropy. But if the corruption keeps column data readable we do not detect it, and if it corrupts to a higher timestamp value it can even resist being overwritten by newer values. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2781) regression: exposing cache size through MBean
[ https://issues.apache.org/jira/browse/CASSANDRA-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13050532#comment-13050532 ] Ryan King commented on CASSANDRA-2781: -- It would be nice if we had some tests around these things, but I'm +1 on this patch. regression: exposing cache size through MBean - Key: CASSANDRA-2781 URL: https://issues.apache.org/jira/browse/CASSANDRA-2781 Project: Cassandra Issue Type: Bug Components: Core Reporter: Chris Burroughs Assignee: Chris Burroughs Priority: Minor Attachments: 2781-v1.txt Looks like it was part of CASSANDRA-1969. A method called size, as opposed to getSize, won't be exposed through jmx. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2521) Move away from Phantom References for Compaction/Memtable
[ https://issues.apache.org/jira/browse/CASSANDRA-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13050597#comment-13050597 ] Ryan King commented on CASSANDRA-2521: -- +1 for less hacking around the GC Move away from Phantom References for Compaction/Memtable - Key: CASSANDRA-2521 URL: https://issues.apache.org/jira/browse/CASSANDRA-2521 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Chris Goffinet Assignee: Sylvain Lebresne Fix For: 1.0 Attachments: 0001-Use-reference-counting-to-decide-when-a-sstable-can-.patch http://wiki.apache.org/cassandra/MemtableSSTable Let's move to using reference counting instead of relying on GC to be called in StorageService. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-2751) Improved Metrics collection
[ https://issues.apache.org/jira/browse/CASSANDRA-2751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-2751: - Description: Collecting metrics in cassandra needs to be easier. Currently the amount of work required to expose one new metric in the server and consume it outside the server is way too high. In my mind, collecting a new metric in the server should be a single line of code and consuming it should be easily doable from any programming language. There are several options for better metrics collection on the JVM: https://github.com/twitter/ostrich https://github.com/codahale/metrics/ https://github.com/twitter/commons/tree/master/src/java/com/twitter/common/stats We should look at these was: Collecting metrics in cassandra needs to be easier. Currently the amount of work required to expose one new metric in the server and consume it outside the server is way too high. In my mind, collecting a new metric in the server should be a single line of code and consuming it should be easily doable from any programming language. There are several options for better metrics collection on the JVM: https://github.com/twitter/ostrich https://github.com/codahale/metrics/ We should look at these Improved Metrics collection --- Key: CASSANDRA-2751 URL: https://issues.apache.org/jira/browse/CASSANDRA-2751 Project: Cassandra Issue Type: Improvement Reporter: Ryan King Assignee: Ryan King Collecting metrics in cassandra needs to be easier. Currently the amount of work required to expose one new metric in the server and consume it outside the server is way too high. In my mind, collecting a new metric in the server should be a single line of code and consuming it should be easily doable from any programming language. 
There are several options for better metrics collection on the JVM: https://github.com/twitter/ostrich https://github.com/codahale/metrics/ https://github.com/twitter/commons/tree/master/src/java/com/twitter/common/stats We should look at these -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2749) fine-grained control over data directories
[ https://issues.apache.org/jira/browse/CASSANDRA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13046069#comment-13046069 ] Ryan King commented on CASSANDRA-2749: -- Since each keyspace is stored in a different sub-directory of the DataDirectories, you can already split the storage of different keyspaces with some clever mount options. Maybe we could give column families the same treatment? fine-grained control over data directories -- Key: CASSANDRA-2749 URL: https://issues.apache.org/jira/browse/CASSANDRA-2749 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Jonathan Ellis Priority: Minor Currently Cassandra supports multiple data directories but no way to control what sstables are placed where. Particularly for systems with mixed SSDs and rotational disks, it would be nice to pin frequently accessed columnfamilies to the SSDs. Postgresql does this with tablespaces (http://www.postgresql.org/docs/9.0/static/manage-ag-tablespaces.html) but we should probably avoid using that name because of confusing similarity to keyspaces. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (CASSANDRA-2751) Improved Metrics collection
Improved Metrics collection --- Key: CASSANDRA-2751 URL: https://issues.apache.org/jira/browse/CASSANDRA-2751 Project: Cassandra Issue Type: Improvement Reporter: Ryan King Assignee: Ryan King Collecting metrics in cassandra needs to be easier. Currently the amount of work required to expose one new metric in the server and consume it outside the server is way too high. In my mind, collecting a new metric in the server should be a single line of code and consuming it should be easily doable from any programming language. There are several options for better metrics collection on the JVM: https://github.com/twitter/ostrich https://github.com/codahale/metrics/ We should look at these -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2686) Distributed per row locks
[ https://issues.apache.org/jira/browse/CASSANDRA-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13038070#comment-13038070 ] Ryan King commented on CASSANDRA-2686: -- You'll likely end up reimplementing something like Paxos (what Google's Chubby uses) or ZAB (what ZooKeeper uses). Distributed per row locks - Key: CASSANDRA-2686 URL: https://issues.apache.org/jira/browse/CASSANDRA-2686 Project: Cassandra Issue Type: Wish Components: Core Environment: any Reporter: Luís Ferreira Labels: api-addition, features Instead of using a centralized locking strategy like cages with zookeeper, I would like to have it in a decentralized way. Even if it carries some limitations. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2686) Distributed per row locks
[ https://issues.apache.org/jira/browse/CASSANDRA-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13038078#comment-13038078 ] Ryan King commented on CASSANDRA-2686: -- Those protocols are methods for reaching agreement. You're basically describing how ZK works. Distributed per row locks - Key: CASSANDRA-2686 URL: https://issues.apache.org/jira/browse/CASSANDRA-2686 Project: Cassandra Issue Type: Wish Components: Core Environment: any Reporter: Luís Ferreira Labels: api-addition, features Instead of using a centralized locking strategy like cages with zookeeper, I would like to have it in a decentralized way. Even if it carries some limitations. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-1610) Pluggable Compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034878#comment-13034878 ] Ryan King commented on CASSANDRA-1610: -- Agreed. Pluggable Compaction Key: CASSANDRA-1610 URL: https://issues.apache.org/jira/browse/CASSANDRA-1610 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Chris Goffinet Assignee: Alan Liang Priority: Minor Labels: compaction Fix For: 1.0 Attachments: 0001-move-compaction-code-into-own-package.patch, 0002-Pluggable-Compaction-and-Expiration.patch In CASSANDRA-1608, I proposed some changes on how compaction works. I think it also makes sense to allow the ability to have pluggable compaction per CF. There could be many types of workloads where this makes sense. One example we had at Digg was to completely throw away certain SSTables after N days. The goal of this ticket is to make compaction pluggable enough to support compaction based on max timestamp ordering of the sstables while satisfying max sstable size, min and max compaction thresholds. Another goal is to allow expiration of sstables based on a timestamp. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-47) SSTable compression
[ https://issues.apache.org/jira/browse/CASSANDRA-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034143#comment-13034143 ] Ryan King commented on CASSANDRA-47: Stu is working on https://issues.apache.org/jira/browse/CASSANDRA-674 which will improve the file size dramatically. SSTable compression --- Key: CASSANDRA-47 URL: https://issues.apache.org/jira/browse/CASSANDRA-47 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Jonathan Ellis Priority: Minor Labels: compression Fix For: 1.0 We should be able to do SSTable compression which would trade CPU for I/O (almost always a good trade). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-2657) Allow configuration of multiple types of the Thrift server
[ https://issues.apache.org/jira/browse/CASSANDRA-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-2657: - Summary: Allow configuration of multiple types of the Thrift server (was: Allow configuration of multiple types of the Trift server) Allow configuration of multiple types of the Thrift server -- Key: CASSANDRA-2657 URL: https://issues.apache.org/jira/browse/CASSANDRA-2657 Project: Cassandra Issue Type: Improvement Components: Core Affects Versions: 0.8.1 Environment: JVM 1.6 Reporter: Vijay Assignee: Vijay Fix For: 0.8.0, 0.8.1 Thrift server has multiple modes of operations specifically... 1) TNonblockingServer 2) THsHaServer 3) TThreadPoolServer We should provide a configuration to enable all of the above. The client library can either use Async or the Sync... (independent of the server side) This patch also might address the issue (which we where seeing), when there are large number of connections to the server (throughput reduces). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-2597) inconsistent implementation of 'cumulative distribution function' for Exponential Distribution
[ https://issues.apache.org/jira/browse/CASSANDRA-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-2597: - Component/s: (was: Core) Contrib Description: As reported on the mailing list (http://mail-archives.apache.org/mod_mbox/cassandra-dev/201104.mbox/%3CAANLkTimdMSLE8-z0x+0kvzqp7za3AEMLaOFXvd4Z=t...@mail.gmail.com%3E), {quote} I just found there are two implementations of the 'cumulative distribution function' for the Exponential Distribution, and they are inconsistent:
*FailureDetector*
{code:java}
// org.apache.cassandra.gms.ArrivalWindow.p(double)
double p(double t)
{
    double mean = mean();
    double exponent = (-1) * (t) / mean;
    return Math.pow(Math.E, exponent);
}
{code}
*DynamicEndpointSnitch*
{code:java}
// org.apache.cassandra.locator.AdaptiveLatencyTracker.p(double)
double p(double t)
{
    double mean = mean();
    double exponent = (-1) * (t) / mean;
    return 1 - Math.pow(Math.E, exponent);
}
{code}
According to the definition of the cumulative distribution function for the Exponential Distribution (http://en.wikipedia.org/wiki/Exponential_distribution#Cumulative_distribution_function), the latter one is correct. {quote} ... however FailureDetector has been working as advertised for some time now. Does this mean the Snitch version is actually wrong? 
was: As reported on the mailing list (http://mail-archives.apache.org/mod_mbox/cassandra-dev/201104.mbox/%3CAANLkTimdMSLE8-z0x+0kvzqp7za3AEMLaOFXvd4Z=t...@mail.gmail.com%3E), {quote} I just found there are two implementations of 'cumulative distribution function' for Exponential Distribution and there are inconsistent : *FailureDetector* org.apache.cassandra.gms.ArrivalWindow.p(double) double p(double t) { double mean = mean(); double exponent = (-1)*(t)/mean; return *Math.pow(Math.E, exponent)*; } *DynamicEndpointSnitch* org.apache.cassandra.locator.AdaptiveLatencyTracker.p(double) double p(double t) { double mean = mean(); double exponent = (-1) * (t) / mean; return *1 - Math.pow( Math.E, exponent);* } According to the Exponential Distribution cumulative distribution function definitionhttp://en.wikipedia.org/wiki/Exponential_distribution#Cumulative_distribution_function, the later one is correct {quote} ... however FailureDetector has been working as advertised for some time now. Does this mean the Snitch version is actually wrong? 
Fix Version/s: (was: 0.7.7) 0.7.6 inconsistent implementation of 'cumulative distribution function' for Exponential Distribution -- Key: CASSANDRA-2597 URL: https://issues.apache.org/jira/browse/CASSANDRA-2597 Project: Cassandra Issue Type: Bug Components: Contrib Reporter: Jonathan Ellis Assignee: paul cannon Priority: Minor Fix For: 0.7.6 As reported on the mailing list (http://mail-archives.apache.org/mod_mbox/cassandra-dev/201104.mbox/%3CAANLkTimdMSLE8-z0x+0kvzqp7za3AEMLaOFXvd4Z=t...@mail.gmail.com%3E), {quote} I just found there are two implementations of the 'cumulative distribution function' for the Exponential Distribution, and they are inconsistent:
*FailureDetector*
{code:java}
// org.apache.cassandra.gms.ArrivalWindow.p(double)
double p(double t)
{
    double mean = mean();
    double exponent = (-1) * (t) / mean;
    return Math.pow(Math.E, exponent);
}
{code}
*DynamicEndpointSnitch*
{code:java}
// org.apache.cassandra.locator.AdaptiveLatencyTracker.p(double)
double p(double t)
{
    double mean = mean();
    double exponent = (-1) * (t) / mean;
    return 1 - Math.pow(Math.E, exponent);
}
{code}
According to the definition of the cumulative distribution function for the Exponential Distribution (http://en.wikipedia.org/wiki/Exponential_distribution#Cumulative_distribution_function), the latter one is correct. {quote} ... however FailureDetector has been working as advertised for some time now. Does this mean the Snitch version is actually wrong? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
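As a side note on the two formulas quoted in this ticket, a quick standalone check (illustrative class and method names, not Cassandra code) shows they are exact complements: for an exponential distribution with mean m, 1 - e^(-t/m) is the CDF P(T <= t), while e^(-t/m) is the tail probability P(T > t), which is arguably the quantity a failure detector cares about (how likely it is that an arrival interval exceeds t).

```java
// Sketch comparing the two formulas from the ticket. Names are
// illustrative; this is not code from the Cassandra source tree.
class ExpDistCheck {
    // Cumulative distribution function: P(T <= t) for an exponential
    // distribution with the given mean.
    static double cdf(double t, double mean) {
        return 1 - Math.exp(-t / mean);
    }

    // Survival (tail) function: P(T > t), the complement of the CDF.
    static double survival(double t, double mean) {
        return Math.exp(-t / mean);
    }

    public static void main(String[] args) {
        double mean = 500.0, t = 1500.0;
        // The two definitions are complements, so they always sum to 1.
        System.out.println(cdf(t, mean) + survival(t, mean));
    }
}
```

So the two snippets in the ticket compute different (complementary) quantities; whether each call site wants the CDF or the tail probability is the real question.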
[jira] [Assigned] (CASSANDRA-2003) get_range_slices test
[ https://issues.apache.org/jira/browse/CASSANDRA-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King reassigned CASSANDRA-2003: Assignee: Stu Hood (was: Kelvin Kakugawa) get_range_slices test - Key: CASSANDRA-2003 URL: https://issues.apache.org/jira/browse/CASSANDRA-2003 Project: Cassandra Issue Type: Test Components: Core Environment: RandomPartitioner Reporter: Kelvin Kakugawa Assignee: Stu Hood Priority: Minor Fix For: 0.8.1 Attachments: 0002-Assert-that-we-don-t-double-count-any-keys.txt, CASSANDRA-2003-0.7-0001.patch, CASSANDRA-2003-0001.patch Test get_range_slices (on an RP cluster) to walk: * all keys on each node * all keys across cluster -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-1608) Redesigned Compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13033294#comment-13033294 ] Ryan King commented on CASSANDRA-1608: -- I only read the LevelDB stuff briefly. I think there's a lot we can learn, but there are at least two challenges: 1) client-supplied timestamps mean that you can't know that newer files supersede older ones; 2) the CF data model means that data for a given key in multiple sstables may need to be merged Redesigned Compaction - Key: CASSANDRA-1608 URL: https://issues.apache.org/jira/browse/CASSANDRA-1608 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Chris Goffinet After seeing the I/O issues in CASSANDRA-1470, I've been doing some more thinking on this subject that I wanted to lay out. I propose we redo the concept of how compaction works in Cassandra. At the moment, compaction is kicked off based on a write access pattern, not read access pattern. In most cases, you want the opposite. You want to be able to track how well each SSTable is performing in the system. If we were to keep statistics in-memory of each SSTable, prioritize them based on most accessed, and bloom filter hit/miss ratios, we could intelligently group sstables that are being read most often and schedule them for compaction. We could also schedule lower priority maintenance on SSTables not often accessed. I also propose we limit the size of each SSTable to a fixed size; that gives us the ability to better utilize our bloom filters in a predictable manner. At the moment after a certain size, the bloom filters become less reliable. This would also allow us to group data most accessed. Currently the size of an SSTable can grow to a point where large portions of the data might not actually be accessed as often. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
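The second challenge in that comment, merging a key's columns across sstables, can be sketched as a timestamp-wins reconciliation. This is a deliberately simplified illustration (hypothetical names and types, no tombstones or counters), not Cassandra's actual merge code:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified reconciliation of one row's columns drawn from several
// sstables: for each column name, the version with the highest
// client-supplied timestamp wins. Illustrative only.
class RowMerge {
    static class Col {
        final String value;
        final long timestamp;
        Col(String value, long timestamp) { this.value = value; this.timestamp = timestamp; }
    }

    static Map<String, Col> merge(Iterable<Map<String, Col>> sstableVersions) {
        Map<String, Col> merged = new HashMap<>();
        for (Map<String, Col> version : sstableVersions)
            for (Map.Entry<String, Col> e : version.entrySet())
                // Keep whichever version has the newer timestamp; on a tie,
                // the one seen first is kept.
                merged.merge(e.getKey(), e.getValue(),
                             (a, b) -> a.timestamp >= b.timestamp ? a : b);
        return merged;
    }
}
```

The point of the sketch is that file order tells you nothing: because timestamps come from clients, every sstable holding the key has to be consulted, which is exactly what makes LevelDB-style "newer level wins" shortcuts unsafe here.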
[jira] [Commented] (CASSANDRA-1608) Redesigned Compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13031333#comment-13031333 ] Ryan King commented on CASSANDRA-1608: -- It's important to remember that LevelDB is key/value, not a column family data model, so there are concerns and constraints that apply to Cassandra which do not apply to LevelDB. Redesigned Compaction - Key: CASSANDRA-1608 URL: https://issues.apache.org/jira/browse/CASSANDRA-1608 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Chris Goffinet After seeing the I/O issues in CASSANDRA-1470, I've been doing some more thinking on this subject that I wanted to lay out. I propose we redo the concept of how compaction works in Cassandra. At the moment, compaction is kicked off based on a write access pattern, not read access pattern. In most cases, you want the opposite. You want to be able to track how well each SSTable is performing in the system. If we were to keep statistics in-memory of each SSTable, prioritize them based on most accessed, and bloom filter hit/miss ratios, we could intelligently group sstables that are being read most often and schedule them for compaction. We could also schedule lower priority maintenance on SSTables not often accessed. I also propose we limit the size of each SSTable to a fixed size; that gives us the ability to better utilize our bloom filters in a predictable manner. At the moment after a certain size, the bloom filters become less reliable. This would also allow us to group data most accessed. Currently the size of an SSTable can grow to a point where large portions of the data might not actually be accessed as often. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2614) create Column and CounterColumn in the same column family
[ https://issues.apache.org/jira/browse/CASSANDRA-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13030799#comment-13030799 ] Ryan King commented on CASSANDRA-2614: -- Ah, that makes more sense. create Column and CounterColumn in the same column family - Key: CASSANDRA-2614 URL: https://issues.apache.org/jira/browse/CASSANDRA-2614 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Dave Rav Assignee: Sylvain Lebresne Priority: Minor Fix For: 0.8.1 create Column and CounterColumn in the same column family -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2614) create Column and CounterColumn in the same column family
[ https://issues.apache.org/jira/browse/CASSANDRA-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13030160#comment-13030160 ] Ryan King commented on CASSANDRA-2614: -- I don't think this is feasible to do robustly. Several problems: 1) If a column is initially created as a counter, but a non-counter insert comes through, what do we do? We can't give the inserter an error unless we introduce reads in the write path. 2) The write path is somewhat different for the two kinds of columns. Counters don't really respect CLs the same way normal columns do. create Column and CounterColumn in the same column family - Key: CASSANDRA-2614 URL: https://issues.apache.org/jira/browse/CASSANDRA-2614 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Dave Rav Assignee: Sylvain Lebresne Priority: Minor Fix For: 0.8.1 create Column and CounterColumn in the same column family -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2540) Data reads by default
[ https://issues.apache.org/jira/browse/CASSANDRA-2540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13027140#comment-13027140 ] Ryan King commented on CASSANDRA-2540: -- Sylvain- I don't think it's the average performance that matters here, but the worst case. For our deployments we have latency targets at the 99th percentile. Some of those are quite low (< 10ms), so even a small number of requests that have to wait for the rpc timeout make our goals difficult, even if we lower the rpc timeout. Data reads by default - Key: CASSANDRA-2540 URL: https://issues.apache.org/jira/browse/CASSANDRA-2540 Project: Cassandra Issue Type: Wish Reporter: Stu Hood Priority: Minor The intention of digest vs data reads is to save bandwidth in the read path at the cost of latency, but I expect that this has been a premature optimization. * Data requested by a read will often be within an order of magnitude of the digest size, and a failed digest means extra roundtrips, more bandwidth * The [digest reads but not your data read|https://issues.apache.org/jira/browse/CASSANDRA-2282?focusedCommentId=13004656page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13004656] problem means failing QUORUM reads because a single node is unavailable, and would require eagerly re-requesting at some fraction of your timeout * Saving bandwidth in cross datacenter use cases comes at huge cost to latency, but since both constraints change proportionally (enough), the tradeoff is not clear Some options: # Add an option to use digest reads # Remove digest reads entirely (and/or punt and make them a runtime optimization based on data size in the future) # Continue to use digest reads, but send them to {{N - R}} nodes for (somewhat) more predictable behavior with QUORUM \\ The outcome of data-reads-by-default should be significantly improved latency, with a moderate increase in bandwidth usage for large reads. 
-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2558) Add concurrent_compactions configuration
[ https://issues.apache.org/jira/browse/CASSANDRA-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13026392#comment-13026392 ] Ryan King commented on CASSANDRA-2558: -- I believe Terje had compaction turned off during a bulk import. Those compactions happened when compaction was reactivated. Add concurrent_compactions configuration -- Key: CASSANDRA-2558 URL: https://issues.apache.org/jira/browse/CASSANDRA-2558 Project: Cassandra Issue Type: Improvement Components: Core Affects Versions: 0.8 beta 1 Reporter: Sylvain Lebresne Assignee: Sylvain Lebresne Priority: Trivial Fix For: 0.8.1 Attachments: 0001-Make-compaction-thread-number-configurable.patch Original Estimate: 2h Remaining Estimate: 2h We should expose a way to configure the max number of thread to use when multi_threaded compaction is turned on. So far, it uses nb_of_processors thread, which if you have many cores may be unreasonably high (as far as random IO is concerned and thus independently of compaction throttling)... at least unless you have SSD. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2498) Improve read performance in update-intensive workload
[ https://issues.apache.org/jira/browse/CASSANDRA-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13021247#comment-13021247 ] Ryan King commented on CASSANDRA-2498: -- In addition to update-heavy workloads, column families with wide rows need some love on the latency side too. Improve read performance in update-intensive workload - Key: CASSANDRA-2498 URL: https://issues.apache.org/jira/browse/CASSANDRA-2498 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Priority: Minor Labels: ponies Fix For: 1.0 Read performance in an update-heavy environment relies heavily on compaction to maintain good throughput. (This is not the case for workloads where rows are only inserted once, because the bloom filter keeps us from having to check sstables unnecessarily.) Very early versions of Cassandra attempted to mitigate this by checking sstables in descending generation order (mostly equivalent to descending mtime): once all the requested columns were found, it would not check any older sstables. This was incorrect, because data timestamp will not correspond to sstable timestamp, both because compaction has the side effect of refreshing data to a newer sstable, and because hinted handoff may send us data older than what we already have. Instead, we could create a per-sstable piece of metadata containing the most recent (client-specified) timestamp for any column in the sstable. We could then sort sstables by this timestamp instead, and perform a similar optimization (if the remaining sstable client-timestamps are older than the oldest column found in the desired result set so far, we don't need to look further). Since under almost every workload, client timestamps of data in a given sstable will tend to be similar, we expect this to cut the number of sstables down proportionally to how frequently each column in the row is updated. 
(If each column is updated with each write, we only have to check a single sstable.) This may also be useful information when deciding which SSTables to compact. (Note that this optimization is only appropriate for named-column queries, not slice queries, since we don't know what non-overlapping columns may exist in older sstables.) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
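The early-termination read described in this ticket can be sketched as follows, assuming each sstable records the newest client timestamp of any column it contains. All names and types here are illustrative, not from the Cassandra codebase:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the proposed optimization: scan sstables newest-first (by
// per-file max client timestamp) and stop once every requested column has
// been found and every remaining file is older than the oldest hit.
class TimestampOrderedRead {
    static class SSTable {
        final long maxTimestamp;          // newest column timestamp in this file
        final Map<String, Long> columns;  // column name -> its timestamp
        SSTable(long maxTimestamp, Map<String, Long> columns) {
            this.maxTimestamp = maxTimestamp;
            this.columns = columns;
        }
    }

    // Returns, for each requested column, the newest timestamp found.
    static Map<String, Long> read(List<SSTable> sstables, Set<String> wanted) {
        List<SSTable> byNewest = new ArrayList<>(sstables);
        byNewest.sort((a, b) -> Long.compare(b.maxTimestamp, a.maxTimestamp));

        Map<String, Long> result = new HashMap<>();
        for (SSTable s : byNewest) {
            // Stop early: we have every column, and nothing in the
            // remaining (older) files can beat what we already found.
            if (result.size() == wanted.size()
                && Collections.min(result.values()) > s.maxTimestamp)
                break;
            for (String name : wanted) {
                Long ts = s.columns.get(name);
                if (ts != null)
                    result.merge(name, ts, Math::max);
            }
        }
        return result;
    }
}
```

As the ticket notes, this only helps named-column queries; a slice query can't know which non-overlapping columns might still exist in older files, so it can never take the early exit.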
[jira] [Created] (CASSANDRA-2502) disable cache saving on system CFs
disable cache saving on system CFs -- Key: CASSANDRA-2502 URL: https://issues.apache.org/jira/browse/CASSANDRA-2502 Project: Cassandra Issue Type: Improvement Reporter: Ryan King Assignee: Ryan King Priority: Minor Attachments: 0001-disable-cache-saving-on-system-tables.patch -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-2502) disable cache saving on system CFs
[ https://issues.apache.org/jira/browse/CASSANDRA-2502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-2502: - Attachment: 0001-disable-cache-saving-on-system-tables.patch disable cache saving on system CFs -- Key: CASSANDRA-2502 URL: https://issues.apache.org/jira/browse/CASSANDRA-2502 Project: Cassandra Issue Type: Improvement Reporter: Ryan King Assignee: Ryan King Priority: Minor Attachments: 0001-disable-cache-saving-on-system-tables.patch -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-1329) make multiget take a set of keys instead of a list
[ https://issues.apache.org/jira/browse/CASSANDRA-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13019440#comment-13019440 ] Ryan King commented on CASSANDRA-1329: -- I don't know if this change is worth the breakage it causes (I assume that all clients will have to be updated). make multiget take a set of keys instead of a list -- Key: CASSANDRA-1329 URL: https://issues.apache.org/jira/browse/CASSANDRA-1329 Project: Cassandra Issue Type: Task Components: Core Reporter: Jonathan Ellis Priority: Minor Attachments: 1329-rebase.txt, 1329-stresspy-multiget.txt, 1329.txt, multiget.test, multigetsmall.test this more correctly sets the expectation that the order of keys in that list doesn't matter, and duplicates don't make sense -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2466) bloom filters should avoid huge array allocations to avoid fragmentation concerns
[ https://issues.apache.org/jira/browse/CASSANDRA-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13019457#comment-13019457 ] Ryan King commented on CASSANDRA-2466: -- Moving to smaller arrays would make the allocation easier, but wouldn't reduce the raw amount of memory needed for a large bloom filter. Would it be worth moving these off-heap completely? bloom filters should avoid huge array allocations to avoid fragmentation concerns - Key: CASSANDRA-2466 URL: https://issues.apache.org/jira/browse/CASSANDRA-2466 Project: Cassandra Issue Type: Bug Reporter: Peter Schuller Priority: Minor The fact that bloom filters are backed by single large arrays of longs is expected to interact badly with promotion of objects into old gen with CMS, due to fragmentation concerns (as discussed in CASSANDRA-2463). It should be less of an issue than CASSANDRA-2463 in the sense that you need to have a lot of rows before the array sizes become truly huge. For comparison, the ~ 143 million row key limit implied by the use of 'int' in BitSet prior to the switch to OpenBitSet translates roughly to 238 MB (assuming the limitation factor there was the addressability of the bits with a 32 bit int, which is my understanding). Having a preliminary look at OpenBitSet with an eye towards replacing the single long[] with multiple arrays, it seems that if we're willing to drop some of the functionality that is not used for bloom filter purposes, the bits[i] indexing should be pretty easy to augment with modulo to address an appropriate smaller array. Locality is not an issue since the bloom filter case is the worst possible case for locality anyway, and it doesn't matter whether it's one huge array or a number of ~ 64k arrays. Callers may be affected like BloomFilterSerializer which cares about the underlying bit array. 
If the full functionality of OpenBitSet is to be maintained (e.g., xorCount) some additional acrobatics would be necessary, presumably at a noticeable performance cost if such operations were used in performance-critical places. An argument against touching OpenBitSet is that it seems to be pretty carefully written and tested, has some non-trivial details, and has seemingly been benchmarked quite carefully. On the other hand, the improvement would then apply to other things as well, such as the bitsets used to keep track of in-core pages (off the cuff for scale, a 64 GB sstable should imply a 2 MB bit set, with one bit per 4k page). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
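The modulo-addressing idea described above can be sketched roughly as follows. This is a hypothetical illustration, not the OpenBitSet patch: the bit set is backed by many fixed-size long[] pages instead of one huge array, so no single allocation is large enough to raise the CMS fragmentation concerns discussed on the ticket.

```java
// Hypothetical paged bit set: bits[i] indexing augmented with division/modulo
// to address a smaller per-page array, as suggested in the ticket.
public final class PagedBitSet {
    private static final int PAGE_BITS = 1 << 16;          // ~64k bits per page
    private static final int PAGE_WORDS = PAGE_BITS / 64;  // 1024 longs per page

    private final long[][] pages;

    public PagedBitSet(long numBits) {
        int numPages = (int) ((numBits + PAGE_BITS - 1) / PAGE_BITS);
        pages = new long[numPages][PAGE_WORDS];
    }

    public void set(long bit) {
        // page index, then word within the page, then bit within the word
        pages[(int) (bit / PAGE_BITS)][(int) ((bit % PAGE_BITS) / 64)] |= 1L << (bit % 64);
    }

    public boolean get(long bit) {
        return (pages[(int) (bit / PAGE_BITS)][(int) ((bit % PAGE_BITS) / 64)]
                & (1L << (bit % 64))) != 0;
    }
}
```

As the comment notes, locality is already worst-case for bloom filter probes, so the extra indirection per access is the main cost of this layout.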
[jira] [Commented] (CASSANDRA-2428) Running cleanup on a node with join_ring=false removes all data
[ https://issues.apache.org/jira/browse/CASSANDRA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13017137#comment-13017137 ] Ryan King commented on CASSANDRA-2428: -- Sylvain- That seems like the right plan. Running cleanup on a node with join_ring=false removes all data --- Key: CASSANDRA-2428 URL: https://issues.apache.org/jira/browse/CASSANDRA-2428 Project: Cassandra Issue Type: Bug Components: Core Affects Versions: 0.7.1 Reporter: Chris Goffinet Assignee: Sylvain Lebresne Priority: Critical Fix For: 0.7.5 Attachments: 0001-Don-t-allow-cleanup-when-node-hasn-t-join-the-ring.patch If you need to bring up a node with join_ring=false for operator maintenance, and this node already has data, it will end up removing the data on the node. We noticed this when we were calling cleanup on a specific CF. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (CASSANDRA-1952) Support TTLs on counter columns
[ https://issues.apache.org/jira/browse/CASSANDRA-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King resolved CASSANDRA-1952. -- Resolution: Duplicate dupe of CASSANDRA-2103 Support TTLs on counter columns --- Key: CASSANDRA-1952 URL: https://issues.apache.org/jira/browse/CASSANDRA-1952 Project: Cassandra Issue Type: Improvement Components: API, Core Reporter: Stu Hood Priority: Minor We would like to support TTLs for counter columns, with the behaviour that the count is unset when the TTL expires, and that every mutation to the counter updates the TTL deadline. This would allow for interesting rate-limiting usecases, automatic cleanup of time-series data, and API consistency. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (CASSANDRA-2103) expiring counter columns
[ https://issues.apache.org/jira/browse/CASSANDRA-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King reassigned CASSANDRA-2103: Assignee: Ryan King (was: Kelvin Kakugawa) expiring counter columns Key: CASSANDRA-2103 URL: https://issues.apache.org/jira/browse/CASSANDRA-2103 Project: Cassandra Issue Type: New Feature Components: Core Affects Versions: 0.8 Reporter: Kelvin Kakugawa Assignee: Ryan King Fix For: 0.8 Attachments: 0001-CASSANDRA-2103-expiring-counters-logic-tests.patch add ttl functionality to counter columns. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-1418) Automatic, online load balancing
[ https://issues.apache.org/jira/browse/CASSANDRA-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-1418: - Fix Version/s: (was: 0.8) 1.0 Automatic, online load balancing Key: CASSANDRA-1418 URL: https://issues.apache.org/jira/browse/CASSANDRA-1418 Project: Cassandra Issue Type: Improvement Reporter: Stu Hood Fix For: 1.0 h2. Goal CASSANDRA-192 began with the intention of implementing full cluster load balancing, but ended up being (wisely) limited to a manual load balancing operation. This issue is an umbrella ticket for finishing the job of implementing automatic, always-on load balancing. It is possible to implement very efficient load balancing operations with a single process directing the rebalancing of all nodes, but avoiding such a central process and allowing individual nodes to make their own movement decisions would be ideal. h2. Components h3. Optimal movements for individual nodes h4. Ruhl One such approach is the Ruhl algorithm described on 192: https://issues.apache.org/jira/browse/CASSANDRA-192#action_12713079 . But as described, it performs excessive movement for large hotspots, and can take a long time to reach equilibrium. Consider the following ring: ||token||load|| |a|5| |c|5| |e|5| |f|40| |k|5| Assuming that node 'a' is the first to discover that 'f' is overloaded: it will apply Case 2, and assume half of 'f's load by moving to 'i', leaving both with 20 units. But this is not an optimal movement, because both 'f' and 'a/i' will still be holding data that they will need to give away. Additionally, 'a/i' can't begin giving the data away until it has finished receiving it. If node 'e' is the first to discover that 'f' is overloaded, it will apply Case 1, and 'f' will give half of its load to 'e' by moving to 'i'. Again, this is a non-optimal movement, because it will result in both 'e' and 'f/i' holding data that they need to give away. h4.
Adding load awareness to Ruhl Luckily, there appears to be a simple adjustment to the Ruhl algorithm that solves this problem by taking advantage of the fact that Cassandra knows the total load of a cluster, and can use it to calculate the average/ideal load ω. Once node j has decided it should take load from node i (based on the ε value in Ruhl), rather than node j taking 1/2 of the load on node i, it should choose a token such that either i or j ends up with a load within ε*ω of ω. Again considering the ring described above, and assuming ε == 1.0, the total load for the 5 nodes is 60, giving a ω of 12. If node 'a' is the first to discover 'f', it will choose to move to 'j' (a token that takes 12 or ω load units from 'f'), leaving 'f' with a load of 28. When combined with the improvement in the next section, this is closer to being an optimal movement, because 'a/j' will at worst have ε*ω of load to give away, and 'f' is in a position to start more movements. h3. Automatic load balancing Since the Ruhl algorithm only requires a node to make a decision based on itself and one other node, it should be relatively straightforward to add a timer on each node that periodically wakes up and executes the modified Ruhl algorithm if it is not already in the process of moving (based on pending ranges). Automatic balancing should probably be enabled by default, and should have a configurable per-node bandwidth cap. h3. Allowing concurrent moves on a node Allowing a node to give away multiple ranges at once allows for the type of quick balancing that is typically only attributed to vnodes. If a node is a hotspot, such as in the example above, the node should be able to quickly dump the load in a manner that causes minimal load on the rest of the cluster. Rather than transferring to 1 target at 10 MB/s, a hotspot can give to 5 targets at 2 MB/s each. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
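The modified move size above can be checked with a little arithmetic. This is a hypothetical sketch (class and method names are made up, not Cassandra code): instead of the light node j taking half of overloaded node i's load, it takes enough to land at the cluster-wide ideal load ω = totalLoad / nodeCount.

```java
// Hypothetical arithmetic for the load-aware Ruhl movement described above.
public final class RuhlMove {
    /** Load units the light node j should take from overloaded node i,
     *  choosing a token such that j ends up at the ideal load ω. */
    static double unitsToTake(double loadOfI, double totalLoad, int nodeCount) {
        double omega = totalLoad / nodeCount;  // average/ideal per-node load
        return Math.min(omega, loadOfI);       // never take more than i holds
    }
}
```

For the example ring, unitsToTake(40, 60, 5) gives 12: node 'a' takes ω = 12 units from 'f', leaving 'f' with 28, matching the walkthrough in the ticket.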
[jira] [Updated] (CASSANDRA-2089) Distributed test for the dynamic snitch
[ https://issues.apache.org/jira/browse/CASSANDRA-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-2089: - Fix Version/s: (was: 0.8) 1.0 Distributed test for the dynamic snitch --- Key: CASSANDRA-2089 URL: https://issues.apache.org/jira/browse/CASSANDRA-2089 Project: Cassandra Issue Type: Test Components: Core Reporter: Stu Hood Labels: des Fix For: 1.0 The dynamic snitch has turned into an essential component in dealing with partially failed nodes: it would be great to have it fully tested before the 0.8 release. In order to implement a proper test of the snitch, it is necessary to be able to flip a switch to place a node in a degraded state. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-1205) Unify Partitioners and AbstractTypes
[ https://issues.apache.org/jira/browse/CASSANDRA-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-1205: - Fix Version/s: (was: 0.8) 1.0 Unify Partitioners and AbstractTypes Key: CASSANDRA-1205 URL: https://issues.apache.org/jira/browse/CASSANDRA-1205 Project: Cassandra Issue Type: Improvement Reporter: Stu Hood Priority: Critical Fix For: 1.0 There is no good reason for Partitioners to have different semantics than AbstractTypes. Instead, we should probably have 2 partitioners: Random and Ordered, where the Ordered partitioner requires an AbstractType to be specified, defaulting to BytesType. One solution [suggested by jbellis|https://issues.apache.org/jira/browse/CASSANDRA-767?focusedCommentId=12841565page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12841565] is to have AbstractType generate a collation id (essentially, a Token) for a set of bytes. Looking forward, we should probably consider laying the groundwork to add native support for compound row keys here as well. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-2109) Improve default window size for DES
[ https://issues.apache.org/jira/browse/CASSANDRA-2109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-2109: - Fix Version/s: (was: 0.8) 1.0 Improve default window size for DES --- Key: CASSANDRA-2109 URL: https://issues.apache.org/jira/browse/CASSANDRA-2109 Project: Cassandra Issue Type: Improvement Reporter: Stu Hood Priority: Minor Labels: des Fix For: 1.0 The window size for DES is currently hardcoded at 100 requests. A larger window means that it takes longer to react to a suddenly slow node, but that you have a smoother transition for scores. An example of bad behaviour: with a window of size 100, we saw a case with a failing node where if enough requests could be answered quickly out of cache or bloomfilters, the window might be momentarily filled with 10 ms requests, pushing out requests that had to go to disk and took 10 seconds. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
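The failure mode described above is easy to reproduce with a toy model. This is a hypothetical fixed-size latency window (not the dynamic snitch's actual implementation): once the window fills, each new sample evicts the oldest, so a burst of fast cache hits can push a slow outlier out of a small window entirely.

```java
import java.util.ArrayDeque;

// Toy bounded latency window: illustrates the size-vs-reactivity tradeoff
// discussed in the ticket (names are made up for illustration).
public final class LatencyWindow {
    private final int capacity;
    private final ArrayDeque<Double> samples = new ArrayDeque<>();

    LatencyWindow(int capacity) { this.capacity = capacity; }

    void add(double latencyMs) {
        if (samples.size() == capacity)
            samples.removeFirst();  // oldest sample falls out of the window
        samples.addLast(latencyMs);
    }

    double score() {  // mean latency over the current window
        return samples.stream().mapToDouble(Double::doubleValue).average().orElse(0);
    }
}
```

With a window of 3, three 10 ms samples score 10; one 10,000 ms disk-bound request then lifts the score to 3340, but two more fast requests would evict it again, which is the instability a larger window smooths over.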
[jira] [Updated] (CASSANDRA-2045) Simplify HH to decrease read load when nodes come back
[ https://issues.apache.org/jira/browse/CASSANDRA-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-2045: - Fix Version/s: (was: 0.8) 1.0 Simplify HH to decrease read load when nodes come back -- Key: CASSANDRA-2045 URL: https://issues.apache.org/jira/browse/CASSANDRA-2045 Project: Cassandra Issue Type: Improvement Reporter: Chris Goffinet Fix For: 1.0 Currently when HH is enabled, hints are stored, and when a node comes back, we begin sending that node data. We do a lookup on the local node for the row to send. To help reduce read load (if a node is offline for a long period of time) we should store the data we want to forward to the node locally instead. We wouldn't have to do any lookups, just take the byte[] and send it to the destination. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-674) New SSTable Format
[ https://issues.apache.org/jira/browse/CASSANDRA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-674: Fix Version/s: (was: 0.8) 1.0 New SSTable Format -- Key: CASSANDRA-674 URL: https://issues.apache.org/jira/browse/CASSANDRA-674 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Stu Hood Fix For: 1.0 Attachments: 674-v1.diff, 674-v2.tgz, perf-674-v1.txt, perf-trunk-2f3d2c0e4845faf62e33c191d152cb1b3fa62806.txt Various tickets exist due to limitations in the SSTable file format, including #16, #47 and #328. Attached is a proposed design/implementation of a new file format for SSTables that addresses a few of these limitations. This v2 implementation is not ready for serious use: see comments for remaining issues. It is roughly the format described here: http://wiki.apache.org/cassandra/FileFormatDesignDoc -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-2319) Promote row index
[ https://issues.apache.org/jira/browse/CASSANDRA-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-2319: - Fix Version/s: (was: 0.8) 1.0 Promote row index - Key: CASSANDRA-2319 URL: https://issues.apache.org/jira/browse/CASSANDRA-2319 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Stu Hood Assignee: Stu Hood Labels: index, timeseries Fix For: 1.0 The row index contains entries for configurably sized blocks of a wide row. For a row of appreciable size, the row index ends up directing the third seek (1. index, 2. row index, 3. content) to nearby the first column of a scan. Since the row index is always used for wide rows, and since it contains information that tells us whether or not the 3rd seek is necessary (the column range or name we are trying to slice may not exist in a given sstable), promoting the row index into the sstable index would allow us to drop the maximum number of seeks for wide rows back to 2, and, more importantly, would allow sstables to be eliminated using only the index. An example usecase that benefits greatly from this change is time series data in wide rows, where data is appended to the beginning or end of the row. Our existing compaction strategy gets lucky and clusters the oldest data in the oldest sstables: for queries to recently appended data, we would be able to eliminate wide rows using only the sstable index, rather than needing to seek into the data file to determine that it isn't interesting. For narrow rows, this change would have no effect, as they will not reach the threshold for indexing anyway. A first cut design for this change would look very similar to the file format design proposed on #674: http://wiki.apache.org/cassandra/FileFormatDesignDoc: row keys clustered, column names clustered, and offsets clustered and delta encoded. -- This message is automatically generated by JIRA. 
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-1827) Batching across stages
[ https://issues.apache.org/jira/browse/CASSANDRA-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-1827: - Fix Version/s: (was: 0.8) 1.0 Batching across stages -- Key: CASSANDRA-1827 URL: https://issues.apache.org/jira/browse/CASSANDRA-1827 Project: Cassandra Issue Type: Improvement Reporter: Chris Goffinet Fix For: 1.0 We might be able to get some improvement if we start batching tasks for every stage. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-1601) Refactor index definitions
[ https://issues.apache.org/jira/browse/CASSANDRA-1601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-1601: - Fix Version/s: (was: 0.8) 1.0 Refactor index definitions -- Key: CASSANDRA-1601 URL: https://issues.apache.org/jira/browse/CASSANDRA-1601 Project: Cassandra Issue Type: Improvement Components: API Reporter: Stu Hood Fix For: 1.0 h3. Overview There are a few considerations for defining secondary indexes and row validation that I don't think have been brought up yet. While the interface is still malleable pre 0.7.0, we should attempt to make changes that allow for forwards compatibility of index/validator schemas. This is an umbrella ticket for suggesting/debating the changes: other tickets should be opened for quick improvements that can be made before 0.7.0. h3. Index output types The output (queryable) data from an indexing operation is what actually goes in the index. For a particular row, the output can be either _single-valued_, _multi-valued_ or _compound_: * Single-valued ** Implemented in trunk (special case of multi-valued) * Multi-valued ** Multiple index values _of the same type_ can match a single row ** Row probably contains a list/set (perhaps in a supercolumn) * Compound ** Multiple base properties concatenated as one index entry ** Different validators/comparators for each component ** (Given the simplicity of performing boolean operations on 1472 indexes, compound local indexes are unlikely to ever be worthwhile, but compound distributed indexes will be: see comments on CASSANDRA-1599) h3. Index input types The other end of indexing is selection of values from a row to be indexed. 
Selection can correspond directly to our current {{db.filter.*}} implementations, and may be best implemented by specifying the validator/index using the same Thrift objects you would use for a similar query: * Name selection ** Implemented in trunk, but should probably just be a special case of list selection below ** Corresponds to db.filter.NamesQueryFilter of size 1 * List selection ** Should specify a list of columns of which all values must be of the same type, as defined by the Validator ** Corresponds to db.filter.NamesQueryFilter * Range (prefix?) selection ** Subsets of a row may be interesting for indexing ** Range corresponds to db.filter.SliceQueryFilter *** (A Prefix might actually be more useful for indexing, but is better implemented by indexing an arbitrarily nested row) ** Open question: might the ability to index only the 'top N values' from a row be useful? If so, then this selector should allow N to be specified like it would be for a slice h3. Supercolumns/arbitrary-nesting Another consideration is that we should be able to support indexing and validation of supercolumns (and hence, arbitrarily nested rows). Since the selection of columns to index is essentially the same as the selection of columns to return for a query, this can probably mirror (and suggest improvements to) our query API. h3. UDFs This is obviously still an open area, but user defined indexing functions are essentially a transform between the _input_ and _output_ (as defined above), which would normally have equal structures. Leaving room for UDFs in our index definitions makes sense, and will likely lead to a much more general and elegant design. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-808) Need a way to skip corrupted data in SSTables
[ https://issues.apache.org/jira/browse/CASSANDRA-808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-808: Fix Version/s: (was: 0.8) 1.0 Need a way to skip corrupted data in SSTables - Key: CASSANDRA-808 URL: https://issues.apache.org/jira/browse/CASSANDRA-808 Project: Cassandra Issue Type: Improvement Reporter: Stu Hood Priority: Minor Fix For: 1.0 The new SSTable format will allow for checksumming of the data file, but as it stands, we don't have a better way to handle the situation than throwing an Exception indicating that the data is unreadable. We might want to add an option (triggerable via a command line flag?) to Cassandra that will allow for skipping of corrupted keys/blocks in SSTables, to pretend they don't exist rather than throwing the Exception. An administrator could temporarily enable the option and trigger a compaction to perform a local repair of data, or they could leave it enabled constantly for hands-off recovery. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-2364) Record dynamic snitch latencies for counter writes
[ https://issues.apache.org/jira/browse/CASSANDRA-2364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-2364: - Fix Version/s: (was: 0.8) 1.0 Record dynamic snitch latencies for counter writes -- Key: CASSANDRA-2364 URL: https://issues.apache.org/jira/browse/CASSANDRA-2364 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Stu Hood Priority: Minor Labels: counters Fix For: 1.0 The counter code chooses a single replica to coordinate a write, meaning that it should be subject to dynamic snitch latencies like a read would be. This already works when there are reads going on, because the dynamic snitch read latencies are used to pick a node to coordinate, but when there are no reads going on (such as during a backfill) the latencies do not adjust. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-809) Full disk can result in being marked down
[ https://issues.apache.org/jira/browse/CASSANDRA-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-809: Fix Version/s: (was: 0.8) 1.0 Full disk can result in being marked down - Key: CASSANDRA-809 URL: https://issues.apache.org/jira/browse/CASSANDRA-809 Project: Cassandra Issue Type: Bug Reporter: Ryan King Priority: Minor Fix For: 1.0 We had a node fill up the disk under one of two data directories. The result was that the node stopped making progress. The problem appears to be this (I'll update with more details as we find them): When new tasks are put onto most queues in Cassandra, if there isn't a thread in the pool to handle the task immediately, the task is run in the caller's thread (org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor:69 sets the caller-runs policy). The queue in question here is the queue that manages flushes, which is enqueued to from various places in our code (and therefore likely from multiple threads). Assuming that the full disk meant that no threads doing flushing could make progress (it appears that way), eventually any thread that calls the flush code would become stalled. Assuming our analysis is right (and we're still looking into it) we need to make a change. Here's a proposal so far: SHORT TERM: * change the ThreadPoolExecutor policy to not be caller-runs. This will let other threads make progress in the event that one pool is stalled LONG TERM * It appears that there are n threads for n data directories that we flush to, but they're not dedicated to a data directory. We should have a thread per data directory and have that thread dedicated to that directory * Perhaps we could use the failure detector on disks? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
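The short-term proposal can be sketched with the standard java.util.concurrent API. This is a hypothetical helper, not Cassandra's DebuggableThreadPoolExecutor: the point is simply a rejection policy other than CallerRunsPolicy, so a stalled flush pool cannot stall the threads that enqueue flushes.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch: a pool whose saturation policy is NOT caller-runs, per the
// short-term proposal above (helper name is made up for illustration).
public final class FlushPools {
    static ThreadPoolExecutor nonCallerRuns(int threads) {
        return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(),            // unbounded queue: producers never block
                new ThreadPoolExecutor.AbortPolicy());  // reject instead of running in the caller
    }
}
```

Whether to reject, block, or queue unboundedly on saturation is a separate design choice; the essential change is that the submitting thread is never conscripted into running a flush that cannot make progress.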
[jira] [Updated] (CASSANDRA-1610) Pluggable Compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-1610: - Fix Version/s: (was: 0.8) 1.0 Pluggable Compaction Key: CASSANDRA-1610 URL: https://issues.apache.org/jira/browse/CASSANDRA-1610 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Chris Goffinet Priority: Minor Fix For: 1.0 In CASSANDRA-1608, I proposed some changes on how compaction works. I think it also makes sense to allow the ability to have pluggable compaction per CF. There could be many types of workloads where this makes sense. One example we had at Digg was to completely throw away certain SSTables after N days. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2156) Compaction Throttling
[ https://issues.apache.org/jira/browse/CASSANDRA-2156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13014298#comment-13014298 ] Ryan King commented on CASSANDRA-2156: -- This has been a big improvement for us in production. It'd be nice to get more eyes on it for 0.8. Compaction Throttling - Key: CASSANDRA-2156 URL: https://issues.apache.org/jira/browse/CASSANDRA-2156 Project: Cassandra Issue Type: New Feature Reporter: Stu Hood Fix For: 0.8 Attachments: 0005-Throttle-total-compaction-to-a-configurable-throughput.txt, for-0.6-0001-Throttle-compaction-to-a-fixed-throughput.txt, for-0.6-0002-Make-compaction-throttling-configurable.txt Compaction is currently relatively bursty: we compact as fast as we can, and then we wait for the next compaction to be possible (hurry up and wait). Instead, to properly amortize compaction, you'd like to compact exactly as fast as you need to keep the sstable count under control. For every new level of compaction, you need to increase the rate that you compact at: a rule of thumb that we're testing on our clusters is to determine the maximum number of buckets a node can support (aka, if the 15th bucket holds 750 GB, we're not going to have more than 15 buckets), and then multiply the flush throughput by the number of buckets to get a minimum compaction throughput to maintain your sstable count. Full explanation: for a min compaction threshold of {{T}}, the bucket at level {{N}} can contain {{SsubN = T^N}} 'units' (unit == memtable's worth of data on disk). Every time a new unit is added, it has a {{1/SsubN}} chance of causing the bucket at level N to fill. If the bucket at level N fills, it causes {{SsubN}} units to be compacted. So, for each active level in your system you have {{SsubN * 1 / SsubN}}, or {{1}} amortized unit to compact any time a new unit is added. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
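The rule of thumb above amounts to one line of arithmetic: one amortized unit of compaction per active level per flushed unit, so the minimum sustained compaction throughput is roughly the flush throughput times the number of active buckets. A back-of-envelope sketch (all numbers made up for illustration, not from the ticket):

```java
// Rule-of-thumb arithmetic from the ticket: amortized compaction rate.
public final class CompactionMath {
    static double minCompactionThroughputMBps(int activeBuckets, double flushMBps) {
        // each flushed unit implies ~1 amortized compacted unit per active level
        return activeBuckets * flushMBps;
    }
}
```

For example, 15 active buckets and a 5 MB/s flush throughput imply sustaining roughly 75 MB/s of compaction to hold the sstable count steady.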
[jira] [Updated] (CASSANDRA-2281) keep a count of errors
[ https://issues.apache.org/jira/browse/CASSANDRA-2281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-2281: - Attachment: patch update to trunk keep a count of errors -- Key: CASSANDRA-2281 URL: https://issues.apache.org/jira/browse/CASSANDRA-2281 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Ryan King Assignee: Ryan King Priority: Minor Fix For: 0.7.5 Attachments: patch, textmate stdin Vrj9Xa.txt I have a patch that keeps a counter (exposed via JMX) of errors. This is quite useful for operators to keep track of the quality of cassandra without having to tail and parse logs across a cluster. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (CASSANDRA-2281) keep a count of errors
[ https://issues.apache.org/jira/browse/CASSANDRA-2281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-2281: - Attachment: (was: textmate stdin Vrj9Xa.txt) keep a count of errors -- Key: CASSANDRA-2281 URL: https://issues.apache.org/jira/browse/CASSANDRA-2281 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Ryan King Assignee: Ryan King Priority: Minor Fix For: 0.7.4 Attachments: textmate stdin Vrj9Xa.txt I have patch that keeps a counter (exposed via jmx) of errors. This is quite useful for operators to keep track of the quality of cassandra without having to tail and parse logs across a cluster. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (CASSANDRA-2281) keep a count of errors
[ https://issues.apache.org/jira/browse/CASSANDRA-2281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-2281: - Attachment: textmate stdin Vrj9Xa.txt Fixes a minor bug caught by chrisg. keep a count of errors -- Key: CASSANDRA-2281 URL: https://issues.apache.org/jira/browse/CASSANDRA-2281 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Ryan King Assignee: Ryan King Priority: Minor Fix For: 0.7.4 Attachments: textmate stdin Vrj9Xa.txt I have patch that keeps a counter (exposed via jmx) of errors. This is quite useful for operators to keep track of the quality of cassandra without having to tail and parse logs across a cluster. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (CASSANDRA-2281) keep a count of errors
[ https://issues.apache.org/jira/browse/CASSANDRA-2281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-2281: - Attachment: (was: textmate stdin 5y2H5u.txt) keep a count of errors -- Key: CASSANDRA-2281 URL: https://issues.apache.org/jira/browse/CASSANDRA-2281 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Ryan King Assignee: Ryan King Priority: Minor Fix For: 0.7.4 Attachments: textmate stdin Vrj9Xa.txt I have patch that keeps a counter (exposed via jmx) of errors. This is quite useful for operators to keep track of the quality of cassandra without having to tail and parse logs across a cluster. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (CASSANDRA-2281) keep a count of errors
[ https://issues.apache.org/jira/browse/CASSANDRA-2281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-2281: - Attachment: textmate stdin Vrj9Xa.txt keep a count of errors -- Key: CASSANDRA-2281 URL: https://issues.apache.org/jira/browse/CASSANDRA-2281 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Ryan King Assignee: Ryan King Priority: Minor Fix For: 0.7.4 Attachments: textmate stdin Vrj9Xa.txt I have patch that keeps a counter (exposed via jmx) of errors. This is quite useful for operators to keep track of the quality of cassandra without having to tail and parse logs across a cluster. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Created: (CASSANDRA-2281) keep a count of errors
keep a count of errors -- Key: CASSANDRA-2281 URL: https://issues.apache.org/jira/browse/CASSANDRA-2281 Project: Cassandra Issue Type: Improvement Reporter: Ryan King Assignee: Ryan King I have patch that keeps a counter (exposed via jmx) of errors. This is quite useful for operators to keep track of the quality of cassandra without having to tail and parse logs across a cluster. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (CASSANDRA-2281) keep a count of errors
[ https://issues.apache.org/jira/browse/CASSANDRA-2281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-2281: - Attachment: textmate stdin c4Hh5i.txt Patch to keep track of errors and expose via JMX. I've probably missed a few places where we need to track things. keep a count of errors -- Key: CASSANDRA-2281 URL: https://issues.apache.org/jira/browse/CASSANDRA-2281 Project: Cassandra Issue Type: Improvement Reporter: Ryan King Assignee: Ryan King Attachments: textmate stdin c4Hh5i.txt I have patch that keeps a counter (exposed via jmx) of errors. This is quite useful for operators to keep track of the quality of cassandra without having to tail and parse logs across a cluster. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (CASSANDRA-2281) keep a count of errors
[ https://issues.apache.org/jira/browse/CASSANDRA-2281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-2281: - Attachment: textmate stdin 5y2H5u.txt Fix some code style issues in ErrorReporter. keep a count of errors -- Key: CASSANDRA-2281 URL: https://issues.apache.org/jira/browse/CASSANDRA-2281 Project: Cassandra Issue Type: Improvement Reporter: Ryan King Assignee: Ryan King Attachments: textmate stdin 5y2H5u.txt, textmate stdin c4Hh5i.txt I have a patch that keeps a counter (exposed via JMX) of errors. This is quite useful for operators to keep track of the quality of Cassandra without having to tail and parse logs across a cluster. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (CASSANDRA-2281) keep a count of errors
[ https://issues.apache.org/jira/browse/CASSANDRA-2281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-2281: - Attachment: (was: textmate stdin 5y2H5u.txt) keep a count of errors -- Key: CASSANDRA-2281 URL: https://issues.apache.org/jira/browse/CASSANDRA-2281 Project: Cassandra Issue Type: Improvement Reporter: Ryan King Assignee: Ryan King Attachments: textmate stdin 5y2H5u.txt I have a patch that keeps a counter (exposed via JMX) of errors. This is quite useful for operators to keep track of the quality of Cassandra without having to tail and parse logs across a cluster. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (CASSANDRA-2281) keep a count of errors
[ https://issues.apache.org/jira/browse/CASSANDRA-2281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-2281: - Attachment: textmate stdin 5y2H5u.txt Forgot to click the license on the last one. keep a count of errors -- Key: CASSANDRA-2281 URL: https://issues.apache.org/jira/browse/CASSANDRA-2281 Project: Cassandra Issue Type: Improvement Reporter: Ryan King Assignee: Ryan King Attachments: textmate stdin 5y2H5u.txt I have a patch that keeps a counter (exposed via JMX) of errors. This is quite useful for operators to keep track of the quality of Cassandra without having to tail and parse logs across a cluster. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (CASSANDRA-2281) keep a count of errors
[ https://issues.apache.org/jira/browse/CASSANDRA-2281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13003620#comment-13003620 ] Ryan King commented on CASSANDRA-2281: -- I think it might be better to go in the opposite direction and have ErrorReporter do the logging. keep a count of errors -- Key: CASSANDRA-2281 URL: https://issues.apache.org/jira/browse/CASSANDRA-2281 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Ryan King Assignee: Ryan King Priority: Minor Fix For: 0.7.4 Attachments: textmate stdin 5y2H5u.txt I have a patch that keeps a counter (exposed via JMX) of errors. This is quite useful for operators to keep track of the quality of Cassandra without having to tail and parse logs across a cluster. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (CASSANDRA-2281) keep a count of errors
[ https://issues.apache.org/jira/browse/CASSANDRA-2281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13003715#comment-13003715 ] Ryan King commented on CASSANDRA-2281: -- That's a good point. I'm not sure how I feel about conflating logging and statistics gathering. keep a count of errors -- Key: CASSANDRA-2281 URL: https://issues.apache.org/jira/browse/CASSANDRA-2281 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Ryan King Assignee: Ryan King Priority: Minor Fix For: 0.7.4 Attachments: textmate stdin 5y2H5u.txt I have a patch that keeps a counter (exposed via JMX) of errors. This is quite useful for operators to keep track of the quality of Cassandra without having to tail and parse logs across a cluster. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
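The attached patch isn't reproduced in this thread, but the shape under discussion — a shared counter that error paths bump, exposed to operators through JMX — can be sketched as below. The class, attribute, and ObjectName are illustrative, not the ones in the patch.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.ObjectName;

public interface ErrorReporterMBean {
    long getErrorCount();
}

// Minimal error counter: error-handling paths call reportError(), and
// operators read ErrorCount over JMX instead of tailing and parsing logs.
class ErrorReporter implements ErrorReporterMBean {
    private final AtomicLong errors = new AtomicLong();

    public void reportError() {
        errors.incrementAndGet();
    }

    public long getErrorCount() {
        return errors.get();
    }

    // Registers the bean with the platform MBean server; the ObjectName
    // here is a placeholder, not the name used by the patch.
    void register() throws Exception {
        ManagementFactory.getPlatformMBeanServer().registerMBean(
            this, new ObjectName("org.apache.cassandra.example:type=ErrorReporter"));
    }
}
```

The standard-MBean naming convention (interface `ErrorReporterMBean` for class `ErrorReporter`) is what makes `registerMBean` accept the object without any extra metadata.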
[jira] Commented: (CASSANDRA-2229) Back off compaction after failure
[ https://issues.apache.org/jira/browse/CASSANDRA-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12998511#comment-12998511 ] Ryan King commented on CASSANDRA-2229: -- It might be worth keeping track of the specific SSTables involved in the failed compaction and skipping those. It's possible we could make some progress on compaction in scenarios where a single sstable is corrupt. Back off compaction after failure - Key: CASSANDRA-2229 URL: https://issues.apache.org/jira/browse/CASSANDRA-2229 Project: Cassandra Issue Type: Improvement Components: Core Affects Versions: 0.7.2 Reporter: Nick Bailey Priority: Minor Fix For: 0.8 When compaction fails (for one of the multitude of reasons it can fail, generally some sort of 'corruption'), we should back off on attempting to compact that column family. Continuously trying to compact it will just waste resources. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (CASSANDRA-1657) support in-memory column families
[ https://issues.apache.org/jira/browse/CASSANDRA-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12989473#comment-12989473 ] Ryan King commented on CASSANDRA-1657: -- For narrow SSTables, shouldn't the row cache be enough for this? support in-memory column families - Key: CASSANDRA-1657 URL: https://issues.apache.org/jira/browse/CASSANDRA-1657 Project: Cassandra Issue Type: Improvement Reporter: Peter Schuller Priority: Minor Some workloads are such that you absolutely depend on column families being in-memory for performance, yet you most definitely want all the things that Cassandra offers in terms of replication, consistency, durability etc. In order to semi-deterministically ensure acceptable performance for such data, Cassandra could support in-memory column families. Such an in-memory column family would imply that mlock() be used on sstables for this column family. On start-up and on compaction completion, they could be mmap():ed with MAP_POPULATE (Linux specific) or else just mmap():ed + mlock():ed in such a way as to otherwise guarantee it is in-memory (such as userland traversal of the entire file). -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (CASSANDRA-2057) overflow in NodeCmd
[ https://issues.apache.org/jira/browse/CASSANDRA-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-2057: - Attachment: nodetool_overflow.patch overflow in NodeCmd --- Key: CASSANDRA-2057 URL: https://issues.apache.org/jira/browse/CASSANDRA-2057 Project: Cassandra Issue Type: Bug Components: Tools Reporter: Ryan King Assignee: Ryan King Priority: Minor Attachments: nodetool_overflow.patch We aggregate the long read/write counts across CFs into an int. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
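The bug is the classic silent narrowing: each column family's read/write count is a long, but the aggregate was accumulated in an int. A small illustration of the failure mode (the method names are made up for the example; the real code lives in NodeCmd):

```java
// Per-CF counts are longs; accumulating them into an int wraps silently
// once the total passes Integer.MAX_VALUE (2^31 - 1).
class CountAggregation {
    static int aggregateAsInt(long[] countsPerCf) {
        int total = 0;
        for (long c : countsPerCf)
            total += c;    // compound assignment hides the narrowing cast
        return total;
    }

    // The fix: keep the accumulator as wide as the inputs.
    static long aggregateAsLong(long[] countsPerCf) {
        long total = 0;
        for (long c : countsPerCf)
            total += c;
        return total;
    }
}
```

Note that `total += c` compiles without complaint even when `total` is an int and `c` is a long, because compound assignment inserts an implicit cast — which is why the overflow is easy to miss in review.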
[jira] Commented: (CASSANDRA-2045) Simplify HH to decrease read load when nodes come back
[ https://issues.apache.org/jira/browse/CASSANDRA-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12986185#action_12986185 ] Ryan King commented on CASSANDRA-2045: -- I think the two approaches are suitable for different kinds of data models. The point approach is almost certainly better for narrow rows, while worse for large, dynamic rows. Simplify HH to decrease read load when nodes come back -- Key: CASSANDRA-2045 URL: https://issues.apache.org/jira/browse/CASSANDRA-2045 Project: Cassandra Issue Type: Improvement Reporter: Chris Goffinet Fix For: 0.7.2 Currently when HH is enabled, hints are stored, and when a node comes back, we begin sending that node data. We do a lookup on the local node for the row to send. To help reduce read load (if a node is offline for a long period of time) we should store the data we want to forward to the node locally instead. We wouldn't have to do any lookups, just take the byte[] and send it to the destination. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (CASSANDRA-1777) The describe_host API method is misleading in that it returns the interface associated with gossip traffic
[ https://issues.apache.org/jira/browse/CASSANDRA-1777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12984837#action_12984837 ] Ryan King commented on CASSANDRA-1777: -- I don't care about making it routing-aware. I just want to do discovery. The describe_host API method is misleading in that it returns the interface associated with gossip traffic -- Key: CASSANDRA-1777 URL: https://issues.apache.org/jira/browse/CASSANDRA-1777 Project: Cassandra Issue Type: Bug Reporter: Nate McCall Assignee: Brandon Williams Fix For: 0.8 Attachments: 1777.txt Original Estimate: 16h Remaining Estimate: 16h If the hardware is configured to use separate interfaces for thrift and gossip, the gossip interface will be returned, given the results come out of the ReplicationStrategy eventually. I understand the approach, but given this is on the API, it is effectively worthless in situations of host auto-discovery via describe_ring from a client. I actually see this as the primary use case of this method - why else would I care about the gossip iface from the client perspective? Its current form should be relegated to JMX only. At the same time, we should add port information as well. describe_splits probably has similar issues. I see the potential cart-before-horse issues here and that this will probably be non-trivial to fix, but I think "give me a set of all the hosts to which I can talk" is pretty important from a client perspective. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (CASSANDRA-1777) The describe_host API method is misleading in that it returns the interface associated with gossip traffic
[ https://issues.apache.org/jira/browse/CASSANDRA-1777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12984466#action_12984466 ] Ryan King commented on CASSANDRA-1777: -- Unless you have a DNS server that can understand Cassandra membership, RRDNS is actually a rough way to do this. I'd prefer to supply something for clients that works correctly. The describe_host API method is misleading in that it returns the interface associated with gossip traffic -- Key: CASSANDRA-1777 URL: https://issues.apache.org/jira/browse/CASSANDRA-1777 Project: Cassandra Issue Type: Bug Reporter: Nate McCall Assignee: Brandon Williams Fix For: 0.8 Attachments: 1777.txt Original Estimate: 16h Remaining Estimate: 16h If the hardware is configured to use separate interfaces for thrift and gossip, the gossip interface will be returned, given the results come out of the ReplicationStrategy eventually. I understand the approach, but given this is on the API, it is effectively worthless in situations of host auto-discovery via describe_ring from a client. I actually see this as the primary use case of this method - why else would I care about the gossip iface from the client perspective? Its current form should be relegated to JMX only. At the same time, we should add port information as well. describe_splits probably has similar issues. I see the potential cart-before-horse issues here and that this will probably be non-trivial to fix, but I think "give me a set of all the hosts to which I can talk" is pretty important from a client perspective. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (CASSANDRA-1932) NegativeArraySizeException at org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSerializer.java:28)
[ https://issues.apache.org/jira/browse/CASSANDRA-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King resolved CASSANDRA-1932. -- Resolution: Cannot Reproduce NegativeArraySizeException at org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSerializer.java:28) - Key: CASSANDRA-1932 URL: https://issues.apache.org/jira/browse/CASSANDRA-1932 Project: Cassandra Issue Type: Bug Affects Versions: 0.7.1 Reporter: Karl Mueller Assignee: Ryan King Fix For: 0.7.1 ERROR [ReadStage:30017] 2011-01-03 19:28:45,406 DebuggableThreadPoolExecutor.java (line 103) Error in ThreadPoolExecutor java.lang.NegativeArraySizeException at org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSerializer.java:28) at org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSerializer.java:9) at org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter(IndexHelper.java:104) at org.apache.cassandra.db.columniterator.SSTableNamesIterator.read(SSTableNamesIterator.java:106) at org.apache.cassandra.db.columniterator.SSTableNamesIterator.init(SSTableNamesIterator.java:71) at org.apache.cassandra.db.filter.NamesQueryFilter.getSSTableColumnIterator(NamesQueryFilter.java:59) at org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:80) at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1219) at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:) at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1081) at org.apache.cassandra.db.Table.getRow(Table.java:384) at org.apache.cassandra.db.SliceByNamesReadCommand.getRow(SliceByNamesReadCommand.java:60) at org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:68) at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:63) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:636) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (CASSANDRA-1932) NegativeArraySizeException at org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSerializer.java:28)
[ https://issues.apache.org/jira/browse/CASSANDRA-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12983509#action_12983509 ] Ryan King commented on CASSANDRA-1932: -- I'm prepared to consider this can't reproduce. I think it was user error. The fix to refuse opening future sstables should make that error clearer in the future. NegativeArraySizeException at org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSerializer.java:28) - Key: CASSANDRA-1932 URL: https://issues.apache.org/jira/browse/CASSANDRA-1932 Project: Cassandra Issue Type: Bug Affects Versions: 0.7.1 Reporter: Karl Mueller Assignee: Ryan King Fix For: 0.7.1 ERROR [ReadStage:30017] 2011-01-03 19:28:45,406 DebuggableThreadPoolExecutor.java (line 103) Error in ThreadPoolExecutor java.lang.NegativeArraySizeException at org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSerializer.java:28) at org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSerializer.java:9) at org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter(IndexHelper.java:104) at org.apache.cassandra.db.columniterator.SSTableNamesIterator.read(SSTableNamesIterator.java:106) at org.apache.cassandra.db.columniterator.SSTableNamesIterator.init(SSTableNamesIterator.java:71) at org.apache.cassandra.db.filter.NamesQueryFilter.getSSTableColumnIterator(NamesQueryFilter.java:59) at org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:80) at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1219) at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:) at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1081) at org.apache.cassandra.db.Table.getRow(Table.java:384) at org.apache.cassandra.db.SliceByNamesReadCommand.getRow(SliceByNamesReadCommand.java:60) at 
org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:68) at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:63) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:636) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (CASSANDRA-1935) Refuse to open SSTables from the future
[ https://issues.apache.org/jira/browse/CASSANDRA-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981904#action_12981904 ] Ryan King commented on CASSANDRA-1935: -- Not that I can think of. Refuse to open SSTables from the future --- Key: CASSANDRA-1935 URL: https://issues.apache.org/jira/browse/CASSANDRA-1935 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Stu Hood Priority: Minor Fix For: 0.8 Attachments: CASSANDRA-1935.patch If somebody has rolled back to a previous version of Cassandra that is unable to read an SSTable written by a future version correctly (indicated by a version change), failing fast is safer than accidentally performing a compaction that rewrites incorrect data and leaves you in an odd state. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (CASSANDRA-1983) Make sstable filenames contain a UUID instead of increasing integer
[ https://issues.apache.org/jira/browse/CASSANDRA-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981507#action_12981507 ] Ryan King commented on CASSANDRA-1983: -- Alternatively, since we'll need a host-uuid mapping for counters, we can put that uuid in the filename along with a serial integer (make it a long and we should be ok, right?) Make sstable filenames contain a UUID instead of increasing integer --- Key: CASSANDRA-1983 URL: https://issues.apache.org/jira/browse/CASSANDRA-1983 Project: Cassandra Issue Type: Improvement Components: Core Affects Versions: 0.7.0 Reporter: David King Priority: Minor sstable filenames look like CFName-1569-Index.db, containing an integer for uniqueness. This makes it possible (however unlikely) that the integer could overflow, which could be a problem. It also makes it difficult to collapse multiple nodes into a single one with rsync. I do this occasionally for testing: I'll copy our 20 node cluster into only 3 nodes by copying all of the data files and running cleanup; at present this requires a manual step of uniquifying the overlapping sstable names. If instead of an incrementing integer, it would be handy if these contained a UUID or some such that guarantees uniqueness across the cluster. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
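Either variant — a random UUID per sstable, or a stable host UUID plus a per-host serial — is cheap to generate; a sketch with illustrative names (neither matches any committed naming scheme):

```java
import java.util.UUID;

class SSTableNaming {
    // Option from the ticket: a random UUID guarantees cluster-wide
    // uniqueness with no coordination between nodes.
    static String uuidName(String cfName, String component) {
        return cfName + "-" + UUID.randomUUID() + "-" + component + ".db";
    }

    // Option from the comment: a stable per-host UUID plus a long serial,
    // which keeps a per-host ordering component while still avoiding
    // collisions when files from several nodes are rsynced together.
    static String hostSerialName(String cfName, UUID hostId, long serial, String component) {
        return cfName + "-" + hostId + "-" + serial + "-" + component + ".db";
    }
}
```

The host-UUID-plus-serial form is what makes the "collapse 20 nodes into 3 with rsync" workflow safe: two nodes can both be at serial 1569 without their filenames colliding.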
[jira] Updated: (CASSANDRA-1935) Refuse to open SSTables from the future
[ https://issues.apache.org/jira/browse/CASSANDRA-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-1935: - Attachment: CASSANDRA-1935.patch Here's the simplest patch that could work. I'm a bit afraid that this may cause problems in scenarios other than startup. Also, I'd appreciate feedback on a better exception to raise. Refuse to open SSTables from the future --- Key: CASSANDRA-1935 URL: https://issues.apache.org/jira/browse/CASSANDRA-1935 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Stu Hood Priority: Minor Fix For: 0.8 Attachments: CASSANDRA-1935.patch If somebody has rolled back to a previous version of Cassandra that is unable to read an SSTable written by a future version correctly (indicated by a version change), failing fast is safer than accidentally performing a compaction that rewrites incorrect data and leaves you in an odd state. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
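The attached patch isn't shown in this thread, but the fail-fast idea reduces to one comparison against the newest on-disk version string the build knows how to read. A hedged sketch — the class name, version tag, and exception choice are all illustrative:

```java
// Refuse to open an sstable whose version string sorts after the newest
// format this build can read; failing at open time is safer than letting
// compaction rewrite data it misreads.
class SSTableVersionGuard {
    static final String CURRENT_VERSION = "f";   // illustrative version tag

    static void checkReadable(String onDiskVersion) {
        if (onDiskVersion.compareTo(CURRENT_VERSION) > 0)
            throw new IllegalStateException(
                "SSTable version " + onDiskVersion + " is newer than supported version "
                + CURRENT_VERSION + "; refusing to open");
    }
}
```

Calling this from the sstable-open path covers the rollback-at-startup case; as the comments below note, streaming would need the same check at its own entry point.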
[jira] Commented: (CASSANDRA-1935) Refuse to open SSTables from the future
[ https://issues.apache.org/jira/browse/CASSANDRA-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979673#action_12979673 ] Ryan King commented on CASSANDRA-1935: -- That seems like a somewhat bigger change. Perhaps we could tackle the startup situation now and open another ticket for making sure we don't try to stream incompatible sstables? Refuse to open SSTables from the future --- Key: CASSANDRA-1935 URL: https://issues.apache.org/jira/browse/CASSANDRA-1935 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Stu Hood Priority: Minor Fix For: 0.8 If somebody has rolled back to a previous version of Cassandra that is unable to read an SSTable written by a future version correctly (indicated by a version change), failing fast is safer than accidentally performing a compaction that rewrites incorrect data and leaves you in an odd state. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (CASSANDRA-1427) Optimize loadbalance/move for moves within the current range
[ https://issues.apache.org/jira/browse/CASSANDRA-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12978595#action_12978595 ] Ryan King commented on CASSANDRA-1427: -- I think we should generalize it to cover all cases. Optimize loadbalance/move for moves within the current range Key: CASSANDRA-1427 URL: https://issues.apache.org/jira/browse/CASSANDRA-1427 Project: Cassandra Issue Type: Sub-task Components: Core Affects Versions: 0.7 beta 1 Reporter: Nick Bailey Assignee: Brandon Williams Fix For: 0.8 Currently our move/loadbalance operations only implement case 2 of the Ruhl algorithm described at https://issues.apache.org/jira/browse/CASSANDRA-192#action_12713079. We should add functionality to optimize moves that take/give ranges to a node's direct neighbors. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (CASSANDRA-1935) Refuse to open SSTables from the future
[ https://issues.apache.org/jira/browse/CASSANDRA-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12978029#action_12978029 ] Ryan King commented on CASSANDRA-1935: -- It seems like we should probably abort in this case, but that might be a bit draconian. Refuse to open SSTables from the future --- Key: CASSANDRA-1935 URL: https://issues.apache.org/jira/browse/CASSANDRA-1935 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Stu Hood Priority: Minor Fix For: 0.8 If somebody has rolled back to a previous version of Cassandra that is unable to read an SSTable written by a future version correctly (indicated by a version change), failing fast is safer than accidentally performing a compaction that rewrites incorrect data and leaves you in an odd state. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (CASSANDRA-1935) Refuse to open SSTables from the future
[ https://issues.apache.org/jira/browse/CASSANDRA-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12978076#action_12978076 ] Ryan King commented on CASSANDRA-1935: -- What about scenarios outside startup, like streaming? Refuse to open SSTables from the future --- Key: CASSANDRA-1935 URL: https://issues.apache.org/jira/browse/CASSANDRA-1935 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Stu Hood Priority: Minor Fix For: 0.8 If somebody has rolled back to a previous version of Cassandra that is unable to read an SSTable written by a future version correctly (indicated by a version change), failing fast is safer than accidentally performing a compaction that rewrites incorrect data and leaves you in an odd state. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (CASSANDRA-1932) NegativeArraySizeException at org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSerializer.java:28)
[ https://issues.apache.org/jira/browse/CASSANDRA-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12977458#action_12977458 ] Ryan King commented on CASSANDRA-1932: -- What were the file names of the SSTables you set aside? NegativeArraySizeException at org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSerializer.java:28) - Key: CASSANDRA-1932 URL: https://issues.apache.org/jira/browse/CASSANDRA-1932 Project: Cassandra Issue Type: Bug Affects Versions: 0.7.1 Reporter: Karl Mueller Assignee: Ryan King Fix For: 0.7.1 ERROR [ReadStage:30017] 2011-01-03 19:28:45,406 DebuggableThreadPoolExecutor.java (line 103) Error in ThreadPoolExecutor java.lang.NegativeArraySizeException at org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSerializer.java:28) at org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSerializer.java:9) at org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter(IndexHelper.java:104) at org.apache.cassandra.db.columniterator.SSTableNamesIterator.read(SSTableNamesIterator.java:106) at org.apache.cassandra.db.columniterator.SSTableNamesIterator.init(SSTableNamesIterator.java:71) at org.apache.cassandra.db.filter.NamesQueryFilter.getSSTableColumnIterator(NamesQueryFilter.java:59) at org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:80) at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1219) at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:) at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1081) at org.apache.cassandra.db.Table.getRow(Table.java:384) at org.apache.cassandra.db.SliceByNamesReadCommand.getRow(SliceByNamesReadCommand.java:60) at org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:68) at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:63) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:636) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (CASSANDRA-1859) distributed test harness
[ https://issues.apache.org/jira/browse/CASSANDRA-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-1859: - Attachment: 0003-add-a-test-for-one-writes-and-all-reads.txt Add another test for writing with one and reading with all (the last of the strong consistency scenarios). distributed test harness Key: CASSANDRA-1859 URL: https://issues.apache.org/jira/browse/CASSANDRA-1859 Project: Cassandra Issue Type: Test Components: Tools Reporter: Kelvin Kakugawa Assignee: Kelvin Kakugawa Fix For: 0.8 Attachments: 0001-Add-distributed-ultra-long-running-tests-using-Whirr-j.txt, 0002-Pull-whirr-0.3.0-incubating-SNAPSHOT-155-from-Twitter-.txt, 0003-add-a-test-for-one-writes-and-all-reads.txt Distributed Test Harness - deploys a cluster on a cloud provider - runs tests targeted at the cluster - tears down the cluster -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (CASSANDRA-1015) Internal Messaging should be backwards compatible
[ https://issues.apache.org/jira/browse/CASSANDRA-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12976157#action_12976157 ] Ryan King commented on CASSANDRA-1015: -- I fear that in going our own way we'll end up replicating a lot of what's already been done in these frameworks. Additionally, we make it much harder to write code to comprehend the message in another language. I know this sounds like a YAGNI, but I've found it quite nice to be able to decode Thrift RPC interchanges that are captured via tcpdump. We have to rebuild a lot if we go our own way. Internal Messaging should be backwards compatible - Key: CASSANDRA-1015 URL: https://issues.apache.org/jira/browse/CASSANDRA-1015 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Ryan King Assignee: Gary Dusbabek Priority: Critical Fix For: 0.8 Currently, incompatible changes in the node-to-node communication prevent rolling restarts of clusters. In order to fix this we should: 1) use a framework that makes doing compatible changes easy 2) have a policy of only making compatible changes between versions n and n+1* * Running multiple versions should only be supported for small periods of time. Running clusters of mixed versions is not needed here. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (CASSANDRA-1072) Increment counters
[ https://issues.apache.org/jira/browse/CASSANDRA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12973335#action_12973335 ] Ryan King commented on CASSANDRA-1072: -- Committing to trunk seems like a reasonable approach. Hopefully we can successfully backport to 0.7 then. Increment counters -- Key: CASSANDRA-1072 URL: https://issues.apache.org/jira/browse/CASSANDRA-1072 Project: Cassandra Issue Type: Sub-task Components: Core Reporter: Johan Oskarsson Assignee: Kelvin Kakugawa Attachments: CASSANDRA-1072.121710.2.patch, increment_test.py, Partitionedcountersdesigndoc.pdf Break out the increment counters out of CASSANDRA-580. Classes are shared between the two features but without the plain version vector code the changeset becomes smaller and more manageable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (CASSANDRA-1083) Improvement to CompactionManger's submitMinorIfNeeded
[ https://issues.apache.org/jira/browse/CASSANDRA-1083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12970244#action_12970244 ] Ryan King commented on CASSANDRA-1083: -- I agree. I think this idea is mostly a dead end because it's attacking the problem from the wrong direction. Improvement to CompactionManger's submitMinorIfNeeded - Key: CASSANDRA-1083 URL: https://issues.apache.org/jira/browse/CASSANDRA-1083 Project: Cassandra Issue Type: Improvement Reporter: Ryan King Assignee: Tyler Hobbs Priority: Minor Fix For: 0.7.1 Attachments: 1083-configurable-compaction-thresholds.patch, 1083-sort.txt, compaction_simulation.rb, compaction_simulation.rb We've discovered that we are unable to tune compaction the way we want for our production cluster. I think the current algorithm doesn't do this as well as it could, since it doesn't sort the sstables by size before doing the bucketing, which means the tuning parameters have unpredictable results. I looked at CASSANDRA-792, but it seems like overkill. Here's an alternative proposal: config options: minimumCompactionThreshold maximumCompactionThreshold targetSSTableCount The first two would mean what they currently mean: the bounds on how many sstables to compact in one compaction operation. The 3rd is a target for how many SSTables you'd like to have. Pseudocode algorithm for determining whether or not to do a minor compaction: {noformat} if sstables.length + minimumCompactionThreshold - 1 > targetSSTableCount sort sstables from smallest to largest compact up to maximumCompactionThreshold of the smallest tables {noformat} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (CASSANDRA-1083) Improvement to CompactionManger's submitMinorIfNeeded
[ https://issues.apache.org/jira/browse/CASSANDRA-1083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969829#action_12969829 ] Ryan King commented on CASSANDRA-1083:
--------------------------------------

To be honest, I'm not sure this is the best approach anymore. I think the fundamental problem is that it's driven by the write traffic, not the read traffic.

Improvement to CompactionManger's submitMinorIfNeeded
-----------------------------------------------------

                 Key: CASSANDRA-1083
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1083
             Project: Cassandra
          Issue Type: Improvement
            Reporter: Ryan King
            Assignee: Tyler Hobbs
            Priority: Minor
             Fix For: 0.7.1
         Attachments: 1083-configurable-compaction-thresholds.patch, compaction_simulation.rb

We've discovered that we are unable to tune compaction the way we want for our production cluster. I think the current algorithm doesn't do this as well as it could, since it doesn't sort the sstables by size before doing the bucketing, which means the tuning parameters have unpredictable results. I looked at CASSANDRA-792, but it seems like overkill. Here's an alternative proposal.

Config options:
* minimumCompactionThreshold
* maximumCompactionThreshold
* targetSSTableCount

The first two would mean what they currently mean: the bounds on how many sstables to compact in one compaction operation. The third is a target for how many SSTables you'd like to have.

Pseudocode algorithm for determining whether or not to do a minor compaction:
{noformat}
if sstables.length + minimumCompactionThreshold - 1 > targetSSTableCount
  sort sstables from smallest to largest
  compact up to maximumCompactionThreshold of the smallest tables
{noformat}
[jira] Commented: (CASSANDRA-1555) Considerations for larger bloom filters
[ https://issues.apache.org/jira/browse/CASSANDRA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968861#action_12968861 ] Ryan King commented on CASSANDRA-1555:
--------------------------------------

Stu's last patch is incorporated (in spirit; I took a slightly different approach) in my latest.

Considerations for larger bloom filters
---------------------------------------

                 Key: CASSANDRA-1555
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1555
             Project: Cassandra
          Issue Type: Improvement
          Components: Core
            Reporter: Stu Hood
            Assignee: Ryan King
             Fix For: 0.8
         Attachments: 1555_v5.txt, addendum-to-1555.txt, cassandra-1555.tgz, CASSANDRA-1555v2.patch, CASSANDRA-1555v3.patch.gz, CASSANDRA-1555v4.patch.gz

To (optimally) support SSTables larger than 143 million keys, we need to support bloom filters larger than 2^31 bits, which java.util.BitSet can't handle directly. A few options:
* Switch to a BitSet class which supports 2^31 * 64 bits (Lucene's OpenBitSet)
* Partition the java.util.BitSet behind our current BloomFilter
** Straightforward bit partitioning: bit N is in bitset N // 2^31
** Separate, equally sized complete bloom filters for member ranges, which can be used independently or OR'd together under memory pressure.

All of these options require new approaches to serialization.
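The straightforward bit-partitioning option above can be sketched in a few lines. This is an illustrative Python model (the class name and dict-of-partitions layout are assumptions); the real implementation would wrap fixed-size java.util.BitSet instances, each staying under the 2^31-bit limit.

```python
class PartitionedBitSet:
    """Sketch of bit partitioning: bit N lives in partition
    N // PARTITION_BITS at offset N % PARTITION_BITS. Python ints
    stand in for the individual java.util.BitSet partitions."""
    PARTITION_BITS = 2 ** 31  # capacity of one java.util.BitSet

    def __init__(self):
        self.partitions = {}  # partition index -> int used as a bit vector

    def set(self, n):
        idx, off = divmod(n, self.PARTITION_BITS)
        self.partitions[idx] = self.partitions.get(idx, 0) | (1 << off)

    def get(self, n):
        idx, off = divmod(n, self.PARTITION_BITS)
        return bool((self.partitions.get(idx, 0) >> off) & 1)
```

Setting a bit past 2^31 simply lands in the second partition, so the filter's addressable range grows with the number of partitions while each backing bitset stays within its native limit.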
[jira] Updated: (CASSANDRA-1555) Considerations for larger bloom filters
[ https://issues.apache.org/jira/browse/CASSANDRA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan King updated CASSANDRA-1555:
---------------------------------

    Attachment: CASSANDRA-1555v3.patch.gz

New patch with several changes based on Stu's feedback:
* renamed BloomFilter to LegacyBloomFilter and BigBloomFilter to BloomFilter
* moved maxBucketsPerElement to BloomCalculations
* removed emptybuckets
* cleaned up formatting in SSTableReader and BigBloomFilter

Finally, I changed the serialization to read and write the long[] directly, which saves a lot of space for small filters (the column filter for a 10-item row goes from 120 bytes to 16).

Considerations for larger bloom filters
---------------------------------------

                 Key: CASSANDRA-1555
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1555
             Project: Cassandra
          Issue Type: Improvement
          Components: Core
            Reporter: Stu Hood
            Assignee: Ryan King
             Fix For: 0.8
         Attachments: cassandra-1555.tgz, CASSANDRA-1555v2.patch, CASSANDRA-1555v3.patch.gz

To (optimally) support SSTables larger than 143 million keys, we need to support bloom filters larger than 2^31 bits, which java.util.BitSet can't handle directly. A few options:
* Switch to a BitSet class which supports 2^31 * 64 bits (Lucene's OpenBitSet)
* Partition the java.util.BitSet behind our current BloomFilter
** Straightforward bit partitioning: bit N is in bitset N // 2^31
** Separate, equally sized complete bloom filters for member ranges, which can be used independently or OR'd together under memory pressure.

All of these options require new approaches to serialization.
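The long[]-based serialization the comment describes can be sketched as follows. The framing here (a 4-byte word count followed by big-endian 64-bit words) is an assumption for illustration, not the patch's exact wire format; the point is that writing the raw words avoids the per-object overhead of serializing a java.util.BitSet.

```python
import struct

def serialize_filter_words(words):
    """Write a bloom filter's backing long[] directly: a 4-byte word
    count followed by each 64-bit word (framing is an assumption)."""
    out = struct.pack(">i", len(words))
    for w in words:
        out += struct.pack(">q", w)  # one signed 64-bit word
    return out

def deserialize_filter_words(buf):
    (n,) = struct.unpack_from(">i", buf, 0)
    return list(struct.unpack_from(">" + str(n) + "q", buf, 4))
```

Under this framing, a two-word (128-bit) filter serializes to 4 + 2*8 = 20 bytes, which is the order of saving the comment reports for small filters (120 bytes down to 16).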