Re: High Bloom filter false ratio
There is a JMX endpoint called forceUserDefinedCompaction that takes a comma-separated list of sstables to compact together. There is also a tool called sstablemetadata (it may be in a 'cassandra-tools' package separate from whatever package you used to install Cassandra, or in the tools/ directory of your binary package). Using sstablemetadata, you can look at the maxTimestamp for each sstable, together with the 'Estimated droppable tombstones'. Using those two fields, you could very easily write a script that gives you a list of sstables to feed to forceUserDefinedCompaction to join together and eliminate leftover waste.

Your long ParNew times may be fixable by increasing the new-gen size of your heap -- the general guidance in cassandra-env.sh is out of date; you may want to reference CASSANDRA-8150 for "newer" advice ( http://issues.apache.org/jira/browse/CASSANDRA-8150 ).

- Jeff

From: Anishek Agarwal
Reply-To: "user@cassandra.apache.org"
Date: Monday, February 22, 2016 at 8:33 PM
To: "user@cassandra.apache.org"
Subject: Re: High Bloom filter false ratio

Hey Jeff,

Thanks for the clarification. I did not explain myself clearly: max_sstable_age_days is set to 30 days, and the TTL on every insert is also set to 30 days by default. gc_grace_seconds is 0, so I would think the sstable as a whole would be deleted. Because of the problems mentioned in 1) above, it looks like there might be cases where an sstable just lies around: no compaction is happening on it, and even though everything in it has expired it is still not deleted?

For 3), the average read is pretty good, though the throughput doesn't seem to be that great. When no repair is running we get GCInspector pauses > 200ms every couple of hours; otherwise it's every 10-20 minutes:

INFO [ScheduledTasks:1] 2016-02-23 05:15:03,070 GCInspector.java (line 116) GC for ParNew: 205 ms for 1 collections, 1712439128 used; max is 7784628224
INFO [ScheduledTasks:1] 2016-02-23 08:30:47,709 GCInspector.java (line 116) GC for ParNew: 242 ms for 1 collections, 1819126928 used; max is 7784628224
INFO [ScheduledTasks:1] 2016-02-23 09:09:55,085 GCInspector.java (line 116) GC for ParNew: 374 ms for 1 collections, 1829660304 used; max is 7784628224
INFO [ScheduledTasks:1] 2016-02-23 09:11:21,245 GCInspector.java (line 116) GC for ParNew: 419 ms for 1 collections, 2309875224 used; max is 7784628224
INFO [ScheduledTasks:1] 2016-02-23 09:35:50,717 GCInspector.java (line 116) GC for ParNew: 231 ms for 1 collections, 2515325328 used; max is 7784628224
INFO [ScheduledTasks:1] 2016-02-23 09:38:47,194 GCInspector.java (line 116) GC for ParNew: 252 ms for 1 collections, 1724241952 used; max is 7784628224

Our read patterns depend on the bloom filters working efficiently, as we do a lot of reads for keys that may not exist: the data is time series and we segregate it on hourly boundaries from epoch.

Hey Christopher, yes, every row in the sstable that should have been deleted has "d" in that column. Also, the key for one of the rows is "key": "00080cdd5edd080006251000" -- how do I get it back to a normal readable format to recover the (long,long) composite partition key? Looks like I have to force a major compaction to delete a lot of data? Are there any other solutions?

thanks
anishek

On Mon, Feb 22, 2016 at 11:21 PM, Jeff Jirsa wrote:
> 1) getFullyExpiredSSTables in 2.0 isn't as thorough as many expect, so it's very likely that some sstables stick around longer than you expect.
>
> 2) max_sstable_age_days tells Cassandra when to stop compacting that file, not when to delete it.
>
> 3) You can change the window size using both the base_time_seconds parameter and the max_sstable_age_days parameter (use the former to set the size of the first window, and the latter to determine how long before you stop compacting that window). It's somewhat non-intuitive.
>
> Your read latencies actually look pretty reasonable. Are you sure you're not simply hitting GC pauses that cause your queries to run longer than you expect? Do you have graphs of GC time (the first derivative of total GC time is common for tools like graphite), or do you see 'GCInspector' in your logs indicating pauses > 200ms?
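Jeff's suggestion at the top of this message -- check each sstable's maximum timestamp and droppable-tombstone estimate, then hand the fully expired ones to forceUserDefinedCompaction -- can be sketched as a small script. This is only an illustration: the exact field names and layout of sstablemetadata output vary by Cassandra version, so the "Maximum timestamp" / "Estimated droppable tombstones" lines parsed below are assumptions you should adjust to your version's output.

```python
import re
import time

def is_droppable(metadata_text, max_age_days=30, min_droppable=0.95):
    """Decide whether an sstable looks fully expired, based on the text
    output of `sstablemetadata <sstable>-Data.db`.

    Assumes (version-dependent!) lines like:
      Maximum timestamp: 1449725602552000        (microseconds since epoch)
      Estimated droppable tombstones: 0.98
    """
    max_ts = re.search(r"Maximum timestamp:\s*(\d+)", metadata_text)
    droppable = re.search(r"Estimated droppable tombstones:\s*([\d.]+)", metadata_text)
    if not max_ts or not droppable:
        return False  # unparseable output: never nominate the sstable
    age_days = (time.time() - int(max_ts.group(1)) / 1e6) / 86400
    return age_days > max_age_days and float(droppable.group(1)) >= min_droppable
```

The names of the sstables for which this returns True could then be joined with commas and passed to the forceUserDefinedCompaction JMX operation, e.g. via a generic JMX client such as jmxterm; the exact MBean path also varies by version.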
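Regarding the question in this thread about turning the hex "key" printed by sstable2json back into a readable composite partition key: a CompositeType key encodes each component as a 2-byte big-endian length, the component bytes, and a one-byte end-of-component marker. A minimal decoder sketch follows; note that the 12-byte key quoted above is too short to be two 8-byte longs, so it may not actually be a (long,long) composite -- the demo below therefore uses a synthetic key rather than the one from the message.

```python
import struct

def decode_composite(hex_key):
    """Decode a CompositeType-encoded partition key into raw component bytes.

    Layout per component: 2-byte big-endian length, component bytes,
    1 end-of-component byte.
    """
    data = bytes.fromhex(hex_key)
    parts, i = [], 0
    while i < len(data):
        (length,) = struct.unpack_from(">H", data, i)
        i += 2
        parts.append(data[i:i + length])
        i += length + 1  # skip the end-of-component byte
    return parts

# A synthetic (long,long) composite key for the two longs 7 and 42:
key = "0008" + "%016x" % 7 + "00" + "0008" + "%016x" % 42 + "00"
print([struct.unpack(">q", p)[0] for p in decode_composite(key)])  # [7, 42]
```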
Isolation for atomic batch on the same partition key
Hi all,

A couple of questions about atomic (logged) batches:

1. If a logged batch contains a bunch of row mutations and all of them have the same partition key, can I assume all those changes get the same row-level isolation? According to the post here http://www.mail-archive.com/user%40cassandra.apache.org/msg42434.html, it seems that we can get strong isolation, e.g.:

BEGIN BATCH
  UPDATE a IF condition_1;
  INSERT b;
  INSERT c;
APPLY BATCH

So at any replica we expect isolation for the three changes on a, b, c (a, b, c have the same partition key k1) -- i.e. either none or all of them are visible. Can someone help confirm?

2. Say in the above batch we include two extra row mutations d and e for another partition key k2. Will the changes on (a, b, c) and (d, e) each still be applied atomically in terms of isolation? I understand there is no isolation between (a, b, c) and (d, e). I.e., is per-partition-key isolation guaranteed?

3. I assume a read at CL SERIAL or LOCAL_SERIAL will try to apply the above logged batch if it has been committed but not yet applied. Right?

Thanks
Yawei
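The batch shapes in questions 1 and 2 can be sketched as plain CQL text. The table and column names below are hypothetical; also note that because the batch above contains a conditional UPDATE (the IF clause), Cassandra additionally requires all statements in such a batch to target the same partition.

```python
# Sketch: assembling the CQL text of a logged batch (the default BATCH kind).
def logged_batch(statements):
    """Wrap CQL statements in BEGIN BATCH ... APPLY BATCH."""
    body = "\n".join("  " + s.rstrip(";") + ";" for s in statements)
    return "BEGIN BATCH\n" + body + "\nAPPLY BATCH;"

batch = logged_batch([
    # All three mutations target partition key 'k1', the case in question 1
    "UPDATE t SET a = 1 WHERE pk = 'k1' IF a = 0",
    "INSERT INTO t (pk, b) VALUES ('k1', 2)",
    "INSERT INTO t (pk, c) VALUES ('k1', 3)",
])
print(batch)
```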
[Announcement] Achilles 4.1.0 released
Hello all,

I am pleased to announce the release of Achilles 4.1.0. The biggest change is support for the new Cassandra 3.x materialized views via annotations. Achilles also enforces constraints on your views at compile time (e.g. all primary key columns of the base table must be present in the view) and generates only a SELECT query builder, since update/insert is not allowed on a materialized view.

For more details, see the wiki: https://github.com/doanduyhai/Achilles/wiki

Regards,
Duy Hai DOAN
Cassandra Calcite integration
Hi all,

For those not familiar, Apache Calcite is a data management framework that enables storage-agnostic SQL query processing. The practical implication is that by writing a relatively small amount of code, Calcite can execute a large subset of SQL queries against different backend databases.

Over the past couple of weeks I wrote a Cassandra adapter for Calcite. By just pointing Calcite at a Cassandra installation, you can execute SQL queries over the data stored in your Cassandra tables (including joins). These queries will not necessarily be efficient, as that depends entirely on how your data is modelled in the underlying CQL tables. There's a lot of work to be done, but I'm hoping this will be helpful to those who want to do a bit of exploration of their data without writing any code.

I wrote a blog post here that provides more details: http://michael.mior.ca/blog/calcite-cassandra-adapter/

Cheers,
--
Michael Mior
mm...@uwaterloo.ca
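To give a flavor of "pointing Calcite at a Cassandra installation", a model file along the following lines is how Calcite adapters are typically wired up. This is a sketch based on Calcite's general model-file conventions; the factory class name and operand keys for the Cassandra adapter are assumptions, so check the adapter's documentation for your Calcite version.

```json
{
  "version": "1.0",
  "defaultSchema": "cassandra",
  "schemas": [
    {
      "name": "cassandra",
      "type": "custom",
      "factory": "org.apache.calcite.adapter.cassandra.CassandraSchemaFactory",
      "operand": {
        "host": "localhost",
        "keyspace": "mykeyspace"
      }
    }
  ]
}
```

With a model file like this, Calcite's sqlline shell can usually connect via a JDBC URL of the form jdbc:calcite:model=model.json and run SQL against the keyspace's tables.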
Re: High Bloom filter false ratio
1) getFullyExpiredSSTables in 2.0 isn't as thorough as many expect, so it's very likely that some sstables stick around longer than you expect.

2) max_sstable_age_days tells Cassandra when to stop compacting that file, not when to delete it.

3) You can change the window size using both the base_time_seconds parameter and the max_sstable_age_days parameter (use the former to set the size of the first window, and the latter to determine how long before you stop compacting that window). It's somewhat non-intuitive.

Your read latencies actually look pretty reasonable. Are you sure you're not simply hitting GC pauses that cause your queries to run longer than you expect? Do you have graphs of GC time (the first derivative of total GC time is common for tools like graphite), or do you see 'GCInspector' in your logs indicating pauses > 200ms?

From: Anishek Agarwal
Reply-To: "user@cassandra.apache.org"
Date: Sunday, February 21, 2016 at 11:13 PM
To: "user@cassandra.apache.org"
Subject: Re: High Bloom filter false ratio

Hey guys,

Just did some more digging ... it looks like DTCS is not removing old data completely. I used sstable2json for one such table and saw old data there; we have a value of 30 for max_sstable_age_days on the table.

One of the columns showed data as ["2015-12-10 11\\:03+0530:", "56690ea2", 1449725602552000, "d"] -- what is the meaning of "d" in the last IS_MARKED_FOR_DELETE column?

I see data from 10 Dec 2015 still there, so it looks like there are a few issues with DTCS. Operationally, what choices do I have to rectify this? We are on version 2.0.15.

thanks
anishek

On Mon, Feb 22, 2016 at 10:23 AM, Anishek Agarwal wrote:
> We are using DTCS with a 30 day window before data is cleaned up. I don't think with DTCS we can do anything about table sizing. Please do let me know if there are other ideas.

On Sat, Feb 20, 2016 at 12:51 AM, Jaydeep Chovatia wrote:
> To me the following three look on the higher side:
>
> SSTable count: 1289
> In order to reduce the SSTable count, check whether you are compacting or not (if using STCS). Is it possible to change this to LCS?
>
> Number of keys (estimate): 345137664 (345M partition keys)
> I don't have any suggestion about reducing this unless you partition your data.
>
> Bloom filter space used, bytes: 493777336 (400MB is huge)
> If the number of keys is reduced then this will automatically reduce the bloom filter size, I believe.
>
> Jaydeep

On Thu, Feb 18, 2016 at 7:52 PM, Anishek Agarwal wrote:
> Hey all,
>
> @Jaydeep here is the cfstats output from one node.
>
> Read Count: 1721134722
> Read Latency: 0.04268825050756254 ms.
> Write Count: 56743880
> Write Latency: 0.014650376727851532 ms.
> Pending Tasks: 0
> Table: user_stay_points
> SSTable count: 1289
> Space used (live), bytes: 122141272262
> Space used (total), bytes: 224227850870
> Off heap memory used (total), bytes: 653827528
> SSTable Compression Ratio: 0.4959736121441446
> Number of keys (estimate): 345137664
> Memtable cell count: 339034
> Memtable data size, bytes: 106558314
> Memtable switch count: 3266
> Local read count: 1721134803
> Local read latency: 0.048 ms
> Local write count: 56743898
> Local write latency: 0.018 ms
> Pending tasks: 0
> Bloom filter false positives: 40664437
> Bloom filter false ratio: 0.69058
> Bloom filter space used, bytes: 493777336
> Bloom filter off heap memory used, bytes: 493767024
> Index summary off heap memory used, bytes: 91677192
> Compression metadata off heap memory used, bytes: 68383312
> Compacted partition minimum bytes: 104
> Compacted partition maximum bytes: 1629722
> Compacted partition mean bytes: 1773
> Average live cells per slice (last five minutes): 0.0
> Average tombstones per slice (last five minutes): 0.0
>
> @Tyler Hobbs we are using cassandra 2.0.15, so https://issues.apache.org/jira/browse/CASSANDRA-8525 shouldn't occur. The other problems look like they will be fixed in 3.0; we will mostly try to slot in an upgrade to a 3.x version towards the second quarter of this year.
>
> @Daemon Latencies seem to have higher ratios; attached is the graph. I am mostly looking at bloom filters because of the way we do reads: we read data with non-existent partition keys and it seems to take long to respond, e.g. 720 queries take 2 seconds, with all 720 queries returning nothing. The 720 queries are issued in sequential batches of 180, with the 180 in each batch running in parallel.
>
> thanks
> anishek

On Fri, Feb 19, 2016 at 3:09 AM, Jaydeep Chovatia wrote:
> How many partition keys exist for the table which shows this problem (or provide nodetool cfstats for that table)?
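As a sanity check on the numbers quoted above: the cfstats show a bloom filter false ratio of 0.69058 with 493777336 bytes of filter for roughly 345M keys. Standard bloom filter math (not Cassandra's exact implementation) gives the false-positive rate such a filter should deliver, which is orders of magnitude lower than what is observed:

```python
import math

def bloom_fp_rate(m_bits, n_keys):
    """Theoretical false-positive rate for an optimally-hashed bloom filter
    with m_bits bits holding n_keys entries (standard approximation)."""
    k = max(1, round(m_bits / n_keys * math.log(2)))  # optimal hash count
    return (1 - math.exp(-k * n_keys / m_bits)) ** k

# Figures from the cfstats output above: ~11.4 bits per key
m = 493777336 * 8   # bloom filter space, in bits
n = 345137664       # estimated number of keys
print(bloom_fp_rate(m, n))  # ≈ 0.004, versus the observed 0.69058
```

So a filter of this size should yield well under 1% false positives; an observed ratio near 0.7 suggests the filters are not behaving as sized, rather than simply being too small.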
Re: High Bloom filter false ratio
Does every record in the SSTable have a "d" column?

On Mon, Feb 22, 2016 at 2:14 AM Anishek Agarwal wrote:
> Hey guys,
>
> Just did some more digging ... it looks like DTCS is not removing old data completely. I used sstable2json for one such table and saw old data there; we have a value of 30 for max_sstable_age_days on the table.
>
> One of the columns showed data as ["2015-12-10 11\\:03+0530:", "56690ea2", 1449725602552000, "d"] -- what is the meaning of "d" in the last IS_MARKED_FOR_DELETE column?
>
> I see data from 10 Dec 2015 still there, so it looks like there are a few issues with DTCS. Operationally, what choices do I have to rectify this? We are on version 2.0.15.
>
> thanks
> anishek

On Thu, Feb 18, 2016 at 11:38 AM, daemeon reiydelle wrote:
> The bloom filter buckets the values in a small number of buckets. I have been surprised by how many cases I see with large cardinality where a few values populate a given bloom leaf, resulting in high false positives, and a surprising impact on latencies!
>
> Are you seeing 2:1 ranges between mean and worst-case latencies (allowing for GC times)?
>
> Daemeon Reiydelle

On Feb 18, 2016 8:57 AM, "Tyler Hobbs" wrote:
> You can try