Re: [External] Re: Understanding when Cassandra drops expired time series data

2016-06-17 Thread jerome
Also, just out of curiosity: if the TTL is stored internally in the cell, does 
that mean Cassandra checks the timestamp every time it reads the data to 
determine if it's valid?


Thanks,

Jerome


From: jerome <jeromefroel...@hotmail.com>
Sent: Friday, June 17, 2016 4:32:47 PM
To: user@cassandra.apache.org
Subject: Re: [External] Re: Understanding when Cassandra drops expired time 
series data


Got it. Thanks again! But in versions of Cassandra after 2.0.15, 
getFullyExpiredSSTables is more in line with what people would consider to be 
fully expired (i.e., any SSTable whose newest value is past the table's default 
TTL)?


From: Jeff Jirsa <jeff.ji...@crowdstrike.com>
Sent: Friday, June 17, 2016 4:21:02 PM
To: user@cassandra.apache.org
Subject: Re: [External] Re: Understanding when Cassandra drops expired time 
series data

Correcting myself - https://issues.apache.org/jira/browse/CASSANDRA-9882 made 
it check for fully expired sstables no more than once every 10 minutes (it 
still happens on flush as described, just not on EVERY flush). Went in 2.0.17 / 
2.1.9.
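
In pseudocode form, the cadence looks roughly like this. This is only a sketch 
of the behavior described above; the class and method names are made up, not 
the actual 9882 patch:

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical gate: the flush path asks whether this flush should also
// look for fully expired sstables; the answer is yes at most once per
// 10 minutes, per CASSANDRA-9882.
class ExpiredSSTableCheckGate {
    private static final long INTERVAL_MS = 10 * 60 * 1000L;
    private final AtomicLong lastCheckMs = new AtomicLong(0);

    boolean shouldCheckOnFlush(long nowMs) {
        long last = lastCheckMs.get();
        if (nowMs - last < INTERVAL_MS)
            return false; // checked recently; skip on this flush
        // Only one concurrent flusher wins the race to run the check.
        return lastCheckMs.compareAndSet(last, nowMs);
    }
}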


-  Jeff

From: Jeff Jirsa <jeff.ji...@crowdstrike.com>
Reply-To: <user@cassandra.apache.org>
Date: Friday, June 17, 2016 at 1:05 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: [External] Re: Understanding when Cassandra drops expired time series 
data

Anytime the memtable is flushed to disk, the compaction strategy will look for 
sstables to compact. If there exist fully expired sstables (again, where 
getFullyExpiredSSTables agrees that they’re fully expired, which may not be 
when YOU think they’re fully expired), they’ll be included/dropped.



From: jerome <jeromefroel...@hotmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Friday, June 17, 2016 at 1:02 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Understanding when Cassandra drops expired time series data


Hi Jeff,



Thanks for the information! That helps clear up a lot of things for me. We've 
had issues with SSTables not being dropped despite being past their TTLs (no 
read repairs or deletes), and I think issue 9572 explains why (definitely going 
to look into updating soon). Just to clarify: in the current version of 
Cassandra, when do fully expired SSTables get dropped? Is it when a minor 
compaction runs, or is it separate from minor compactions? Also, thanks for the 
link to the slides, great stuff!



Jerome


From: Jeff Jirsa <jeff.ji...@crowdstrike.com>
Sent: Friday, June 17, 2016 3:25:07 PM
To: user@cassandra.apache.org
Subject: Re: Understanding when Cassandra drops expired time series data

First, DTCS in 2.0.15 has some weird behaviors - 
https://issues.apache.org/jira/browse/CASSANDRA-9572 .

That said, some other general notes:

Data deleted by TTL isn’t the same as issuing a delete – each expiring cell 
internally has a ttl/timestamp at which it will be converted into a tombstone. 
There is no tombstone added to the memtable, or flushed to disk – it just 
treats the expired cells as tombstones once they’re past that timestamp.
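
A rough model of that rule, with made-up field names rather than Cassandra's 
actual cell classes, just to show the shape of the check:

// Hypothetical sketch of an expiring cell: the expiration time is fixed
// at write time, and reads simply compare it to the current time.
class ExpiringCellSketch {
    final long writeTimestampMicros;   // the cell's write timestamp
    final int ttlSeconds;              // from USING TTL or the table default
    final int localExpirationSeconds;  // write time (in seconds) + ttl

    ExpiringCellSketch(long writeTimestampMicros, int ttlSeconds) {
        this.writeTimestampMicros = writeTimestampMicros;
        this.ttlSeconds = ttlSeconds;
        this.localExpirationSeconds =
            (int) (writeTimestampMicros / 1_000_000) + ttlSeconds;
    }

    // No tombstone is ever written for this cell; once the expiration
    // time passes, reads and compaction treat it as if it were one.
    boolean isTombstoneAt(int nowSeconds) {
        return nowSeconds >= localExpirationSeconds;
    }
}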

Cassandra’s getFullyExpiredSSTables() will consider an sstable fully expired if 
(and only if) all cells within that sstable are expired (current time > max 
timestamp) AND the sstable’s timestamps don’t overlap with those of sstables 
that aren’t fully expired. Björn talks about this in 
https://issues.apache.org/jira/browse/CASSANDRA-8243 - the intent here is that 
explicit deletes (which do create tombstones) won’t be GC’d from an otherwise 
fully expired sstable if they’re covering data in a more recent sstable - 
without this check, we could accidentally bring dead data back to life. In an 
append-only time series workload this would be unusual, but not impossible.
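
A simplified sketch of those two conditions (the real getFullyExpiredSSTables 
also accounts for memtables and gc_grace_seconds, so treat the names and 
signatures below as illustrative only, not the actual implementation):

import java.util.*;

// Minimal view of the sstable metadata the check needs.
interface SSTableMeta {
    long minTimestamp();        // oldest write timestamp in the sstable
    long maxTimestamp();        // newest write timestamp in the sstable
    int maxLocalDeletionTime(); // latest expiry time of any cell, in seconds
}

class FullyExpiredSketch {
    static Set<SSTableMeta> fullyExpired(Collection<SSTableMeta> candidates,
                                         Collection<SSTableMeta> others,
                                         int nowSeconds) {
        // Oldest still-live data among sstables that are NOT candidates.
        long minLiveTimestamp = Long.MAX_VALUE;
        for (SSTableMeta s : others)
            minLiveTimestamp = Math.min(minLiveTimestamp, s.minTimestamp());

        Set<SSTableMeta> expired = new HashSet<>();
        for (SSTableMeta c : candidates) {
            // Condition 1: every cell in the sstable is already expired.
            boolean allCellsExpired = c.maxLocalDeletionTime() < nowSeconds;
            // Condition 2: nothing in it is newer than the oldest live data,
            // so dropping it cannot resurrect data it shadows.
            boolean shadowsNothingNewer = c.maxTimestamp() < minLiveTimestamp;
            if (allCellsExpired && shadowsNothingNewer)
                expired.add(c);
        }
        return expired;
    }
}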

Unfortunately, read repairs (foreground/blocking, if you write with CL < ALL 
and read with CL > ONE) will cause cells written with old timestamps to be 
written into the newly flushed sstables, which creates sstables with wide gaps 
between minTimestamp and maxTimestamp (you could have a read repair pull data 
that is 23 hours old into a new sstable, and now that one sstable spans 23 
hours, and isn’t fully expired until the oldest data is 47 hours old). There’s 
an open ticket (https://issues.apache.org/jira/browse/CASSANDRA-10496) meant 
to make this behavior ‘better’ in the future by splitting those old 
read-repaired cells from the newly flushed sstables.
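
Putting numbers on the 23-hour example above, assuming a 24-hour TTL (a worked 
sketch, not Cassandra code):

// One read-repaired 23-hour-old cell pins the whole newly flushed sstable.
class ReadRepairSpanExample {
    public static void main(String[] args) {
        int ttlHours = 24;
        int oldestRepairedCellAgeHours = 23; // pulled in by read repair
        // The newest cell in the freshly flushed sstable expires 24 hours
        // from now; only then is the WHOLE sstable fully expired, at which
        // point the read-repaired cell is 23 + 24 = 47 hours old.
        int oldestAgeWhenDroppable = oldestRepairedCellAgeHours + ttlHours;
        System.out.println(oldestAgeWhenDroppable + " hours"); // prints 47
    }
}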

I gave a talk on a lot of this behavior last year at Summit 
(http://www.slideshare.net/JeffJirsa1/cassandra-summit-2015-real-world-dtcs-for-operators) 
- if you’re running time series in production on DTCS, it’s worth a glance.

 

 

 

From: jerome 
Reply-To: "user@cassandra.apache.org" 
Date: Friday, June 17, 2016 at 11:52 AM
To: "user@cassandra.apache.org" 
Subject: Understanding when Cassandra drops expired time series data

 

Hello! Recently I have been trying to familiarize myself with Cassandra but 
don't quite understand when data is removed from disk after it has been 
deleted. The use case I'm particularly interested in is expiring time series 
data with DTCS. As an example, I created the following table:
CREATE TABLE metrics (
  metric_id text,
  time timestamp,
  value double,