How does the write path find the N nodes to write to?

2023-08-30 Thread Gabriel Giussi
I know Cassandra uses consistent hashing to choose the node a key should go
to, and if I understand this image correctly
https://cassandra.apache.org/doc/latest/cassandra/_images/ring.svg
with a replication factor of 3 it simply picks the other two nodes that
follow on the ring clockwise.
I would like to know if someone can point me to where that is implemented,
because I want to implement something similar for the Finagle HTTP client.
The Finagle library already has an implementation of partitioning using
consistent hashing, but it doesn't support replication, so a key only
belongs to a single node; see
https://github.com/twitter/util/blob/develop/util-hashing/src/main/scala/com/twitter/hashing/ConsistentHashingDistributor.scala
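
For reference, this is the kind of clockwise walk I have in mind, as a
minimal Scala sketch (not Cassandra's actual code; the ring is modelled as a
sorted map of token -> node, and all names here are purely illustrative):

import scala.collection.immutable.TreeMap

object RingReplicas {
  // Pick the first `rf` distinct nodes found at or after the key's token,
  // wrapping around the ring if we run past the last token.
  def replicasFor(ring: TreeMap[Long, String], keyToken: Long, rf: Int): Seq[String] = {
    val (before, atOrAfter) = ring.partition { case (token, _) => token < keyToken }
    (atOrAfter.values ++ before.values).toSeq.distinct.take(rf)
  }
}

If Cassandra does something equivalent, a pointer to the replication
strategy code that performs this walk is exactly what I'm after.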


Thanks.


How to delete a huge partition in Cassandra 3.0.13

2019-08-12 Thread Gabriel Giussi
I've found a huge partition (~9 GB) in my Cassandra cluster because I keep
losing 3 nodes recurrently due to OutOfMemoryError:

> ERROR [SharedPool-Worker-12] 2019-08-12 11:07:45,735
> JVMStabilityInspector.java:140 - JVM state determined to be unstable.
> Exiting forcefully due to:
> java.lang.OutOfMemoryError: Java heap space
> at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57) ~[na:1.8.0_151]
> at java.nio.ByteBuffer.allocate(ByteBuffer.java:335) ~[na:1.8.0_151]
> at
> org.apache.cassandra.io.util.DataOutputBuffer.reallocate(DataOutputBuffer.java:126)
> ~[apache-cassandra-3.0.13.jar:3.0.13]
> at
> org.apache.cassandra.io.util.DataOutputBuffer.doFlush(DataOutputBuffer.java:86)
> ~[apache-cassandra-3.0.13.jar:3.0.13]
> at
> org.apache.cassandra.io.util.BufferedDataOutputStreamPlus.write(BufferedDataOutputStreamPlus.java:132)
> ~[apache-cassandra-3.0.13.jar:3.0.13]
> at
> org.apache.cassandra.io.util.BufferedDataOutputStreamPlus.write(BufferedDataOutputStreamPlus.java:151)
> ~[apache-cassandra-3.0.13.jar:3.0.13]
> at
> org.apache.cassandra.utils.ByteBufferUtil.writeWithVIntLength(ByteBufferUtil.java:297)
> ~[apache-cassandra-3.0.13.jar:3.0.13]
> at
> org.apache.cassandra.db.marshal.AbstractType.writeValue(AbstractType.java:373)
> ~[apache-cassandra-3.0.13.jar:3.0.13]
> at
> org.apache.cassandra.db.rows.BufferCell$Serializer.serialize(BufferCell.java:267)
> ~[apache-cassandra-3.0.13.jar:3.0.13]
> at
> org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:193)
> ~[apache-cassandra-3.0.13.jar:3.0.13]
> at
> org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:109)
> ~[apache-cassandra-3.0.13.jar:3.0.13]
> at
> org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:97)
> ~[apache-cassandra-3.0.13.jar:3.0.13]
> at
> org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:132)
> ~[apache-cassandra-3.0.13.jar:3.0.13]
> at
> org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:87)
> ~[apache-cassandra-3.0.13.jar:3.0.13]
> at
> org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:77)
> ~[apache-cassandra-3.0.13.jar:3.0.13]
> at
> org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$Serializer.serialize(UnfilteredPartitionIterators.java:301)
> ~[apache-cassandra-3.0.13.jar:3.0.13]
> at
> org.apache.cassandra.db.ReadResponse$LocalDataResponse.build(ReadResponse.java:145)
> ~[apache-cassandra-3.0.13.jar:3.0.13]
> at
> org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:138)
> ~[apache-cassandra-3.0.13.jar:3.0.13]
> at
> org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:134)
> ~[apache-cassandra-3.0.13.jar:3.0.13]
> at
> org.apache.cassandra.db.ReadResponse.createDataResponse(ReadResponse.java:76)
> ~[apache-cassandra-3.0.13.jar:3.0.13]
> at
> org.apache.cassandra.db.ReadCommand.createResponse(ReadCommand.java:321)
> ~[apache-cassandra-3.0.13.jar:3.0.13]
> at
> org.apache.cassandra.db.ReadCommandVerbHandler.doVerb(ReadCommandVerbHandler.java:47)
> ~[apache-cassandra-3.0.13.jar:3.0.13]
> at
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67)
> ~[apache-cassandra-3.0.13.jar:3.0.13]
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> ~[na:1.8.0_151]
> at
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164)
> ~[apache-cassandra-3.0.13.jar:3.0.13]
> at
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:136)
> [apache-cassandra-3.0.13.jar:3.0.13]
> at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105)
> [apache-cassandra-3.0.13.jar:3.0.13]
> at java.lang.Thread.run(Thread.java:748) [na:1.8.0_151]
>

From the stack trace I assume that some client is trying to read that
partition (ReadResponse), so filtering requests to this specific partition
could be a quick fix, but I think compaction will never be able to remove
this partition (I already executed a DELETE).
What can I do to delete this partition? Can I delete the SSTable directly?
Or should I upgrade the node and give Cassandra more heap?
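
For reference, the quick client-side filter I have in mind is roughly the
following (a rough Scala sketch assuming the DataStax Java driver 3.x; the
key, keyspace and table names are placeholders):

import com.datastax.driver.core.{ResultSet, Session}

object HotPartitionGuard {
  // Known-oversized partition key(s) to short-circuit instead of letting the
  // replicas run out of heap while serializing the read response.
  private val blockedKeys: Set[String] = Set("the-huge-partition-key")

  def read(session: Session, key: String): Option[ResultSet] =
    if (blockedKeys.contains(key)) None // skip the read entirely
    else Some(session.execute("SELECT * FROM my_ks.my_table WHERE key = ?", key))
}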

Thanks.


Re: How to monitor datastax driver compression performance?

2019-04-09 Thread Gabriel Giussi
Does tlp-stress allow us to define the size of the rows? I will only see the
benefit of compression in terms of request rates if the compression ratio is
significant, i.e. if it results in fewer network round trips.
Could this be done by generating bigger partitions with the -n and -p
parameters, i.e. by decreasing -p?

Also, don't you think the driver should allow configuring compression per
query? One table with wide rows could benefit from compression while another
one with a smaller payload might not.
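
As far as I can tell, the driver negotiates compression per connection at
the native-protocol level, so the only workaround I can think of is keeping
two Cluster instances and routing queries by table; a rough sketch (the
table name and routing rule are purely illustrative):

import com.datastax.driver.core.{Cluster, ProtocolOptions, Session}

class DualSessions(contactPoint: String) {
  // One session with LZ4 compression, intended for wide-row tables...
  private val compressed: Session = Cluster.builder()
    .addContactPoint(contactPoint)
    .withCompression(ProtocolOptions.Compression.LZ4)
    .build()
    .connect()

  // ...and one without compression, for small-payload tables.
  private val plain: Session = Cluster.builder()
    .addContactPoint(contactPoint)
    .build()
    .connect()

  def sessionFor(table: String): Session =
    if (table == "wide_rows_table") compressed else plain
}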

Thanks for your help Jon.


On Mon, Apr 8, 2019 at 7:13 PM, Jon Haddad () wrote:

> If it were me, I'd look at raw request rates (in terms of requests /
> second as well as request latency), network throughput and then some
> flame graphs of both the server and your application:
> https://github.com/jvm-profiling-tools/async-profiler.
>
> I've created an issue in tlp-stress to add compression options for the
> driver: https://github.com/thelastpickle/tlp-stress/issues/67.  If
> you're interested in contributing the feature I think tlp-stress will
> more or less solve the remainder of the problem for you (the load
> part, not the os numbers).
>
> Jon
>
>
>
>
> On Mon, Apr 8, 2019 at 7:26 AM Gabriel Giussi 
> wrote:
> >
> > Hi, I'm trying to test whether adding driver compression will bring me
> > any benefit.
> > I understand that the trade-off is less bandwidth but increased CPU usage
> > on both the Cassandra nodes (compression) and the client nodes
> > (decompression), but I want to know which key metrics to monitor, and
> > how, to prove that compression is giving good results.
> > I guess I should look at the latency percentiles reported by
> > com.datastax.driver.core.Metrics and at CPU usage, but what about
> > bandwidth usage and compression ratio?
> > Should I use tcpdump to capture the length of packets coming from the
> > Cassandra nodes? Would something like tcpdump -n "src port 9042 and
> > tcp[13] & 8 != 0" | sed -n "s/^.*length \(.*\).*$/\1/p" be enough?
> >
> > Thanks
>


How to monitor datastax driver compression performance?

2019-04-08 Thread Gabriel Giussi
Hi, I'm trying to test whether adding driver compression will bring me any
benefit.
I understand that the trade-off is less bandwidth but increased CPU usage on
both the Cassandra nodes (compression) and the client nodes (decompression),
but I want to know which key metrics to monitor, and how, to prove that
compression is giving good results.
I guess I should look at the latency percentiles reported by
com.datastax.driver.core.Metrics and at CPU usage, but what about bandwidth
usage and compression ratio?
Should I use tcpdump to capture the length of packets coming from the
Cassandra nodes? Would something like tcpdump -n "src port 9042 and tcp[13]
& 8 != 0" | sed -n "s/^.*length \(.*\).*$/\1/p" be enough?

Thanks


Cassandra crashed during major compaction

2018-11-07 Thread Gabriel Giussi
After a bulk load of writes to existing partition keys (with a higher
timestamp), I wanted to free disk space, suspecting that the old rows would
be in the highest levels and it would take some time until they were
compacted.
I started a major compaction, and disk usage went from ~30% to ~40% (as
expected), but after ~10 hours Cassandra crashed (*) and disk usage remains
at ~40%, even after a restart.

How can I remove the SSTables created during the failed compaction?

(*) From the logs, I understand that the JVM ran out of Java heap space.
However, I don't understand why there are multiple OutOfMemoryError entries;
shouldn't it have crashed with the first one?

ERROR [MessagingService-Incoming-/x.x.x.x] 2018-11-06 22:28:35,251
CassandraDaemon.java:207 - Exception in thread
Thread[MessagingService-Incoming-/x.x.x.x,5,main]
java.lang.OutOfMemoryError: Java heap space
at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:403)
~[apache-cassandra-3.0.13.jar:3.0.13]
at 
org.apache.cassandra.utils.ByteBufferUtil.readWithVIntLength(ByteBufferUtil.java:341)
~[apache-cassandra-3.0.13.jar:3.0.13]
at 
org.apache.cassandra.db.ReadResponse$Serializer.deserialize(ReadResponse.java:382)
~[apache-cassandra-3.0.13.jar:3.0.13]
--
at 
org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:92)
~[apache-cassandra-3.0.13.jar:3.0.13]
INFO  [HintsDispatcher:63] 2018-11-06 22:34:49,678
HintsDispatchExecutor.java:271 - Finished hinted handoff of file
9584b93c-f86e-464f-a9ba-3dd33134b7af-1541550621736-1.hints to endpoint
/y.y.y.y: 9584b93c-f86e-464f-a9ba-3dd33134b7af, partially
ERROR [MessagingService-Incoming-/z.z.z.z] 2018-11-06 22:37:11,099
CassandraDaemon.java:207 - Exception in thread
Thread[MessagingService-Incoming-/z.z.z.z,5,main]
java.lang.OutOfMemoryError: Java heap space
INFO  [ScheduledTasks:1] 2018-11-06 22:37:34,716 StatusLogger.java:56
- Sampler   0 0  0
0 0

ERROR [MessagingService-Incoming-/y.y.y.y] 2018-11-06 22:39:39,860
CassandraDaemon.java:207 - Exception in thread
Thread[MessagingService-Incoming-/y.y.y.y,5,main]
java.lang.OutOfMemoryError: Java heap space
ERROR [MessagingService-Incoming-/z.z.z.z] 2018-11-06 22:41:52,690
CassandraDaemon.java:207 - Exception in thread
Thread[MessagingService-Incoming-/z.z.z.z,5,main]
java.lang.OutOfMemoryError: Java heap space
ERROR [MessagingService-Incoming-/a.a.a.a] 2018-11-06 22:42:23,498
CassandraDaemon.java:207 - Exception in thread
Thread[MessagingService-Incoming-/a.a.a.a,5,main]
java.lang.OutOfMemoryError: Java heap space
.

ERROR [MessagingService-Incoming-/x.x.x.x] 2018-11-06 23:42:53,947
JVMStabilityInspector.java:140 - JVM state determined to be unstable.
Exiting forcefully due to:
java.lang.OutOfMemoryError: Java heap space
at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:403)
~[apache-cassandra-3.0.13.jar:3.0.13]
at 
org.apache.cassandra.utils.ByteBufferUtil.readWithVIntLength(ByteBufferUtil.java:341)
~[apache-cassandra-3.0.13.jar:3.0.13]
at 
org.apache.cassandra.db.ReadResponse$Serializer.deserialize(ReadResponse.java:382)
~[apache-cassandra-3.0.13.jar:3.0.13]
--
at 
org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:178)
~[apache-cassandra-3.0.13.jar:3.0.13]
at 
org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:92)
~[apache-cassandra-3.0.13.jar:3.0.13]
ERROR [SharedPool-Worker-6] 2018-11-06 23:42:53,947
JVMStabilityInspector.java:140 - JVM state determined to be unstable.
Exiting forcefully due to:
java.lang.OutOfMemoryError: Java heap space
ERROR [SharedPool-Worker-12] 2018-11-06 23:42:53,947
JVMStabilityInspector.java:140 - JVM state determined to be unstable.
Exiting forcefully due to:
java.lang.OutOfMemoryError: Java heap space
ERROR [SharedPool-Worker-18] 2018-11-06 23:42:53,948
JVMStabilityInspector.java:140 - JVM state determined to be unstable.
Exiting forcefully due to:
java.lang.OutOfMemoryError: Java heap space


Thanks.


Re: TTL tombstones in Cassandra using LCS are created in the same level as the TTLed data?

2018-10-04 Thread Gabriel Giussi
Hello Alain,

thanks again for answering.

Yes, I believe during the next compaction following the expiration date,
> the entry is 'transformed' into a tombstone, and lives in the SSTable that
> is the result of the compaction, on the level/bucket this SSTable is put
> into.
>

Great. However, I'm still trying to figure out a way to test this or see it
in the code. If you have any ideas I could give them a try.

I didn't understand what you meant by

> generally, it's good if you can rotate the partitions over time, not to
> reuse old partitions for example
>

Regarding garbagecollect, it is a good idea but it is not available in
version 3.0.13.

Again, I've asked this on Stack Overflow too (
https://stackoverflow.com/q/52370661/3517383), so, only if you want, you can
answer there as well and I will mark it as correct.

Cheers.

On Thu, Sep 27, 2018 at 2:11 PM, Alain RODRIGUEZ () wrote:

> Hello Gabriel,
>
> Another clue to explore would be to use the TTL as a default value if
>> that's a good fit. TTLs set at the table level with 'default_time_to_live'
>> should not generate any tombstone at all in C*3.0+. Not tested on my hand,
>> but I read about this.
>>
>
> As explained on a parallel thread, this is wrong ^, mea culpa. I believe
> the rest of my comment still stands (hopefully :)).
>
> I'm not sure what it means with "*in-place*" since SSTables are immutable.
>> [...]
>
>  My guess is that is referring to tombstones being created in the same
>> level (but different SStables) that the TTLed data during a compaction
>> triggered
>
>
> Yes, I believe during the next compaction following the expiration date,
> the entry is 'transformed' into a tombstone, and lives in the SSTable that
> is the result of the compaction, on the level/bucket this SSTable is put
> into. That's why I said 'in-place' which is indeed a bit weird for
> immutable data.
>
> As a side idea for your problem, on 'modern' versions of Cassandra (I
> don't remember the version, that's what 'modern' means ;-)), you can run
> 'nodetool garbagecollect' regularly (not necessarily frequently) during the
> off-peak period. That might use the cluster resources when you don't need
> them to claim some disk space. Also making sure that a 2 years old record
> is not being updated regularly by design would definitely help. In the
> extreme case of writing a data once (never updated) and with a TTL for
> example, I see no reason for a 2 years old data not to be evicted
> correctly. As long as the disk can grow, it should be fine.
>
> I would not be too much scared about it, as there is 'always' a way to
> remove tombstones. Yet it's good to think about the design beforehand
> indeed, generally, it's good if you can rotate the partitions over time,
> not to reuse old partitions for example.
>
> C*heers,
> ---
> Alain Rodriguez - @arodream - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On Tue, Sep 25, 2018 at 5:38 PM, Gabriel Giussi  wrote:
>
>> I'm using LCS and a relatively large TTL of 2 years for all inserted rows
>> and I'm concerned about the moment at which C* would drop the corresponding
>> tombstones (neither explicit deletes nor updates are being performed).
>>
>> From [Missing Manual for Leveled Compaction Strategy](
>> https://www.youtube.com/watch?v=-5sNVvL8RwI), [Tombstone Compactions in
>> Cassandra](https://www.youtube.com/watch?v=pher-9jqqC4) and [Deletes
>> Without Tombstones or TTLs](https://www.youtube.com/watch?v=BhGkSnBZgJA)
>> I understand that
>>
>>  - All levels except L0 contain non-overlapping SSTables, but a partition
>> key may be present in one SSTable in each level (aka distributed in all
>> levels).
>>  - For a compaction to be able to drop a tombstone it must be sure that it
>> is compacting all SSTables that contain the data, to prevent zombie data
>> (this is done by checking bloom filters). It also considers gc_grace_seconds
>>
>> So, for my particular use case (a 2-year TTL and a write-heavy load) I can
>> conclude that the TTLed data will be in the highest levels, so I'm wondering
>> when those SSTables with TTLed data will be compacted with the SSTables that
>> contain the corresponding tombstones.
>> The main question is: **Where are tombstones (from TTLs) being created? Are
>> they created at level 0, so that it will take a long time until they end up
>> in the highest levels (and hence disk space will take a long time to be
>> freed)?**
>>
>> In a comment from [About deletes and tombstones](
>> http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html)

Re: How default_time_to_live would delete rows without tombstones in Cassandra?

2018-10-01 Thread Gabriel Giussi
Hello Alain,
thanks for clarifying this topic. You had indeed warned that this should be
explored, so there is nothing to apologize for.

I've asked this on Stack Overflow too (
https://stackoverflow.com/q/52282517/3517383), so if you want to answer
there I will mark yours as the correct answer; if not, I will reference this
mail from the mailing list.

Your posts on The Last Pickle blog are really great, BTW.

Cheers.

On Thu, Sep 27, 2018 at 1:48 PM, Alain RODRIGUEZ () wrote:

> Hello Gabriel,
>
> Sorry for not answering earlier. I should have, given that I contributed to
> spreading this wrong idea. I will also try to edit my comment on the post.
> I was fooled by the piece of documentation you mentioned when answering
> this question on our blog. I probably answered it too quickly, even though
> I described it as a thing 'to explore', even saying I had not tried it
> myself.
>
> Another clue to explore would be to use the TTL as a default value if
>> that's a good fit. TTLs set at the table level with
>> 'default_time_to_live' **should not generate any tombstone at all in
>> C*3.0+**. Not tested on my hand, but I read about this.
>
>
> So my sentence above is wrong. Basically, the default can be overwritten
> by the TTL at the query level and I do not see how Cassandra could handle
> this without tombstones.
>
> I spent time on the post and it was reviewed. I believe it is reliable.
> The questions, on the other hand, are answered by me alone and, well, they
> only reflect my opinion at the moment I am asked; I sometimes find enough
> time and interest to dig into topics, sometimes a bit less. So this is
> fully on me, my apologies for the inaccuracy. I must say I am always
> afraid, when writing publicly and sharing information, of making this kind
> of mistake and misleading people. I hope the impact of this read was still
> positive for you overall.
>
>> From the example I conclude that it isn't true that `default_time_to_live`
>> does not require tombstones, at least in version 3.0.13.
>>
>
> Also, I am glad to see you did not just believe me or the DataStax
> documentation but tried it yourself. This is definitely the right approach.
>
>> But how would C* delete without tombstones? Why should this be a different
>> scenario from using a TTL per insert?
>>
>
> Yes, exactly this,
>
> C*heers.
> ---
> Alain Rodriguez - @arodream - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On Mon, Sep 17, 2018 at 2:58 PM, Gabriel Giussi  wrote:
>
>>
>> From
>> https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutDeletes.html
>>
>> > Cassandra allows you to set a default_time_to_live property for an
>> entire table. Columns and rows marked with regular TTLs are processed as
>> described above; but when a record exceeds the table-level TTL, **Cassandra
>> deletes it immediately, without tombstoning or compaction**.
>>
>> This is also answered in https://stackoverflow.com/a/50060436/3517383
>>
>> >  If a table has default_time_to_live on it then rows that exceed this
>> time limit are **deleted immediately without tombstones being written**.
>>
>> And commented in LastPickle's post About deletes and tombstones (
>> http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html#comment-3949581514
>> )
>>
>> > Another clue to explore would be to use the TTL as a default value if
>> that's a good fit. TTLs set at the table level with 'default_time_to_live'
>> **should not generate any tombstone at all in C*3.0+**. Not tested on my
>> hand, but I read about this.
>>
>> I've made the simplest test that I could imagine using
>> `LeveledCompactionStrategy`:
>>
>> CREATE KEYSPACE IF NOT EXISTS temp WITH replication = {'class':
>> 'SimpleStrategy', 'replication_factor': '1'};
>>
>> CREATE TABLE IF NOT EXISTS temp.test_ttl (
>> key text,
>> value text,
>> PRIMARY KEY (key)
>> ) WITH  compaction = { 'class': 'LeveledCompactionStrategy'}
>>   AND default_time_to_live = 180;
>>
>>  1. `INSERT INTO temp.test_ttl (key,value) VALUES ('k1','v1');`
>>  2. `nodetool flush temp`
>>  3. `sstabledump mc-1-big-Data.db`
>> [image: cassandra0.png]
>>
>>  4. wait for 180 seconds (default_time_to_live)
>>  5. `sstabledump mc-1-big-Data.db`
>> [image: cassandra1.png]
>>
>> The tombstone isn't created yet
>>  6. `nodetool compact temp`
>>  7. `sstabledump mc-2-big-Data.db`
>> [image: cassandra2.png]
>>
>> The **tombstone is 

TTL tombstones in Cassandra using LCS are created in the same level as the TTLed data?

2018-09-25 Thread Gabriel Giussi
I'm using LCS and a relatively large TTL of 2 years for all inserted rows,
and I'm concerned about the moment at which C* would drop the corresponding
tombstones (neither explicit deletes nor updates are being performed).

From [Missing Manual for Leveled Compaction Strategy](
https://www.youtube.com/watch?v=-5sNVvL8RwI), [Tombstone Compactions in
Cassandra](https://www.youtube.com/watch?v=pher-9jqqC4) and [Deletes
Without Tombstones or TTLs](https://www.youtube.com/watch?v=BhGkSnBZgJA) I
understand that

 - All levels except L0 contain non-overlapping SSTables, but a partition
key may be present in one SSTable in each level (aka distributed in all
levels).
 - For a compaction to be able to drop a tombstone it must be sure that it is
compacting all SSTables that contain the data, to prevent zombie data (this
is done by checking bloom filters). It also considers gc_grace_seconds.

So, for my particular use case (a 2-year TTL and a write-heavy load) I can
conclude that the TTLed data will be in the highest levels, so I'm wondering
when those SSTables with TTLed data will be compacted with the SSTables that
contain the corresponding tombstones.
The main question is: **Where are tombstones (from TTLs) being created? Are
they created at level 0, so that it will take a long time until they end up
in the highest levels (and hence disk space will take a long time to be
freed)?**

In a comment from [About deletes and tombstones](
http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html)
Alain says that
> Yet using TTLs helps, it reduces the chances of having data being
fragmented between SSTables that will not be compacted together any time
soon. Using any compaction strategy, if the delete comes relatively late in
the row history, as it use to happen, the 'upsert'/'insert' of the
tombstone will go to a new SSTable. It might take time for this tombstone
to get to the right compaction "bucket" (with the rest of the row) and for
Cassandra to be able to finally free space.
**My understanding is that with TTLs the tombstones is created in-place**,
thus it is often and for many reasons easier and safer to get rid of a TTLs
than from a delete.
Another clue to explore would be to use the TTL as a default value if
that's a good fit. TTLs set at the table level with 'default_time_to_live'
should not generate any tombstone at all in C*3.0+. Not tested on my hand,
but I read about this.

I'm not sure what "*in-place*" means here, since SSTables are immutable.
(I also have some doubts about what it says about using `default_time_to_live`,
which I've asked in [How default_time_to_live would delete rows without
tombstones in Cassandra?](https://stackoverflow.com/q/52282517/3517383).)

My guess is that it refers to tombstones being created in the same level (but
in different SSTables) as the TTLed data, during a compaction triggered for
one of the following reasons:

 1. "Going from highest level, any level having score higher than 1.001 can
be picked by a compaction thread" [The Missing Manual for Leveled
Compaction Strategy](
https://image.slidesharecdn.com/csummit16lcstalk-161004232416/95/the-missing-manual-for-leveled-compaction-strategy-wei-deng-ryan-svihla-datastax-cassandra-summit-2016-12-638.jpg?cb=1475693117
)
 2. "If we go 25 rounds without compacting in the highest level, we start
bringing in sstables from that level into lower level compactions" [The
Missing Manual for Leveled Compaction Strategy](
https://image.slidesharecdn.com/csummit16lcstalk-161004232416/95/the-missing-manual-for-leveled-compaction-strategy-wei-deng-ryan-svihla-datastax-cassandra-summit-2016-12-638.jpg?cb=1475693117
)
 3. "When there are no other compactions to do, we trigger a single-sstable
compaction if there is more than X% droppable tombstones in the sstable."
[CASSANDRA-7019](https://issues.apache.org/jira/browse/CASSANDRA-7019)
Since tombstones are created during compaction, I think it may be using
SSTable metadata to estimate the droppable tombstones (see the sketch below).
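
To make item 3 concrete, my understanding of that check is roughly the
following (a conceptual Scala sketch only, not Cassandra's code; I believe
the threshold corresponds to the tombstone_threshold compaction subproperty,
which defaults to 0.2):

object TombstoneCompactionCheck {
  // Estimated ratio of droppable tombstones, taken from the SSTable metadata.
  final case class SSTableTombstoneStats(estimatedDroppableTombstoneRatio: Double)

  // The SSTable is eligible for a single-SSTable tombstone compaction when
  // the estimated droppable-tombstone ratio exceeds the threshold.
  def shouldCompact(stats: SSTableTombstoneStats, threshold: Double = 0.2): Boolean =
    stats.estimatedDroppableTombstoneRatio > threshold
}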

**So, compactions (2) and (3) should be creating/dropping tombstones in the
highest levels, hence using LCS with a large TTL should not be an issue per
se.**
By creating/dropping I mean that these same kinds of compaction will be
creating tombstones for expired data and/or dropping tombstones if the GC
grace period has already passed.

A link to the source code that clarifies this situation would be great, thanks.


Fwd: How default_time_to_live would delete rows without tombstones in Cassandra?

2018-09-17 Thread Gabriel Giussi
From
https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutDeletes.html

> Cassandra allows you to set a default_time_to_live property for an entire
table. Columns and rows marked with regular TTLs are processed as described
above; but when a record exceeds the table-level TTL, **Cassandra deletes
it immediately, without tombstoning or compaction**.

This is also answered in https://stackoverflow.com/a/50060436/3517383

>  If a table has default_time_to_live on it then rows that exceed this
time limit are **deleted immediately without tombstones being written**.

And commented in LastPickle's post About deletes and tombstones (
http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html#comment-3949581514
)

> Another clue to explore would be to use the TTL as a default value if
that's a good fit. TTLs set at the table level with 'default_time_to_live'
**should not generate any tombstone at all in C*3.0+**. Not tested on my
hand, but I read about this.

I've made the simplest test that I could imagine using
`LeveledCompactionStrategy`:

CREATE KEYSPACE IF NOT EXISTS temp WITH replication = {'class':
'SimpleStrategy', 'replication_factor': '1'};

CREATE TABLE IF NOT EXISTS temp.test_ttl (
key text,
value text,
PRIMARY KEY (key)
) WITH  compaction = { 'class': 'LeveledCompactionStrategy'}
  AND default_time_to_live = 180;

 1. `INSERT INTO temp.test_ttl (key,value) VALUES ('k1','v1');`
 2. `nodetool flush temp`
 3. `sstabledump mc-1-big-Data.db`
[image: cassandra0.png]

 4. wait for 180 seconds (default_time_to_live)
 5. `sstabledump mc-1-big-Data.db`
[image: cassandra1.png]

The tombstone isn't created yet
 6. `nodetool compact temp`
 7. `sstabledump mc-2-big-Data.db`
[image: cassandra2.png]

The **tombstone is created** (and not dropped on compaction due to
gc_grace_seconds)

The test was performed using Apache Cassandra 3.0.13.

From the example I conclude that it isn't true that `default_time_to_live`
does not require tombstones, at least in version 3.0.13.
However, this is a very simple test and I'm forcing a major compaction with
`nodetool compact`, so I may not be recreating the scenario where the
default_time_to_live magic comes into play.

But how would C* delete without tombstones? Why should this be a different
scenario from using a TTL per insert?

This is also asked on Stack Overflow.