Distant future people will not be happy about this, I can already tell you now.

Sounds like a reasonable improvement to me however.

> On 23 Jun 2023, at 07:22, Berenguer Blasi <berenguerbl...@gmail.com> wrote:
> 
> Hi all,
> 
> DeletionTime.markedForDeleteAt is a long holding microseconds since the Unix 
> epoch. But I noticed that 7 bytes can already encode ~2284 years. We can 
> either shed the 8th byte, for reduced IO and disk usage, or encode some 
> sentinel values (such as LIVE) as flags there. In the latter case that would 
> mean reading and writing 1 byte instead of 12 (8 mfda long + 4 ldts int) for 
> those values. Yes, we already avoid serializing DeletionTime (DT) in sstables 
> at _row_ level entirely, but not at _partition_ level, and it is also 
> serialized in the index, metadata, etc.
> 
> So here's a POC: https://github.com/bereng/cassandra/commits/ldtdeser-trunk 
> and some JMH benchmarks (1) to evaluate the impact of the new alg (2). It's 
> tested here against 70% and 30% LIVE DTs to see how it performs:
> 
>      [java] Benchmark                                (liveDTPcParam)  (sstableParam)  Mode  Cnt  Score   Error  Units
>      [java] DeletionTimeDeSerBench.testRawAlgReads          70PcLive              NC  avgt   15  0.331 ± 0.001  ns/op
>      [java] DeletionTimeDeSerBench.testRawAlgReads          70PcLive              OA  avgt   15  0.335 ± 0.004  ns/op
>      [java] DeletionTimeDeSerBench.testRawAlgReads          30PcLive              NC  avgt   15  0.334 ± 0.002  ns/op
>      [java] DeletionTimeDeSerBench.testRawAlgReads          30PcLive              OA  avgt   15  0.340 ± 0.008  ns/op
>      [java] DeletionTimeDeSerBench.testNewAlgWrites         70PcLive              NC  avgt   15  0.337 ± 0.006  ns/op
>      [java] DeletionTimeDeSerBench.testNewAlgWrites         70PcLive              OA  avgt   15  0.340 ± 0.004  ns/op
>      [java] DeletionTimeDeSerBench.testNewAlgWrites         30PcLive              NC  avgt   15  0.339 ± 0.004  ns/op
>      [java] DeletionTimeDeSerBench.testNewAlgWrites         30PcLive              OA  avgt   15  0.343 ± 0.016  ns/op
> 
> That was ByteBuffer-backed, to test the impact of the extra bit-level 
> operations. But what would be the impact in an end-to-end test against disk?
> 
>      [java] Benchmark                                    (diskRAMParam)  (liveDTPcParam)  (sstableParam)  Mode  Cnt        Score        Error  Units
>      [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         70PcLive              NC  avgt   15   605236.515 ±  19929.058  ns/op
>      [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         70PcLive              OA  avgt   15   586477.039 ±   7384.632  ns/op
>      [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         30PcLive              NC  avgt   15   937580.311 ±  30669.647  ns/op
>      [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         30PcLive              OA  avgt   15   914097.770 ±   9865.070  ns/op
>      [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         70PcLive              NC  avgt   15  1314417.207 ±  37879.012  ns/op
>      [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         70PcLive              OA  avgt   15   805256.345 ±  15471.587  ns/op
>      [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         30PcLive              NC  avgt   15  1583239.011 ±  50104.245  ns/op
>      [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         30PcLive              OA  avgt   15  1439605.006 ±  64342.510  ns/op
>      [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         70PcLive              NC  avgt   15   295711.217 ±   5432.507  ns/op
>      [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         70PcLive              OA  avgt   15   305282.827 ±   1906.841  ns/op
>      [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         30PcLive              NC  avgt   15   446029.899 ±   4038.938  ns/op
>      [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         30PcLive              OA  avgt   15   479085.875 ±  10032.804  ns/op
>      [java] DeletionTimeDeSerBench.testE2ESerializeDT              Disk         70PcLive              NC  avgt   15  1789434.838 ± 206455.771  ns/op
>      [java] DeletionTimeDeSerBench.testE2ESerializeDT              Disk         70PcLive              OA  avgt   15   589752.861 ±  31676.265  ns/op
>      [java] DeletionTimeDeSerBench.testE2ESerializeDT              Disk         30PcLive              NC  avgt   15  1754862.122 ± 164903.051  ns/op
>      [java] DeletionTimeDeSerBench.testE2ESerializeDT              Disk         30PcLive              OA  avgt   15  1252162.253 ± 121626.818  ns/op
> 
> We can see big improvements when backed with the disk and little impact from 
> the new alg.
> 
> Given we're already introducing a new sstable format (OA) in 5.0 I would like 
> to try to get this in before the freeze. The point being that sstables with 
> lots of small partitions would benefit from a smaller DT at partition level. 
> My tests show a 3%-4% size reduction on disk.
> 
> Before proceeding though, I'd like to bounce the idea off the community. Are 
> there corner cases or scenarios I might have missed where this could be a 
> problem?
> 
> Thx in advance!
> 
> 
> (1) 
> https://github.com/bereng/cassandra/blob/ldtdeser-trunk/test/microbench/org/apache/cassandra/test/microbench/DeletionTimeDeSerBench.java
> 
> (2) 
> https://github.com/bereng/cassandra/blob/ldtdeser-trunk/src/java/org/apache/cassandra/db/DeletionTime.java#L212
> 
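
For anyone following along, here is a minimal sketch of the flag-byte idea described in the quoted message. It is not the POC code linked in (2). The class name, the choice of the top bit as the LIVE flag, and the LIVE sentinel values (Long.MIN_VALUE / Integer.MAX_VALUE) are assumptions made for illustration only. It relies on the observation above that a non-negative microsecond timestamp below 2^56 leaves the first big-endian byte of the long free.

```java
import java.nio.ByteBuffer;

// Illustrative sketch only: names, flag layout and sentinels are assumptions,
// not the actual DeletionTime serializer from the POC branch.
public class CompactDeletionTimeSketch {
    static final int LIVE_FLAG = 0x80;              // hypothetical: top bit of first byte
    static final long LIVE_MFDA = Long.MIN_VALUE;   // assumed LIVE sentinel timestamp
    static final int LIVE_LDT = Integer.MAX_VALUE;  // assumed LIVE sentinel local deletion time
    static final long MAX_MFDA = 1L << 56;          // exclusive bound: value must fit in 7 bytes

    // Writes 1 byte for LIVE, otherwise the usual 12 bytes (8 + 4).
    static void serialize(long markedForDeleteAt, int localDeletionTime, ByteBuffer out) {
        if (markedForDeleteAt == LIVE_MFDA && localDeletionTime == LIVE_LDT) {
            out.put((byte) LIVE_FLAG);
            return;
        }
        if (markedForDeleteAt < 0 || markedForDeleteAt >= MAX_MFDA)
            throw new IllegalArgumentException("timestamp does not fit in 7 bytes");
        out.putLong(markedForDeleteAt); // big-endian by default, so the first byte is 0x00
        out.putInt(localDeletionTime);
    }

    // Returns { markedForDeleteAt, localDeletionTime }.
    static long[] deserialize(ByteBuffer in) {
        int first = in.get() & 0xFF;
        if ((first & LIVE_FLAG) != 0)
            return new long[] { LIVE_MFDA, LIVE_LDT };
        long mfda = 0;                  // flag byte carries no timestamp bits in this sketch
        for (int i = 0; i < 7; i++)
            mfda = (mfda << 8) | (in.get() & 0xFF);
        int ldt = in.getInt();
        return new long[] { mfda, ldt };
    }

    // Years of microseconds representable in the given number of bytes (365.25-day years).
    static double yearsCoveredBy(int bytes) {
        return Math.pow(2, 8 * bytes) / (1e6 * 86400 * 365.25);
    }
}
```

With this layout a LIVE DT costs one byte on the wire while a real tombstone still costs twelve, so the saving depends on how many partitions are LIVE, which is why the benchmarks above vary the LIVE percentage.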
