Distant future people will not be happy about this, I can already tell you now.
Sounds like a reasonable improvement to me however.

> On 23 Jun 2023, at 07:22, Berenguer Blasi <berenguerbl...@gmail.com> wrote:
>
> Hi all,
>
> DeletionTime.markedForDeleteAt is a long useconds since Unix Epoch. But I
> noticed that with 7 bytes we can already encode ~2284 years. We can either
> shed the 8th byte, for reduced IO and disk, or can encode some sentinel
> values (such as LIVE) as flags there. That would mean reading and writing 1
> byte instead of 12 (8 mfda long + 4 ldts int). Yes we already avoid
> serializing DeletionTime (DT) in sstables at _row_ level entirely but not at
> _partition_ level and it is also serialized at index, metadata, etc.
>
> So here's a POC: https://github.com/bereng/cassandra/commits/ldtdeser-trunk
> and some jmh (1) to evaluate the impact of the new alg (2). It's tested here
> against a 70% and a 30% LIVE DTs to see how we perform:
>
> [java] Benchmark                                (liveDTPcParam)  (sstableParam)  Mode  Cnt  Score  Error    Units
> [java] DeletionTimeDeSerBench.testRawAlgReads   70PcLive         NC              avgt   15  0.331  ± 0.001  ns/op
> [java] DeletionTimeDeSerBench.testRawAlgReads   70PcLive         OA              avgt   15  0.335  ± 0.004  ns/op
> [java] DeletionTimeDeSerBench.testRawAlgReads   30PcLive         NC              avgt   15  0.334  ± 0.002  ns/op
> [java] DeletionTimeDeSerBench.testRawAlgReads   30PcLive         OA              avgt   15  0.340  ± 0.008  ns/op
> [java] DeletionTimeDeSerBench.testNewAlgWrites  70PcLive         NC              avgt   15  0.337  ± 0.006  ns/op
> [java] DeletionTimeDeSerBench.testNewAlgWrites  70PcLive         OA              avgt   15  0.340  ± 0.004  ns/op
> [java] DeletionTimeDeSerBench.testNewAlgWrites  30PcLive         NC              avgt   15  0.339  ± 0.004  ns/op
> [java] DeletionTimeDeSerBench.testNewAlgWrites  30PcLive         OA              avgt   15  0.343  ± 0.016  ns/op
>
> That was ByteBuffer backed to test the extra bit level operations impact. But
> what would be the impact of an end to end test against disk?
>
> [java] Benchmark                                    (diskRAMParam)  (liveDTPcParam)  (sstableParam)  Mode  Cnt        Score         Error  Units
> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  RAM             70PcLive         NC              avgt   15   605236.515  ±  19929.058  ns/op
> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  RAM             70PcLive         OA              avgt   15   586477.039  ±   7384.632  ns/op
> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  RAM             30PcLive         NC              avgt   15   937580.311  ±  30669.647  ns/op
> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  RAM             30PcLive         OA              avgt   15   914097.770  ±   9865.070  ns/op
> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  Disk            70PcLive         NC              avgt   15  1314417.207  ±  37879.012  ns/op
> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  Disk            70PcLive         OA              avgt   15   805256.345  ±  15471.587  ns/op
> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  Disk            30PcLive         NC              avgt   15  1583239.011  ±  50104.245  ns/op
> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT  Disk            30PcLive         OA              avgt   15  1439605.006  ±  64342.510  ns/op
> [java] DeletionTimeDeSerBench.testE2ESerializeDT    RAM             70PcLive         NC              avgt   15   295711.217  ±   5432.507  ns/op
> [java] DeletionTimeDeSerBench.testE2ESerializeDT    RAM             70PcLive         OA              avgt   15   305282.827  ±   1906.841  ns/op
> [java] DeletionTimeDeSerBench.testE2ESerializeDT    RAM             30PcLive         NC              avgt   15   446029.899  ±   4038.938  ns/op
> [java] DeletionTimeDeSerBench.testE2ESerializeDT    RAM             30PcLive         OA              avgt   15   479085.875  ±  10032.804  ns/op
> [java] DeletionTimeDeSerBench.testE2ESerializeDT    Disk            70PcLive         NC              avgt   15  1789434.838  ± 206455.771  ns/op
> [java] DeletionTimeDeSerBench.testE2ESerializeDT    Disk            70PcLive         OA              avgt   15   589752.861  ±  31676.265  ns/op
> [java] DeletionTimeDeSerBench.testE2ESerializeDT    Disk            30PcLive         NC              avgt   15  1754862.122  ± 164903.051  ns/op
> [java] DeletionTimeDeSerBench.testE2ESerializeDT    Disk            30PcLive         OA              avgt   15  1252162.253  ± 121626.818  ns/op
>
> We can see big improvements when backed with the disk and little impact from
> the new alg.
>
> Given we're already introducing a new sstable format (OA) in 5.0 I would like
> to try to get this in before the freeze. The point being that sstables with
> lots of small partitions would benefit from a smaller DT at partition level.
> My tests show a 3%-4% size reduction on disk.
>
> Before proceeding though I'd like to bounce the idea against the community
> for all the corner cases and scenarios I might have missed where this could
> be a problem?
>
> Thx in advance!
>
> (1) https://github.com/bereng/cassandra/blob/ldtdeser-trunk/test/microbench/org/apache/cassandra/test/microbench/DeletionTimeDeSerBench.java
>
> (2) https://github.com/bereng/cassandra/blob/ldtdeser-trunk/src/java/org/apache/cassandra/db/DeletionTime.java#L212
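
For anyone skimming the thread, here is a rough sketch of the idea in the quoted proposal: a timestamp that fits in 7 bytes always leaves the first byte of a big-endian long with its high bit clear, so that bit can flag a LIVE sentinel and let it be written as a single byte instead of 12. This is not the code from the POC branch (see link (2) for that). The class name, the LIVE_FLAG bit and the sentinel values below are assumptions made only for illustration.

```java
import java.nio.ByteBuffer;

// Illustrative sketch only; names and constants are hypothetical,
// not taken from the POC branch.
final class DeletionTimeSketch {
    // Assumed flag: high bit of the first byte marks the LIVE sentinel.
    static final int LIVE_FLAG = 0x80;
    // Assumed sentinel values for a "live" (not deleted) deletion time.
    static final long LIVE_MFDA = Long.MIN_VALUE;
    static final int LIVE_LDT = Integer.MAX_VALUE;
    // Largest markedForDeleteAt that fits in 7 bytes (56 bits).
    static final long MAX_7_BYTE = (1L << 56) - 1;

    // Writes 1 byte for LIVE, otherwise 8 bytes (mfda) + 4 bytes (ldt).
    static void serialize(long mfda, int ldt, ByteBuffer out) {
        if (mfda == LIVE_MFDA && ldt == LIVE_LDT) {
            out.put((byte) LIVE_FLAG);
            return;
        }
        if (mfda < 0 || mfda > MAX_7_BYTE)
            throw new IllegalArgumentException("mfda does not fit in 7 bytes: " + mfda);
        // ByteBuffer is big-endian by default, so the first byte written is
        // the (always zero) top byte and can never collide with LIVE_FLAG.
        out.putLong(mfda);
        out.putInt(ldt);
    }

    // Returns {markedForDeleteAt, localDeletionTime}.
    static long[] deserialize(ByteBuffer in) {
        int first = in.get(in.position()) & 0xFF; // peek without consuming
        if ((first & LIVE_FLAG) != 0) {
            in.get(); // consume the flag byte
            return new long[] { LIVE_MFDA, LIVE_LDT };
        }
        long mfda = in.getLong();
        int ldt = in.getInt();
        return new long[] { mfda, ldt };
    }

    // How many years of microseconds fit in 56 bits (Julian year of 365.25 days).
    static double yearsIn7Bytes() {
        double micros = Math.pow(2, 56);
        return micros / 1e6 / (365.25 * 24 * 3600);
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(32);
        serialize(LIVE_MFDA, LIVE_LDT, buf);
        System.out.println("LIVE bytes written: " + buf.position());
        System.out.println("Years in 7 bytes: " + yearsIn7Bytes());
    }
}
```

The arithmetic in yearsIn7Bytes() comes out at roughly 2283 years of microseconds, in line with the ~2284 figure quoted above. The sketch still spends the full 8 bytes on a non-LIVE timestamp. Actually shedding the 8th byte, as the proposal also mentions, would need a different layout.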