If we’re doing this, why don’t we delta encode a vint from some per-sstable minimum value? I’d expect that to commonly compress to a single byte or so.
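To make the suggestion concrete, here is a minimal sketch of what I have in mind. It is not Cassandra's actual vint format: it uses a plain LEB128-style varint for brevity, and the class name, the method names and the way the per-sstable minimum is passed in are all hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

// Hypothetical sketch: store (value - sstableMin) as an unsigned varint.
// Assumes the minimum is known when writing (e.g. tracked in sstable metadata).
final class DeltaVIntSketch {

    // Encodes value as a delta from sstableMin, 7 payload bits per byte,
    // high bit set on every byte except the last (LEB128-style, NOT Cassandra's vint).
    static byte[] encode(long value, long sstableMin) {
        long delta = value - sstableMin;
        if (delta < 0)
            throw new IllegalArgumentException("value is below the per-sstable minimum");
        ByteArrayOutputStream out = new ByteArrayOutputStream(9);
        while ((delta & ~0x7FL) != 0) {
            out.write((int) ((delta & 0x7F) | 0x80)); // more bytes follow
            delta >>>= 7;
        }
        out.write((int) delta); // final byte, high bit clear
        return out.toByteArray();
    }

    // Reads one varint from the buffer and adds the per-sstable minimum back.
    static long decode(ByteBuffer in, long sstableMin) {
        long delta = 0;
        int shift = 0;
        byte b;
        do {
            b = in.get();
            delta |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return sstableMin + delta;
    }
}
```

With this particular encoding a delta below 128 takes one byte and a delta below 16384 takes two, so how often we hit the single-byte case depends on the unit of the value and on how tightly the values in an sstable cluster around the minimum.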
> On 23 Jun 2023, at 12:55, Aleksey Yeshchenko <alek...@apple.com> wrote:
>
> Distant future people will not be happy about this, I can already tell you
> now.
>
> Sounds like a reasonable improvement to me however.
>
>> On 23 Jun 2023, at 07:22, Berenguer Blasi <berenguerbl...@gmail.com> wrote:
>>
>> Hi all,
>>
>> DeletionTime.markedForDeleteAt is a long useconds since Unix Epoch. But I
>> noticed that with 7 bytes we can already encode ~2284 years. We can either
>> shed the 8th byte, for reduced IO and disk, or can encode some sentinel
>> values (such as LIVE) as flags there. That would mean reading and writing 1
>> byte instead of 12 (8 mfda long + 4 ldts int). Yes we already avoid
>> serializing DeletionTime (DT) in sstables at _row_ level entirely but not at
>> _partition_ level and it is also serialized at index, metadata, etc.
>>
>> So here's a POC: https://github.com/bereng/cassandra/commits/ldtdeser-trunk
>> and some jmh (1) to evaluate the impact of the new alg (2). It's tested here
>> against a 70% and a 30% LIVE DTs to see how we perform:
>>
>> [java] Benchmark                                (liveDTPcParam)  (sstableParam)  Mode  Cnt  Score    Error  Units
>> [java] DeletionTimeDeSerBench.testRawAlgReads          70PcLive              NC  avgt   15  0.331 ±  0.001  ns/op
>> [java] DeletionTimeDeSerBench.testRawAlgReads          70PcLive              OA  avgt   15  0.335 ±  0.004  ns/op
>> [java] DeletionTimeDeSerBench.testRawAlgReads          30PcLive              NC  avgt   15  0.334 ±  0.002  ns/op
>> [java] DeletionTimeDeSerBench.testRawAlgReads          30PcLive              OA  avgt   15  0.340 ±  0.008  ns/op
>> [java] DeletionTimeDeSerBench.testNewAlgWrites         70PcLive              NC  avgt   15  0.337 ±  0.006  ns/op
>> [java] DeletionTimeDeSerBench.testNewAlgWrites         70PcLive              OA  avgt   15  0.340 ±  0.004  ns/op
>> [java] DeletionTimeDeSerBench.testNewAlgWrites         30PcLive              NC  avgt   15  0.339 ±  0.004  ns/op
>> [java] DeletionTimeDeSerBench.testNewAlgWrites         30PcLive              OA  avgt   15  0.343 ±  0.016  ns/op
>>
>> That was ByteBuffer backed to test the extra bit level operations impact.
>> But what would be the impact of an end to end test against disk?
>>
>> [java] Benchmark                                    (diskRAMParam)  (liveDTPcParam)  (sstableParam)  Mode  Cnt        Score        Error  Units
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         70PcLive              NC  avgt   15   605236.515 ±  19929.058  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         70PcLive              OA  avgt   15   586477.039 ±   7384.632  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         30PcLive              NC  avgt   15   937580.311 ±  30669.647  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         30PcLive              OA  avgt   15   914097.770 ±   9865.070  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         70PcLive              NC  avgt   15  1314417.207 ±  37879.012  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         70PcLive              OA  avgt   15   805256.345 ±  15471.587  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         30PcLive              NC  avgt   15  1583239.011 ±  50104.245  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         30PcLive              OA  avgt   15  1439605.006 ±  64342.510  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         70PcLive              NC  avgt   15   295711.217 ±   5432.507  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         70PcLive              OA  avgt   15   305282.827 ±   1906.841  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         30PcLive              NC  avgt   15   446029.899 ±   4038.938  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         30PcLive              OA  avgt   15   479085.875 ±  10032.804  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT              Disk         70PcLive              NC  avgt   15  1789434.838 ± 206455.771  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT              Disk         70PcLive              OA  avgt   15   589752.861 ±  31676.265  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT              Disk         30PcLive              NC  avgt   15  1754862.122 ± 164903.051  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT              Disk         30PcLive              OA  avgt   15  1252162.253 ± 121626.818  ns/op
>>
>> We can see big improvements when backed with the disk and little impact from
>> the new alg.
>>
>> Given we're already introducing a new sstable format (OA) in 5.0 I would
>> like to try to get this in before the freeze. The point being that sstables
>> with lots of small partitions would benefit from a smaller DT at partition
>> level. My tests show a 3%-4% size reduction on disk.
>>
>> Before proceeding though I'd like to bounce the idea against the community
>> for all the corner cases and scenarios I might have missed where this could
>> be a problem?
>>
>> Thx in advance!
>>
>>
>> (1)
>> https://github.com/bereng/cassandra/blob/ldtdeser-trunk/test/microbench/org/apache/cassandra/test/microbench/DeletionTimeDeSerBench.java
>>
>> (2)
>> https://github.com/bereng/cassandra/blob/ldtdeser-trunk/src/java/org/apache/cassandra/db/DeletionTime.java#L212
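
For anyone skimming the quoted proposal, the "flag in the spare top byte" idea can be sketched roughly as below. This is not the code from the POC branch: the class name, the flag value and the LIVE sentinel values (Long.MIN_VALUE and Integer.MAX_VALUE) are assumptions made for illustration only. It relies on markedForDeleteAt fitting in 7 bytes (2^56 microseconds is about 2283 years), so the most significant byte of a non-LIVE value is always zero and its top bit is free to act as a flag.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of a 1-byte LIVE encoding; not the POC implementation.
final class DeletionTimeFlagSketch {
    static final long LIVE_MFDA = Long.MIN_VALUE;   // assumed sentinel for markedForDeleteAt
    static final int LIVE_LDT = Integer.MAX_VALUE;  // assumed sentinel for localDeletionTime
    static final int LIVE_FLAG = 0x80;              // top bit of the first byte

    // Writes 1 byte for LIVE, otherwise 8 (mfda) + 4 (ldt) = 12 bytes.
    static void serialize(long mfda, int ldt, ByteBuffer out) {
        if (mfda == LIVE_MFDA && ldt == LIVE_LDT) {
            out.put((byte) LIVE_FLAG);
            return;
        }
        // Non-LIVE values must fit in 7 bytes so the flag bit stays clear.
        if ((mfda >>> 56) != 0)
            throw new IllegalArgumentException("markedForDeleteAt does not fit in 7 bytes");
        out.putLong(mfda); // big-endian by default: the first byte written is 0x00
        out.putInt(ldt);
    }

    // Returns {markedForDeleteAt, localDeletionTime}.
    static long[] deserialize(ByteBuffer in) {
        int first = in.get(in.position()) & 0xFF; // peek without consuming
        if ((first & LIVE_FLAG) != 0) {
            in.get(); // consume the flag byte
            return new long[] { LIVE_MFDA, LIVE_LDT };
        }
        long mfda = in.getLong();
        int ldt = in.getInt();
        return new long[] { mfda, ldt };
    }
}
```

In this sketch the reader only has to look at the first byte to know whether more bytes follow, which is where the "1 byte instead of 12" saving for LIVE deletion times comes from.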