> If we’re doing this, why don’t we delta encode a vint from some per-sstable
> minimum value? I’d expect that to commonly compress to a single byte or so.

+1 to this approach.
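The delta-encoding idea above can be sketched roughly as follows. This is a hypothetical, minimal illustration (not Cassandra's actual VIntCoding class), assuming a per-sstable minimum timestamp is tracked in the sstable metadata; since partition deletion timestamps cluster near that minimum, the delta usually fits in one or two bytes instead of a fixed 8-byte long:

```java
import java.io.ByteArrayOutputStream;

public class VIntDeltaSketch
{
    // Encode (markedForDeleteAt - sstableMin) as an unsigned LEB128-style
    // vint: 7 payload bits per byte, high bit set means "more bytes follow".
    public static byte[] encodeDelta(long markedForDeleteAt, long sstableMin)
    {
        long delta = markedForDeleteAt - sstableMin; // assumed non-negative
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        do
        {
            int b = (int) (delta & 0x7F);
            delta >>>= 7;
            out.write(delta == 0 ? b : b | 0x80); // continuation bit
        }
        while (delta != 0);
        return out.toByteArray();
    }

    public static void main(String[] args)
    {
        long min = 1_687_500_000_000_000L; // hypothetical per-sstable minimum, microseconds
        System.out.println(encodeDelta(min, min).length);       // delta 0   -> 1 byte
        System.out.println(encodeDelta(min + 100, min).length); // delta 100 -> 1 byte
    }
}
```

Deltas below 128 µs encode in a single byte; anything under ~2 ms still fits in two, which is where the "commonly compress to a single byte or so" expectation comes from.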
> Distant future people will not be happy about this, I can already tell
> you now.

Eh, they'll all be AI's anyway and will just rewrite the code in a background thread.

On Fri, Jun 23, 2023, at 9:02 AM, Berenguer Blasi wrote:
> It's a possibility. Though I haven't coded and benchmarked such an
> approach, and I don't think I would have the time before the freeze to
> take advantage of the sstable format change opportunity.
>
> Still, it's something that can be explored later. If we can shave a few
> extra % then that would always be great imo.
>
> On 23/6/23 13:57, Benedict wrote:
> > If we’re doing this, why don’t we delta encode a vint from some per-sstable
> > minimum value? I’d expect that to commonly compress to a single byte or so.
> >
> >> On 23 Jun 2023, at 12:55, Aleksey Yeshchenko <alek...@apple.com> wrote:
> >>
> >> Distant future people will not be happy about this, I can already tell
> >> you now.
> >>
> >> Sounds like a reasonable improvement to me however.
> >>
> >>> On 23 Jun 2023, at 07:22, Berenguer Blasi <berenguerbl...@gmail.com> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> DeletionTime.markedForDeleteAt is a long of microseconds since the Unix
> >>> epoch, but I noticed that 7 bytes can already encode ~2284 years. We can
> >>> either shed the 8th byte, for reduced IO and disk, or encode some
> >>> sentinel values (such as LIVE) as flags there. That would mean reading
> >>> and writing 1 byte instead of 12 (8-byte mfda long + 4-byte ldts int).
> >>> Yes, we already avoid serializing DeletionTime (DT) entirely at _row_
> >>> level in sstables, but not at _partition_ level, and it is also
> >>> serialized in the index, metadata, etc.
> >>>
> >>> So here's a POC:
> >>> https://github.com/bereng/cassandra/commits/ldtdeser-trunk and some jmh
> >>> benchmarks (1) to evaluate the impact of the new alg (2).
> >>> It's tested here against 70% and 30% LIVE DTs to see how we perform:
> >>>
> >>> [java] Benchmark                                (liveDTPcParam) (sstableParam) Mode Cnt Score   Error Units
> >>> [java] DeletionTimeDeSerBench.testRawAlgReads          70PcLive            NC avgt  15 0.331 ± 0.001 ns/op
> >>> [java] DeletionTimeDeSerBench.testRawAlgReads          70PcLive            OA avgt  15 0.335 ± 0.004 ns/op
> >>> [java] DeletionTimeDeSerBench.testRawAlgReads          30PcLive            NC avgt  15 0.334 ± 0.002 ns/op
> >>> [java] DeletionTimeDeSerBench.testRawAlgReads          30PcLive            OA avgt  15 0.340 ± 0.008 ns/op
> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites         70PcLive            NC avgt  15 0.337 ± 0.006 ns/op
> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites         70PcLive            OA avgt  15 0.340 ± 0.004 ns/op
> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites         30PcLive            NC avgt  15 0.339 ± 0.004 ns/op
> >>> [java] DeletionTimeDeSerBench.testNewAlgWrites         30PcLive            OA avgt  15 0.343 ± 0.016 ns/op
> >>>
> >>> That was ByteBuffer backed, to test the impact of the extra bit-level
> >>> operations. But what would be the impact of an end-to-end test against
> >>> disk?
> >>>
> >>> [java] Benchmark                                    (diskRAMParam) (liveDTPcParam) (sstableParam) Mode Cnt       Score        Error Units
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            RAM        70PcLive            NC avgt  15  605236.515 ±  19929.058 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            RAM        70PcLive            OA avgt  15  586477.039 ±   7384.632 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            RAM        30PcLive            NC avgt  15  937580.311 ±  30669.647 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            RAM        30PcLive            OA avgt  15  914097.770 ±   9865.070 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT           Disk        70PcLive            NC avgt  15 1314417.207 ±  37879.012 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT           Disk        70PcLive            OA avgt  15  805256.345 ±  15471.587 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT           Disk        30PcLive            NC avgt  15 1583239.011 ±  50104.245 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT           Disk        30PcLive            OA avgt  15 1439605.006 ±  64342.510 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT              RAM        70PcLive            NC avgt  15  295711.217 ±   5432.507 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT              RAM        70PcLive            OA avgt  15  305282.827 ±   1906.841 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT              RAM        30PcLive            NC avgt  15  446029.899 ±   4038.938 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT              RAM        30PcLive            OA avgt  15  479085.875 ±  10032.804 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT             Disk        70PcLive            NC avgt  15 1789434.838 ± 206455.771 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT             Disk        70PcLive            OA avgt  15  589752.861 ±  31676.265 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT             Disk        30PcLive            NC avgt  15 1754862.122 ± 164903.051 ns/op
> >>> [java] DeletionTimeDeSerBench.testE2ESerializeDT             Disk        30PcLive            OA avgt  15 1252162.253 ± 121626.818 ns/op
> >>>
> >>> We can see big improvements
when backed by disk, and little impact from the new alg.
> >>>
> >>> Given we're already introducing a new sstable format (OA) in 5.0, I would
> >>> like to try to get this in before the freeze. The point being that
> >>> sstables with lots of small partitions would benefit from a smaller DT
> >>> at partition level. My tests show a 3%-4% size reduction on disk.
> >>>
> >>> Before proceeding, though, I'd like to bounce the idea off the community
> >>> for any corner cases and scenarios I might have missed where this could
> >>> be a problem.
> >>>
> >>> Thx in advance!
> >>>
> >>> (1) https://github.com/bereng/cassandra/blob/ldtdeser-trunk/test/microbench/org/apache/cassandra/test/microbench/DeletionTimeDeSerBench.java
> >>>
> >>> (2) https://github.com/bereng/cassandra/blob/ldtdeser-trunk/src/java/org/apache/cassandra/db/DeletionTime.java#L212
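For readers following the thread, the flag-byte idea in the original proposal can be sketched like this. This is a hypothetical illustration only (the wire format, names, and layout here are not necessarily what the POC branch at DeletionTime.java#L212 implements): since 2^56 microseconds is ~2284 years, markedForDeleteAt fits in 7 bytes, freeing the 8th (high) byte for flags, so the common LIVE case serializes as a single sentinel byte instead of 12 bytes:

```java
import java.nio.ByteBuffer;

public class DeletionTimeSketch
{
    // High bit of the first byte set => LIVE sentinel, nothing else follows.
    // 2^56 us = 72,057,594,037,927,936 us ~= 72.06e9 s ~= 2284 years, so the
    // top byte of a real markedForDeleteAt stays clear for a very long time.
    private static final int LIVE_FLAG = 0x80;

    public static void serialize(long markedForDeleteAt, int localDeletionTime,
                                 boolean isLive, ByteBuffer out)
    {
        if (isLive)
        {
            out.put((byte) LIVE_FLAG); // 1 byte total for the common case
            return;
        }
        // 7-byte big-endian markedForDeleteAt, flag bit clear in the first byte
        for (int shift = 48; shift >= 0; shift -= 8)
            out.put((byte) (markedForDeleteAt >>> shift));
        out.putInt(localDeletionTime);     // 7 + 4 = 11 bytes for tombstones
    }

    // Returns null for LIVE, else { markedForDeleteAt, localDeletionTime }.
    public static long[] deserialize(ByteBuffer in)
    {
        byte first = in.get();
        if ((first & LIVE_FLAG) != 0)
            return null;
        long mfda = first & 0xFFL;
        for (int i = 0; i < 6; i++)
            mfda = (mfda << 8) | (in.get() & 0xFFL);
        return new long[]{ mfda, in.getInt() };
    }
}
```

This captures the claimed savings: 1 byte instead of 12 for LIVE partitions, and 11 instead of 12 for actual tombstones, which is consistent with the 3%-4% on-disk reduction reported for small-partition workloads.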