> If we’re doing this, why don’t we delta encode a vint from some per-sstable 
> minimum value? I’d expect that to commonly compress to a single byte or so.
+1 to this approach.
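
For anyone following along, here is a rough sketch of what delta-encoding against a per-sstable minimum could look like. It is only an illustration: the class and method names are made up, and it uses a plain LEB128-style varint (7 payload bits per byte) as a stand-in rather than Cassandra's own vint coding, so the real byte layout would differ.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch, not code from the POC branch.
final class DeltaVIntSketch {

    // Encode (value - min) as an unsigned varint: 7 payload bits per byte,
    // high bit set on every byte except the last one.
    static byte[] encode(long value, long min) {
        long delta = value - min;
        if (delta < 0) {
            throw new IllegalArgumentException("value is below the per-sstable minimum");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((delta & ~0x7FL) != 0) {
            out.write((int) ((delta & 0x7F) | 0x80));
            delta >>>= 7;
        }
        out.write((int) delta);
        return out.toByteArray();
    }

    // Rebuild the original value by adding the decoded delta back onto min.
    static long decode(byte[] bytes, long min) {
        long delta = 0;
        int shift = 0;
        for (byte b : bytes) {
            delta |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break;
            }
            shift += 7;
        }
        return min + delta;
    }

    public static void main(String[] args) {
        long min = 1_687_500_000_000_000L;   // assumed per-sstable minimum, in microseconds
        long value = min + 1_000_000L;       // one second after the minimum
        byte[] encoded = encode(value, min);
        System.out.println(encoded.length + " byte(s), round trip ok: "
                           + (decode(encoded, min) == value));
    }
}
```

With this particular encoding the size depends on how far a value sits from the minimum: a delta under 128 microseconds fits in one byte, while a delta of one second needs three.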

> Distant future people will not be happy about this, I can already tell you 
> now.
Eh, they'll all be AIs anyway and will just rewrite the code in a background 
thread.

On Fri, Jun 23, 2023, at 9:02 AM, Berenguer Blasi wrote:
> It's a possibility. Though I haven't coded and benchmarked such an 
> approach and I don't think I would have the time before the freeze to 
> take advantage of the sstable format change opportunity.
> 
> Still, it's something that can be explored later. If we can shave a few 
> extra % then that would always be great imo.
> 
> On 23/6/23 13:57, Benedict wrote:
> > If we’re doing this, why don’t we delta encode a vint from some per-sstable 
> > minimum value? I’d expect that to commonly compress to a single byte or so.
> >
> >> On 23 Jun 2023, at 12:55, Aleksey Yeshchenko <alek...@apple.com> wrote:
> >>
> >> Distant future people will not be happy about this, I can already tell 
> >> you now.
> >>
> >> Sounds like a reasonable improvement to me however.
> >>
> >>> On 23 Jun 2023, at 07:22, Berenguer Blasi <berenguerbl...@gmail.com> 
> >>> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> DeletionTime.markedForDeleteAt is a long holding microseconds since the 
> >>> Unix epoch. But I noticed that with 7 bytes we can already encode ~2284 
> >>> years. We can either shed the 8th byte, for reduced IO and disk, or 
> >>> encode some sentinel values (such as LIVE) as flags there. For a LIVE 
> >>> DT that would mean reading and writing 1 byte instead of 12 (8 for the 
> >>> markedForDeleteAt long + 4 for the localDeletionTime int). Yes, we 
> >>> already avoid serializing DeletionTime (DT) in sstables at _row_ level 
> >>> entirely, but not at _partition_ level, and it is also serialized in the 
> >>> index, metadata, etc.
> >>>
> >>> So here's a POC: 
> >>> https://github.com/bereng/cassandra/commits/ldtdeser-trunk and some jmh 
> >>> (1) to evaluate the impact of the new alg (2). It's tested here against 
> >>> 70% and 30% LIVE DTs to see how we perform:
> >>>
> >>>      [java] Benchmark                                (liveDTPcParam)  (sstableParam)  Mode  Cnt  Score    Error  Units
> >>>      [java] DeletionTimeDeSerBench.testRawAlgReads          70PcLive              NC  avgt   15  0.331 ± 0.001  ns/op
> >>>      [java] DeletionTimeDeSerBench.testRawAlgReads          70PcLive              OA  avgt   15  0.335 ± 0.004  ns/op
> >>>      [java] DeletionTimeDeSerBench.testRawAlgReads          30PcLive              NC  avgt   15  0.334 ± 0.002  ns/op
> >>>      [java] DeletionTimeDeSerBench.testRawAlgReads          30PcLive              OA  avgt   15  0.340 ± 0.008  ns/op
> >>>      [java] DeletionTimeDeSerBench.testNewAlgWrites         70PcLive              NC  avgt   15  0.337 ± 0.006  ns/op
> >>>      [java] DeletionTimeDeSerBench.testNewAlgWrites         70PcLive              OA  avgt   15  0.340 ± 0.004  ns/op
> >>>      [java] DeletionTimeDeSerBench.testNewAlgWrites         30PcLive              NC  avgt   15  0.339 ± 0.004  ns/op
> >>>      [java] DeletionTimeDeSerBench.testNewAlgWrites         30PcLive              OA  avgt   15  0.343 ± 0.016  ns/op
> >>>
> >>> That was ByteBuffer backed to test the extra bit level operations impact. 
> >>> But what would be the impact of an end to end test against disk?
> >>>
> >>>      [java] Benchmark                                    (diskRAMParam)  (liveDTPcParam)  (sstableParam)  Mode  Cnt        Score         Error  Units
> >>>      [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         70PcLive              NC  avgt   15   605236.515 ±  19929.058  ns/op
> >>>      [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         70PcLive              OA  avgt   15   586477.039 ±   7384.632  ns/op
> >>>      [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         30PcLive              NC  avgt   15   937580.311 ±  30669.647  ns/op
> >>>      [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         30PcLive              OA  avgt   15   914097.770 ±   9865.070  ns/op
> >>>      [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         70PcLive              NC  avgt   15  1314417.207 ±  37879.012  ns/op
> >>>      [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         70PcLive              OA  avgt   15   805256.345 ±  15471.587  ns/op
> >>>      [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         30PcLive              NC  avgt   15  1583239.011 ±  50104.245  ns/op
> >>>      [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         30PcLive              OA  avgt   15  1439605.006 ±  64342.510  ns/op
> >>>      [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         70PcLive              NC  avgt   15   295711.217 ±   5432.507  ns/op
> >>>      [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         70PcLive              OA  avgt   15   305282.827 ±   1906.841  ns/op
> >>>      [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         30PcLive              NC  avgt   15   446029.899 ±   4038.938  ns/op
> >>>      [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         30PcLive              OA  avgt   15   479085.875 ±  10032.804  ns/op
> >>>      [java] DeletionTimeDeSerBench.testE2ESerializeDT              Disk         70PcLive              NC  avgt   15  1789434.838 ± 206455.771  ns/op
> >>>      [java] DeletionTimeDeSerBench.testE2ESerializeDT              Disk         70PcLive              OA  avgt   15   589752.861 ±  31676.265  ns/op
> >>>      [java] DeletionTimeDeSerBench.testE2ESerializeDT              Disk         30PcLive              NC  avgt   15  1754862.122 ± 164903.051  ns/op
> >>>      [java] DeletionTimeDeSerBench.testE2ESerializeDT              Disk         30PcLive              OA  avgt   15  1252162.253 ± 121626.818  ns/op
> >>>
> >>> We can see big improvements when backed with the disk and little impact 
> >>> from the new alg.
> >>>
> >>> Given we're already introducing a new sstable format (OA) in 5.0 I would 
> >>> like to try to get this in before the freeze. The point being that 
> >>> sstables with lots of small partitions would benefit from a smaller DT at 
> >>> partition level. My tests show a 3%-4% size reduction on disk.
> >>>
> >>> Before proceeding, though, I'd like to bounce the idea off the 
> >>> community to catch any corner cases or scenarios I might have missed 
> >>> where this could be a problem.
> >>>
> >>> Thx in advance!
> >>>
> >>>
> >>> (1) 
> >>> https://github.com/bereng/cassandra/blob/ldtdeser-trunk/test/microbench/org/apache/cassandra/test/microbench/DeletionTimeDeSerBench.java
> >>>
> >>> (2) 
> >>> https://github.com/bereng/cassandra/blob/ldtdeser-trunk/src/java/org/apache/cassandra/db/DeletionTime.java#L212
> >>>
> 
