I would prefer we not plan on two distinct changes to this, especially when neither change is much more complex than the other. Changing this multiple times carries a modest maintenance cost.

But if others feel strongly otherwise I won’t stand in the way.

On 26 Jun 2023, at 05:49, Berenguer Blasi <berenguerbl...@gmail.com> wrote:



Thanks for the replies.

I intend to javadoc the sstable format in detail someday, and more improvements might come up then, along with the vint encoding mentioned here. But unless somebody volunteers to do that in 5.0, is anybody against my trying to get the original proposal (1-byte flags for sentinel values) in?

Regards


Distant future people will not be happy about this, I can already tell you now.
Eh, they'll all be AI's anyway and will just rewrite the code in a background thread.

LOL




On 23/6/23 15:44, Josh McKenzie wrote:
If we’re doing this, why don’t we delta encode a vint from some per-sstable minimum value? I’d expect that to commonly compress to a single byte or so.
+1 to this approach.
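
For readers following the thread, the delta-encoding idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses a generic LEB128-style unsigned varint rather than Cassandra's own vint format, and the class name, the method names, and the `sstableMin` parameter are all hypothetical. Nothing here is taken from the POC branch.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: store a timestamp as a varint-encoded delta from a
// per-sstable minimum. Generic LEB128-style varint, NOT Cassandra's vint format.
final class DeltaVarintSketch {

    // Writes a non-negative value 7 bits at a time, low bits first.
    // The high bit of each byte signals that more bytes follow.
    static void writeUnsignedVarint(long value, ByteBuffer out) {
        while ((value & ~0x7FL) != 0) {
            out.put((byte) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.put((byte) value);
    }

    // Reads back a value written by writeUnsignedVarint.
    static long readUnsignedVarint(ByteBuffer in) {
        long result = 0;
        int shift = 0;
        while (true) {
            byte b = in.get();
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
        }
    }

    // Writes (timestamp - sstableMin) and returns the number of bytes used.
    static int writeDelta(long timestamp, long sstableMin, ByteBuffer out) {
        if (timestamp < sstableMin) {
            throw new IllegalArgumentException("timestamp is below the assumed per-sstable minimum");
        }
        int start = out.position();
        writeUnsignedVarint(timestamp - sstableMin, out);
        return out.position() - start;
    }

    // Reconstructs the absolute timestamp from the stored delta.
    static long readDelta(long sstableMin, ByteBuffer in) {
        return sstableMin + readUnsignedVarint(in);
    }

    public static void main(String[] args) {
        long min = 1_687_500_000_000_000L; // illustrative value in microseconds
        ByteBuffer buf = ByteBuffer.allocate(32);
        int small = writeDelta(min + 100, min, buf);            // delta below 128
        int hour = writeDelta(min + 3_600_000_000L, min, buf);  // delta of one hour
        System.out.println(small + " byte(s), " + hour + " byte(s)");
    }
}
```

In this sketch a delta below 128 takes one byte and a delta of one hour in microseconds takes five, so the encoded size depends on how tightly the timestamps in an sstable cluster around the minimum.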

Distant future people will not be happy about this, I can already tell you now.
Eh, they'll all be AI's anyway and will just rewrite the code in a background thread.

On Fri, Jun 23, 2023, at 9:02 AM, Berenguer Blasi wrote:
It's a possibility. Though I haven't coded and benchmarked such an 
approach and I don't think I would have the time before the freeze to 
take advantage of the sstable format change opportunity.

Still, it's something that can be explored later. If we can shave a few extra 
% then that would always be great imo.

On 23/6/23 13:57, Benedict wrote:
> If we’re doing this, why don’t we delta encode a vint from some per-sstable minimum value? I’d expect that to commonly compress to a single byte or so.
>
>> On 23 Jun 2023, at 12:55, Aleksey Yeshchenko <alek...@apple.com> wrote:
>>
>> Distant future people will not be happy about this, I can already tell you now.
>>
>> Sounds like a reasonable improvement to me however.
>>
>>> On 23 Jun 2023, at 07:22, Berenguer Blasi <berenguerbl...@gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> DeletionTime.markedForDeleteAt is a long holding microseconds since the Unix epoch. But I noticed that with 7 bytes we can already encode ~2284 years. We can either shed the 8th byte, for reduced IO and disk, or encode some sentinel values (such as LIVE) as flags there. That would mean reading and writing 1 byte instead of 12 (an 8-byte mfda long + a 4-byte ldts int). Yes, we already avoid serializing DeletionTime (DT) in sstables at _row_ level entirely, but not at _partition_ level, and it is also serialized at index, metadata, etc.
>>>
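
To make the arithmetic and the flag idea above concrete, here is a minimal sketch. It is not the code from the POC branch. It assumes LIVE is modelled as the pair (Long.MIN_VALUE, Integer.MAX_VALUE) and that the top bit of the first serialized byte is used as the LIVE flag. The class name, constants, and method names are hypothetical.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of a 1-byte LIVE flag stored in the otherwise unused
// top byte of markedForDeleteAt. Not the POC's actual implementation.
final class DeletionTimeFlagSketch {

    // Assumed sentinel values for a LIVE deletion time.
    static final long LIVE_MFDA = Long.MIN_VALUE;
    static final int LIVE_LDT = Integer.MAX_VALUE;

    // Assumed flag: top bit of the first byte set means LIVE.
    static final int LIVE_FLAG = 0x80;

    // How many years 7 bytes of microseconds can represent.
    static double yearsIn7Bytes() {
        double micros = Math.pow(2, 56);
        return micros / 1_000_000.0 / (365.25 * 24 * 3600);
    }

    // Writes 1 byte for LIVE, otherwise the 8-byte long plus the 4-byte int.
    static void serialize(long mfda, int ldt, ByteBuffer out) {
        if (mfda == LIVE_MFDA && ldt == LIVE_LDT) {
            out.put((byte) LIVE_FLAG);
            return;
        }
        if ((mfda >>> 56) != 0) {
            throw new IllegalArgumentException("markedForDeleteAt does not fit in 7 bytes");
        }
        // ByteBuffer is big-endian by default, so the first byte written is the
        // top byte of the long, which is zero here and leaves the flag bit clear.
        out.putLong(mfda);
        out.putInt(ldt);
    }

    // Returns {markedForDeleteAt, localDeletionTime}.
    static long[] deserialize(ByteBuffer in) {
        int first = in.get(in.position()) & 0xFF; // peek without consuming
        if ((first & LIVE_FLAG) != 0) {
            in.get(); // consume the flag byte
            return new long[] { LIVE_MFDA, LIVE_LDT };
        }
        long mfda = in.getLong();
        int ldt = in.getInt();
        return new long[] { mfda, ldt };
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(32);
        serialize(LIVE_MFDA, LIVE_LDT, buf);
        System.out.println("LIVE bytes: " + buf.position());
        serialize(1_687_500_000_000_000L, 1_687_500_000, buf);
        System.out.println("total bytes: " + buf.position());
    }
}
```

In this sketch a LIVE value costs 1 byte and any other value still costs 12, so the saving scales with the share of LIVE deletion times.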
>>> So here's a POC: https://github.com/bereng/cassandra/commits/ldtdeser-trunk and some jmh (1) to evaluate the impact of the new alg (2). It's tested here against 70% and 30% LIVE DTs to see how we perform:
>>>
>>>      [java] Benchmark                                (liveDTPcParam)  (sstableParam)  Mode  Cnt  Score   Error  Units
>>>      [java] DeletionTimeDeSerBench.testRawAlgReads          70PcLive              NC  avgt   15  0.331 ± 0.001  ns/op
>>>      [java] DeletionTimeDeSerBench.testRawAlgReads          70PcLive              OA  avgt   15  0.335 ± 0.004  ns/op
>>>      [java] DeletionTimeDeSerBench.testRawAlgReads          30PcLive              NC  avgt   15  0.334 ± 0.002  ns/op
>>>      [java] DeletionTimeDeSerBench.testRawAlgReads          30PcLive              OA  avgt   15  0.340 ± 0.008  ns/op
>>>      [java] DeletionTimeDeSerBench.testNewAlgWrites         70PcLive              NC  avgt   15  0.337 ± 0.006  ns/op
>>>      [java] DeletionTimeDeSerBench.testNewAlgWrites         70PcLive              OA  avgt   15  0.340 ± 0.004  ns/op
>>>      [java] DeletionTimeDeSerBench.testNewAlgWrites         30PcLive              NC  avgt   15  0.339 ± 0.004  ns/op
>>>      [java] DeletionTimeDeSerBench.testNewAlgWrites         30PcLive              OA  avgt   15  0.343 ± 0.016  ns/op
>>>
>>> That was ByteBuffer backed to test the extra bit level operations impact. But what would be the impact of an end to end test against disk?
>>>
>>>      [java] Benchmark                                    (diskRAMParam)  (liveDTPcParam)  (sstableParam)  Mode  Cnt        Score        Error  Units
>>>      [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         70PcLive              NC  avgt   15   605236.515 ±  19929.058  ns/op
>>>      [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         70PcLive              OA  avgt   15   586477.039 ±   7384.632  ns/op
>>>      [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         30PcLive              NC  avgt   15   937580.311 ±  30669.647  ns/op
>>>      [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         30PcLive              OA  avgt   15   914097.770 ±   9865.070  ns/op
>>>      [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         70PcLive              NC  avgt   15  1314417.207 ±  37879.012  ns/op
>>>      [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         70PcLive              OA  avgt   15   805256.345 ±  15471.587  ns/op
>>>      [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         30PcLive              NC  avgt   15  1583239.011 ±  50104.245  ns/op
>>>      [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         30PcLive              OA  avgt   15  1439605.006 ±  64342.510  ns/op
>>>      [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         70PcLive              NC  avgt   15   295711.217 ±   5432.507  ns/op
>>>      [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         70PcLive              OA  avgt   15   305282.827 ±   1906.841  ns/op
>>>      [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         30PcLive              NC  avgt   15   446029.899 ±   4038.938  ns/op
>>>      [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         30PcLive              OA  avgt   15   479085.875 ±  10032.804  ns/op
>>>      [java] DeletionTimeDeSerBench.testE2ESerializeDT              Disk         70PcLive              NC  avgt   15  1789434.838 ± 206455.771  ns/op
>>>      [java] DeletionTimeDeSerBench.testE2ESerializeDT              Disk         70PcLive              OA  avgt   15   589752.861 ±  31676.265  ns/op
>>>      [java] DeletionTimeDeSerBench.testE2ESerializeDT              Disk         30PcLive              NC  avgt   15  1754862.122 ± 164903.051  ns/op
>>>      [java] DeletionTimeDeSerBench.testE2ESerializeDT              Disk         30PcLive              OA  avgt   15  1252162.253 ± 121626.818  ns/op
>>>
>>> We can see big improvements when backed with the disk and little impact from the new alg.
>>>
>>> Given we're already introducing a new sstable format (OA) in 5.0 I would like to try to get this in before the freeze. The point being that sstables with lots of small partitions would benefit from a smaller DT at partition level. My tests show a 3%-4% size reduction on disk.
>>>
>>> Before proceeding, though, I'd like to bounce the idea off the community to catch any corner cases and scenarios I might have missed where this could be a problem.
>>>
>>> Thx in advance!
>>>
>>>
>>>
>>>

