Hi all,
DeletionTime.markedForDeleteAt is a long holding microseconds since the
Unix epoch. However, I noticed that 7 bytes are already enough to encode
~2284 years. We can either shed the 8th byte, for reduced IO and disk
usage, or use it to encode sentinel values (such as LIVE) as flags. In
the latter case a LIVE DeletionTime (DT) would be read and written as
1 byte instead of 12 (8-byte markedForDeleteAt long + 4-byte local
deletion time int). Yes, we already avoid serializing DT in sstables at
_row_ level entirely, but not at _partition_ level, and it is also
serialized in the index, metadata, etc.
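To make the flag idea concrete, here is a minimal Java sketch. It is not the POC's code: the class name, the LIVE_FLAG bit and the stand-in LIVE sentinel values are assumptions made for illustration only. It relies on valid timestamps fitting in 7 bytes, so the first (most significant) byte of a big-endian long is always zero and its top bit is free to mark LIVE:

```java
import java.nio.ByteBuffer;

// Illustrative sketch only; names and sentinel values are hypothetical.
final class CompactDeletionTimeSketch {
    // Assumed flag: top bit of the first byte set means "LIVE, nothing follows".
    static final int LIVE_FLAG = 0x80;
    // Largest timestamp that fits in 7 bytes (56 bits).
    static final long MAX_MFDA = (1L << 56) - 1;
    // Stand-in sentinel pair representing a LIVE deletion time in this sketch.
    static final long LIVE_MFDA = Long.MIN_VALUE;
    static final int LIVE_LDT = Integer.MAX_VALUE;

    // Rough number of (Julian) years that 7 bytes of microseconds can cover.
    static double yearsCoveredBySevenBytes() {
        double microsPerYear = 365.25 * 24 * 3600 * 1_000_000d;
        return (double) (1L << 56) / microsPerYear;
    }

    // Writes 1 byte for LIVE, otherwise 12 bytes (8-byte long + 4-byte int).
    static void serialize(long mfda, int ldt, ByteBuffer out) {
        if (mfda == LIVE_MFDA && ldt == LIVE_LDT) {
            out.put((byte) LIVE_FLAG);
            return;
        }
        if (mfda < 0 || mfda > MAX_MFDA)
            throw new IllegalArgumentException("timestamp does not fit in 7 bytes: " + mfda);
        out.putLong(mfda); // big-endian by default: first byte is 0, so the flag bit is clear
        out.putInt(ldt);
    }

    // Returns {markedForDeleteAt, localDeletionTime}.
    static long[] deserialize(ByteBuffer in) {
        int first = in.get(in.position()) & 0xFF; // peek at the first byte without consuming it
        if ((first & LIVE_FLAG) != 0) {
            in.get(); // consume the single flag byte
            return new long[] { LIVE_MFDA, LIVE_LDT };
        }
        long mfda = in.getLong();
        int ldt = in.getInt();
        return new long[] { mfda, ldt };
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(32);
        serialize(LIVE_MFDA, LIVE_LDT, buf);
        System.out.println("LIVE bytes written: " + buf.position());
        int before = buf.position();
        serialize(1_700_000_000_000_000L, 1_700_000_000, buf);
        System.out.println("Non-LIVE bytes written: " + (buf.position() - before));
        System.out.printf("7 bytes of microseconds cover about %.0f years%n",
                          yearsCoveredBySevenBytes());
    }
}
```

In this sketch a non-LIVE DT still costs 12 bytes; only the sentinel case shrinks to a single byte, which is where the saving at partition level would come from.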
So here's a POC:
https://github.com/bereng/cassandra/commits/ldtdeser-trunk along with
some JMH benchmarks (1) to evaluate the impact of the new algorithm (2).
It is tested here against 70% and 30% LIVE DTs to see how it performs:
[java] Benchmark                                (liveDTPcParam)  (sstableParam)  Mode  Cnt  Score    Error  Units
[java] DeletionTimeDeSerBench.testRawAlgReads   70PcLive         NC              avgt   15  0.331 ± 0.001  ns/op
[java] DeletionTimeDeSerBench.testRawAlgReads   70PcLive         OA              avgt   15  0.335 ± 0.004  ns/op
[java] DeletionTimeDeSerBench.testRawAlgReads   30PcLive         NC              avgt   15  0.334 ± 0.002  ns/op
[java] DeletionTimeDeSerBench.testRawAlgReads   30PcLive         OA              avgt   15  0.340 ± 0.008  ns/op
[java] DeletionTimeDeSerBench.testNewAlgWrites  70PcLive         NC              avgt   15  0.337 ± 0.006  ns/op
[java] DeletionTimeDeSerBench.testNewAlgWrites  70PcLive         OA              avgt   15  0.340 ± 0.004  ns/op
[java] DeletionTimeDeSerBench.testNewAlgWrites  30PcLive         NC              avgt   15  0.339 ± 0.004  ns/op
[java] DeletionTimeDeSerBench.testNewAlgWrites  30PcLive         OA              avgt   15  0.343 ± 0.016  ns/op
Those runs were ByteBuffer-backed, to isolate the impact of the extra
bit-level operations. But what would the impact be in an end-to-end test
against disk?
[java] Benchmark                                    (diskRAMParam)  (liveDTPcParam)  (sstableParam)  Mode  Cnt        Score        Error  Units
[java] DeletionTimeDeSerBench.testE2EDeSerializeDT  RAM             70PcLive         NC              avgt   15   605236.515 ±  19929.058  ns/op
[java] DeletionTimeDeSerBench.testE2EDeSerializeDT  RAM             70PcLive         OA              avgt   15   586477.039 ±   7384.632  ns/op
[java] DeletionTimeDeSerBench.testE2EDeSerializeDT  RAM             30PcLive         NC              avgt   15   937580.311 ±  30669.647  ns/op
[java] DeletionTimeDeSerBench.testE2EDeSerializeDT  RAM             30PcLive         OA              avgt   15   914097.770 ±   9865.070  ns/op
[java] DeletionTimeDeSerBench.testE2EDeSerializeDT  Disk            70PcLive         NC              avgt   15  1314417.207 ±  37879.012  ns/op
[java] DeletionTimeDeSerBench.testE2EDeSerializeDT  Disk            70PcLive         OA              avgt   15   805256.345 ±  15471.587  ns/op
[java] DeletionTimeDeSerBench.testE2EDeSerializeDT  Disk            30PcLive         NC              avgt   15  1583239.011 ±  50104.245  ns/op
[java] DeletionTimeDeSerBench.testE2EDeSerializeDT  Disk            30PcLive         OA              avgt   15  1439605.006 ±  64342.510  ns/op
[java] DeletionTimeDeSerBench.testE2ESerializeDT    RAM             70PcLive         NC              avgt   15   295711.217 ±   5432.507  ns/op
[java] DeletionTimeDeSerBench.testE2ESerializeDT    RAM             70PcLive         OA              avgt   15   305282.827 ±   1906.841  ns/op
[java] DeletionTimeDeSerBench.testE2ESerializeDT    RAM             30PcLive         NC              avgt   15   446029.899 ±   4038.938  ns/op
[java] DeletionTimeDeSerBench.testE2ESerializeDT    RAM             30PcLive         OA              avgt   15   479085.875 ±  10032.804  ns/op
[java] DeletionTimeDeSerBench.testE2ESerializeDT    Disk            70PcLive         NC              avgt   15  1789434.838 ± 206455.771  ns/op
[java] DeletionTimeDeSerBench.testE2ESerializeDT    Disk            70PcLive         OA              avgt   15   589752.861 ±  31676.265  ns/op
[java] DeletionTimeDeSerBench.testE2ESerializeDT    Disk            30PcLive         NC              avgt   15  1754862.122 ± 164903.051  ns/op
[java] DeletionTimeDeSerBench.testE2ESerializeDT    Disk            30PcLive         OA              avgt   15  1252162.253 ± 121626.818  ns/op
We can see big improvements when backed by disk (e.g. serializing 70%
LIVE DTs drops from ~1.79 ms/op with NC to ~0.59 ms/op with OA), and
little impact from the new algorithm itself.
Given we're already introducing a new sstable format (OA) in 5.0, I would
like to try to get this in before the freeze. The point is that
sstables with lots of small partitions would benefit from a smaller DT
at partition level. My tests show a 3%-4% size reduction on disk.
Before proceeding, though, I'd like to bounce the idea off the
community: are there corner cases or scenarios I might have missed
where this could be a problem?
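For intuition on why small partitions benefit most, here is a back-of-the-envelope Java sketch. The partition size and LIVE fraction used in main are made-up inputs, not figures from the tests above, and the class and method names are hypothetical. It only counts the partition-level DT in the data file, assuming a LIVE DT shrinks from 12 bytes to 1:

```java
// Illustrative arithmetic only; all inputs are hypothetical.
final class DeletionTimeSavingsSketch {
    static final int OLD_DT_BYTES = 12;  // 8-byte long + 4-byte int
    static final int LIVE_DT_BYTES = 1;  // single flag byte for a LIVE DT

    // Fraction of on-disk size saved, given the average serialized partition
    // size in bytes and the fraction of partitions whose DT is LIVE.
    static double savedFraction(double avgPartitionBytes, double liveFraction) {
        double savedPerPartition = liveFraction * (OLD_DT_BYTES - LIVE_DT_BYTES);
        return savedPerPartition / avgPartitionBytes;
    }

    public static void main(String[] args) {
        // Hypothetical workload: 300-byte partitions, all with a LIVE DT.
        System.out.printf("small partitions: %.2f%% saved%n", 100 * savedFraction(300, 1.0));
        // Hypothetical workload: 64 KiB partitions, all with a LIVE DT.
        System.out.printf("large partitions: %.4f%% saved%n", 100 * savedFraction(64 * 1024, 1.0));
    }
}
```

The saving per partition is a fixed 11 bytes at most, so its relative weight falls quickly as partitions grow.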
Thx in advance!
(1)
https://github.com/bereng/cassandra/blob/ldtdeser-trunk/test/microbench/org/apache/cassandra/test/microbench/DeletionTimeDeSerBench.java
(2)
https://github.com/bereng/cassandra/blob/ldtdeser-trunk/src/java/org/apache/cassandra/db/DeletionTime.java#L212