florian-jobs commented on PR #2398: URL: https://github.com/apache/systemds/pull/2398#issuecomment-3846326231
We changed the `ColGroupDDCLZWBenchmark` class to use `estimateInMemorySize()` instead of `getExactSizeOnDisk()` for memory estimation. While `getExactSizeOnDisk()` returns the exact serialized size produced by `write()`, `estimateInMemorySize()` is the intended method in SystemDS for estimating the in-memory footprint of column groups. We also updated `estimateInMemorySize()` to account for the LZW metadata and the LZW mapping.

Observations from the “distributed” benchmark:

- The absolute byte numbers differ between the two modes. This is expected: the in-memory estimate includes JVM overhead, whereas the on-disk size is a compact serialization format.
- However, the qualitative behavior and relative trends are very similar between `estimateInMemorySize()` and `getExactSizeOnDisk()` across the tested (size, nUnique) points, i.e., where DDCLZW is beneficial or harmful stays consistent.
- As expected, DDCLZW tends to be unfavorable for very small inputs (fixed overhead dominates), while for larger sizes and low-to-moderate nUnique it achieves strong reductions. Around typical DDC representation boundaries (e.g., 256→257, 65536→65537) the baseline DDC memory changes noticeably, which is reflected in the reported reductions as well.

Below are the results from `benchmarkDistributed` using both modes for comparison.

```java
================================================================================
Benchmark: benchmarkDistributed using estimateInMemorySize
================================================================================
................................... Size: 100 ...................................
Size: 100 | nUnique: 2 | Entropy: 100,00% | DDC: 172 bytes | DDCLZW: 248 bytes | Memory reduction: -44,19% | De-/Compression speedup: 0,01/0,00 times
Size: 100 | nUnique: 3 | Entropy: 99,99% | DDC: 280 bytes | DDCLZW: 272 bytes | Memory reduction: 2,86% | De-/Compression speedup: 0,01/0,00 times
Size: 100 | nUnique: 5 | Entropy: 100,00% | DDC: 296 bytes | DDCLZW: 312 bytes | Memory reduction: -5,41% | De-/Compression speedup: 0,02/0,00 times
Size: 100 | nUnique: 10 | Entropy: 100,00% | DDC: 336 bytes | DDCLZW: 392 bytes | Memory reduction: -16,67% | De-/Compression speedup: 0,00/0,00 times
Size: 100 | nUnique: 20 | Entropy: 100,00% | DDC: 416 bytes | DDCLZW: 552 bytes | Memory reduction: -32,69% | De-/Compression speedup: 0,00/0,00 times
Size: 100 | nUnique: 50 | Entropy: 100,00% | DDC: 656 bytes | DDCLZW: 952 bytes | Memory reduction: -45,12% | De-/Compression speedup: 0,01/0,00 times
Size: 100 | nUnique: 100 | Entropy: 100,00% | DDC: 1056 bytes | DDCLZW: 1352 bytes | Memory reduction: -28,03% | De-/Compression speedup: 0,00/0,00 times
................................... Size: 100000 ...................................
Size: 100000 | nUnique: 2 | Entropy: 100,00% | DDC: 6420 bytes | DDCLZW: 2696 bytes | Memory reduction: 58,01% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 3 | Entropy: 100,00% | DDC: 100184 bytes | DDCLZW: 3272 bytes | Memory reduction: 96,73% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 5 | Entropy: 100,00% | DDC: 100200 bytes | DDCLZW: 4192 bytes | Memory reduction: 95,82% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 10 | Entropy: 100,00% | DDC: 100240 bytes | DDCLZW: 5872 bytes | Memory reduction: 94,14% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 20 | Entropy: 100,00% | DDC: 100320 bytes | DDCLZW: 8312 bytes | Memory reduction: 91,71% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 50 | Entropy: 100,00% | DDC: 100560 bytes | DDCLZW: 13152 bytes | Memory reduction: 86,92% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 100 | Entropy: 100,00% | DDC: 100960 bytes | DDCLZW: 18952 bytes | Memory reduction: 81,23% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 200 | Entropy: 100,00% | DDC: 101760 bytes | DDCLZW: 27352 bytes | Memory reduction: 73,12% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 256 | Entropy: 99,99% | DDC: 102208 bytes | DDCLZW: 30896 bytes | Memory reduction: 69,77% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 257 | Entropy: 100,00% | DDC: 202216 bytes | DDCLZW: 30992 bytes | Memory reduction: 84,67% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 500 | Entropy: 100,00% | DDC: 204160 bytes | DDCLZW: 44152 bytes | Memory reduction: 78,37% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 1000 | Entropy: 100,00% | DDC: 208160 bytes | DDCLZW: 64152 bytes | Memory reduction: 69,18% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 10000 | Entropy: 100,00% | DDC: 280160 bytes | DDCLZW: 240152 bytes | Memory reduction: 14,28% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 65536 | Entropy: 71,34% | DDC: 724448 bytes | DDCLZW: 787632 bytes | Memory reduction: -8,72% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 65537 | Entropy: 71,34% | DDC: 824496 bytes | DDCLZW: 787648 bytes | Memory reduction: 4,47% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 80000 | Entropy: 84,43% | DDC: 940200 bytes | DDCLZW: 960952 bytes | Memory reduction: -2,21% | De-/Compression speedup: 0,00/0,00 times

================================================================================
Benchmark: benchmarkDistributed using getExactSizeOnDisk
================================================================================
................................... Size: 100 ...................................
Size: 100 | nUnique: 2 | Entropy: 100,00% | DDC: 52 bytes | DDCLZW: 119 bytes | Memory reduction: -128,85% | De-/Compression speedup: 0,01/0,00 times
Size: 100 | nUnique: 3 | Entropy: 99,99% | DDC: 144 bytes | DDCLZW: 147 bytes | Memory reduction: -2,08% | De-/Compression speedup: 0,01/0,00 times
Size: 100 | nUnique: 5 | Entropy: 100,00% | DDC: 160 bytes | DDCLZW: 183 bytes | Memory reduction: -14,38% | De-/Compression speedup: 0,01/0,00 times
Size: 100 | nUnique: 10 | Entropy: 100,00% | DDC: 200 bytes | DDCLZW: 263 bytes | Memory reduction: -31,50% | De-/Compression speedup: 0,02/0,00 times
Size: 100 | nUnique: 20 | Entropy: 100,00% | DDC: 280 bytes | DDCLZW: 423 bytes | Memory reduction: -51,07% | De-/Compression speedup: 0,00/0,00 times
Size: 100 | nUnique: 50 | Entropy: 100,00% | DDC: 520 bytes | DDCLZW: 823 bytes | Memory reduction: -58,27% | De-/Compression speedup: 0,01/0,00 times
Size: 100 | nUnique: 100 | Entropy: 100,00% | DDC: 920 bytes | DDCLZW: 1223 bytes | Memory reduction: -32,93% | De-/Compression speedup: 0,00/0,00 times
................................... Size: 100000 ...................................
Size: 100000 | nUnique: 2 | Entropy: 100,00% | DDC: 12540 bytes | DDCLZW: 2567 bytes | Memory reduction: 79,53% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 3 | Entropy: 100,00% | DDC: 100044 bytes | DDCLZW: 3147 bytes | Memory reduction: 96,85% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 5 | Entropy: 100,00% | DDC: 100060 bytes | DDCLZW: 4063 bytes | Memory reduction: 95,94% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 10 | Entropy: 100,00% | DDC: 100100 bytes | DDCLZW: 5743 bytes | Memory reduction: 94,26% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 20 | Entropy: 100,00% | DDC: 100180 bytes | DDCLZW: 8183 bytes | Memory reduction: 91,83% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 50 | Entropy: 100,00% | DDC: 100420 bytes | DDCLZW: 13023 bytes | Memory reduction: 87,03% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 100 | Entropy: 100,00% | DDC: 100820 bytes | DDCLZW: 18823 bytes | Memory reduction: 81,33% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 200 | Entropy: 100,00% | DDC: 101620 bytes | DDCLZW: 27223 bytes | Memory reduction: 73,21% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 256 | Entropy: 99,99% | DDC: 102068 bytes | DDCLZW: 30767 bytes | Memory reduction: 69,86% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 257 | Entropy: 100,00% | DDC: 202076 bytes | DDCLZW: 30867 bytes | Memory reduction: 84,73% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 500 | Entropy: 100,00% | DDC: 204020 bytes | DDCLZW: 44023 bytes | Memory reduction: 78,42% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 1000 | Entropy: 100,00% | DDC: 208020 bytes | DDCLZW: 64023 bytes | Memory reduction: 69,22% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 10000 | Entropy: 100,00% | DDC: 280020 bytes | DDCLZW: 240023 bytes | Memory reduction: 14,28% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 65536 | Entropy: 71,34% | DDC: 724308 bytes | DDCLZW: 787507 bytes | Memory reduction: -8,73% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 65537 | Entropy: 71,34% | DDC: 824316 bytes | DDCLZW: 787519 bytes | Memory reduction: 4,46% | De-/Compression speedup: 0,00/0,00 times
Size: 100000 | nUnique: 80000 | Entropy: 84,43% | DDC: 940020 bytes | DDCLZW: 960823 bytes | Memory reduction: -2,21% | De-/Compression speedup: 0,00/0,00 times
```
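
For anyone cross-checking the tables: the "Memory reduction" column appears consistent with `(1 - DDCLZW / DDC) * 100`, so negative values mean DDCLZW is larger than plain DDC. The snippet below is a minimal sketch under that assumption. The formula is inferred from the reported rows and is not copied from `ColGroupDDCLZWBenchmark`, and the class and method names (`MemoryReduction`, `reductionPercent`) are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper, not part of SystemDS: recomputes the "Memory reduction"
// column from the DDC and DDCLZW byte counts printed by the benchmark.
class MemoryReduction {

    /**
     * Relative saving of DDCLZW over DDC in percent.
     * Positive: DDCLZW is smaller. Negative: DDCLZW is larger.
     * Assumed formula, inferred from the reported rows.
     */
    static double reductionPercent(long ddcBytes, long ddclzwBytes) {
        if (ddcBytes <= 0)
            throw new IllegalArgumentException("ddcBytes must be positive");
        return (1.0 - (double) ddclzwBytes / ddcBytes) * 100.0;
    }

    public static void main(String[] args) {
        // In-memory estimate, Size 100, nUnique 2 (reported as -44,19%)
        System.out.printf(Locale.ROOT, "%.2f%%%n", reductionPercent(172, 248));
        // In-memory estimate, Size 100000, nUnique 256 (reported as 69,77%)
        System.out.printf(Locale.ROOT, "%.2f%%%n", reductionPercent(102208, 30896));
        // In-memory estimate, Size 100000, nUnique 257 (reported as 84,67%)
        System.out.printf(Locale.ROOT, "%.2f%%%n", reductionPercent(202216, 30992));
    }
}
```

The 256→257 pair illustrates the boundary effect mentioned above: the DDCLZW size barely changes (30896 → 30992 bytes) while the DDC baseline roughly doubles (102208 → 202216 bytes), so the reported reduction jumps even though DDCLZW itself did not get better.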
