LukaDeka commented on PR #2398:
URL: https://github.com/apache/systemds/pull/2398#issuecomment-3828074855
I have just tested it with the highest "optimal" values for DDC in the
"distributed" benchmark, so with datasets like:
`[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]`
for `nUnique = 4, size = 16`.
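For reference, one way to build a dataset of this shape is sketched below. The function name `block_distributed` is hypothetical, and the handling of sizes that are not a multiple of `nUnique` is an assumption; the benchmark in the PR may do it differently.

```python
def block_distributed(size, n_unique):
    """Return `size` values in sorted runs that cover 0..n_unique-1.

    Assumption: run lengths are as even as possible when `size` is not a
    multiple of `n_unique`. The actual benchmark generator may differ.
    """
    if not 0 < n_unique <= size:
        raise ValueError("need 0 < n_unique <= size")
    # Integer scaling maps index i to the run it falls into.
    return [i * n_unique // size for i in range(size)]


print(block_distributed(16, 4))
```

For `size = 16` and `n_unique = 4` this reproduces the list shown above.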
```text
Size: 100000 | nUnique: 2 | Entropy: 100.00% | DDC: 12540 bytes | DDCLZW: 2567 bytes | Memory reduction: 79.53% | De-/Compression speedup: 0.00/0.00 times
Size: 100000 | nUnique: 3 | Entropy: 100.00% | DDC: 100044 bytes | DDCLZW: 3147 bytes | Memory reduction: 96.85% | De-/Compression speedup: 0.00/0.00 times
...
Size: 100000 | nUnique: 256 | Entropy: 99.99% | DDC: 102068 bytes | DDCLZW: 30767 bytes | Memory reduction: 69.86% | De-/Compression speedup: 0.00/0.00 times
Size: 100000 | nUnique: 257 | Entropy: 100.00% | DDC: 202076 bytes | DDCLZW: 30867 bytes | Memory reduction: 84.73% | De-/Compression speedup: 0.00/0.00 times
...
Size: 100000 | nUnique: 65536 | Entropy: 71.34% | DDC: 724308 bytes | DDCLZW: 787507 bytes | Memory reduction: -8.73% | De-/Compression speedup: 0.00/0.00 times
Size: 100000 | nUnique: 65537 | Entropy: 71.34% | DDC: 824316 bytes | DDCLZW: 787519 bytes | Memory reduction: 4.46% | De-/Compression speedup: 0.00/0.00 times
```
There is a big jump in memory reduction at the `2-3` boundary, as well as
at `256-257`. The improvement at `65536-65537` is much smaller, though.
Nevertheless, whenever `nUnique/size` approaches `7/10`, DDC and DDCLZW end up
with similar memory usage (for roughly `size > 10000`). For datasets with this
many unique values, simple compression is expected to make things worse
anyway.
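The jumps at `2-3` and `256-257` look consistent with DDC widening its per-value index at those points. The sketch below encodes index widths inferred purely from the byte counts reported above (1 bit, then 1, 2 and 3 bytes per value); it is an assumption, not taken from the SystemDS source, and both function names are hypothetical. The 4-byte branch is a guess for larger `nUnique` that the results above do not cover.

```python
def index_bytes(size, n_unique):
    """Estimate the bytes DDC spends on the index map alone.

    Assumption: widths are inferred from the reported DDC sizes, not read
    from the implementation. Dictionary and header bytes are excluded.
    """
    if n_unique <= 2:
        return (size + 7) // 8   # 1 bit per value
    if n_unique <= 256:
        return size              # 1 byte per value
    if n_unique <= 65536:
        return 2 * size          # 2 bytes per value
    if n_unique <= 2 ** 24:
        return 3 * size          # 3 bytes per value
    return 4 * size              # guess: not covered by the data above


def memory_reduction(ddc_bytes, ddclzw_bytes):
    """Memory reduction in percent, as printed in the benchmark output."""
    return 100.0 * (1.0 - ddclzw_bytes / ddc_bytes)


for n in (2, 3, 256, 257, 65536, 65537):
    print(n, index_bytes(100000, n))
print(round(memory_reduction(12540, 2567), 2))
```

Under this reading, DDC grows by the same 100008 bytes at `256-257` and at `65536-65537`, but the second step starts from a much larger base (724308 bytes), which would explain why its relative effect is small.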
I have also noticed that the entropy doesn't influence the compression rate
much, since entropy measures how the values are distributed, not how they
are arranged. So
`[0, 1, 2, 0, 1, 2]`
has the same entropy as
`[0, 1, 2, 3, 4, 5]`
with both at 100%. The percentage is calculated as
`entropy / log2(nUnique)`, i.e. the entropy divided by its maximum possible
value.
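A minimal sketch of that normalized entropy, assuming Shannon entropy over the value frequencies. The function name `normalized_entropy` is hypothetical, and returning 0 for a single distinct value is a convention chosen here to avoid dividing by `log2(1) = 0`.

```python
import math
from collections import Counter


def normalized_entropy(values):
    """Shannon entropy of the value frequencies divided by log2(nUnique)."""
    counts = Counter(values)
    n = len(values)
    if len(counts) < 2:
        return 0.0  # assumption: treat a constant (or empty) input as 0
    entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(len(counts))


a = [0, 1, 2, 0, 1, 2]
b = [0, 1, 2, 3, 4, 5]
print(f"{normalized_entropy(a):.2%} {normalized_entropy(b):.2%}")
```

Both lists are uniform over their distinct values, so both come out at the maximum, and reordering a list never changes the result because only the counts are used.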