LukaDeka commented on PR #2398: URL: https://github.com/apache/systemds/pull/2398#issuecomment-3813675888
# Update for benchmarks

## Addressing the feedback

> 1) What you are looking for is to control the entropy of your data.

I wasn't able to generate data that matches a given entropy (percentage), but I added a helper function that calculates the Shannon entropy of the given arrays. It is now displayed in the benchmarks.

> 2) You can generate data that has exploitable patterns specific to LZW.

I added `genPatternLZWOptimal`, which produces repeating patterns. Right now, it just repeats the same pattern (length 10) twice, but based on my observations, any repeating pattern is compressed very well.

> 3) Do not worry about input data that is smaller than 100 elements for these experiments.

I adjusted the sizes to `100, 1000, 10000, 40000`.

> 4) ...explicitly mention the number of distinct items you have...

`nUnique` is now displayed with the benchmarks. I also added another `for` loop so that both `nUnique` and `size` are incremented:

```r
================================================================================
Benchmark: benchmarkUniquesLZWOptimal
================================================================================
................................... Size: 100 ...................................
Size: 100 | nUnique: 2 | Entropy: 99.88% | DDC: 52 bytes | DDCLZW: 123 bytes | Memory reduction: -136.54% | De-/Compression speedup: 0.02/0.00 times
Size: 100 | nUnique: 3 | Entropy: 99.66% | DDC: 144 bytes | DDCLZW: 151 bytes | Memory reduction: -4.86% | De-/Compression speedup: 0.01/0.00 times
Size: 100 | nUnique: 5 | Entropy: 99.41% | DDC: 160 bytes | DDCLZW: 187 bytes | Memory reduction: -16.87% | De-/Compression speedup: 0.01/0.00 times
Size: 100 | nUnique: 10 | Entropy: 99.03% | DDC: 200 bytes | DDCLZW: 263 bytes | Memory reduction: -31.50% | De-/Compression speedup: 0.01/0.00 times
Size: 100 | nUnique: 20 | Entropy: 83.91% | DDC: 280 bytes | DDCLZW: 367 bytes | Memory reduction: -31.07% | De-/Compression speedup: 0.01/0.00 times
Size: 100 | nUnique: 50 | Entropy: 64.25% | DDC: 520 bytes | DDCLZW: 607 bytes | Memory reduction: -16.73% | De-/Compression speedup: 0.01/0.00 times
Size: 100 | nUnique: 100 | Entropy: 54.58% | DDC: 920 bytes | DDCLZW: 1007 bytes | Memory reduction: -9.46% | De-/Compression speedup: 0.01/0.00 times
................................... Size: 1000 ...................................
Size: 1000 | nUnique: 2 | Entropy: 99.96% | DDC: 164 bytes | DDCLZW: 355 bytes | Memory reduction: -116.46% | De-/Compression speedup: 0.00/0.00 times
Size: 1000 | nUnique: 3 | Entropy: 99.93% | DDC: 1044 bytes | DDCLZW: 439 bytes | Memory reduction: 57.95% | De-/Compression speedup: 0.00/0.00 times
Size: 1000 | nUnique: 5 | Entropy: 99.86% | DDC: 1060 bytes | DDCLZW: 527 bytes | Memory reduction: 50.28% | De-/Compression speedup: 0.00/0.00 times
Size: 1000 | nUnique: 10 | Entropy: 99.64% | DDC: 1100 bytes | DDCLZW: 659 bytes | Memory reduction: 40.09% | De-/Compression speedup: 0.00/0.00 times
Size: 1000 | nUnique: 20 | Entropy: 98.53% | DDC: 1180 bytes | DDCLZW: 911 bytes | Memory reduction: 22.80% | De-/Compression speedup: 0.00/0.00 times
Size: 1000 | nUnique: 50 | Entropy: 85.20% | DDC: 1420 bytes | DDCLZW: 1291 bytes | Memory reduction: 9.08% | De-/Compression speedup: 0.00/0.00 times
Size: 1000 | nUnique: 100 | Entropy: 72.37% | DDC: 1820 bytes | DDCLZW: 1691 bytes | Memory reduction: 7.09% | De-/Compression speedup: 0.00/0.00 times
Size: 1000 | nUnique: 200 | Entropy: 62.91% | DDC: 2620 bytes | DDCLZW: 2491 bytes | Memory reduction: 4.92% | De-/Compression speedup: 0.00/0.00 times
Size: 1000 | nUnique: 500 | Entropy: 53.63% | DDC: 6020 bytes | DDCLZW: 4891 bytes | Memory reduction: 18.75% | De-/Compression speedup: 0.00/0.00 times
Size: 1000 | nUnique: 1000 | Entropy: 48.25% | DDC: 10020 bytes | DDCLZW: 8891 bytes | Memory reduction: 11.27% | De-/Compression speedup: 0.00/0.00 times
................................... Size: 10000 ...................................
Size: 10000 | nUnique: 2 | Entropy: 99.99% | DDC: 1292 bytes | DDCLZW: 1147 bytes | Memory reduction: 11.22% | De-/Compression speedup: 0.00/0.00 times
Size: 10000 | nUnique: 3 | Entropy: 99.99% | DDC: 10044 bytes | DDCLZW: 1379 bytes | Memory reduction: 86.27% | De-/Compression speedup: 0.00/0.00 times
Size: 10000 | nUnique: 5 | Entropy: 99.98% | DDC: 10060 bytes | DDCLZW: 1719 bytes | Memory reduction: 82.91% | De-/Compression speedup: 0.00/0.00 times
Size: 10000 | nUnique: 10 | Entropy: 99.94% | DDC: 10100 bytes | DDCLZW: 2143 bytes | Memory reduction: 78.78% | De-/Compression speedup: 0.00/0.00 times
Size: 10000 | nUnique: 20 | Entropy: 99.81% | DDC: 10180 bytes | DDCLZW: 2619 bytes | Memory reduction: 74.27% | De-/Compression speedup: 0.00/0.00 times
Size: 10000 | nUnique: 50 | Entropy: 98.98% | DDC: 10420 bytes | DDCLZW: 3671 bytes | Memory reduction: 64.77% | De-/Compression speedup: 0.00/0.00 times
Size: 10000 | nUnique: 100 | Entropy: 95.94% | DDC: 10820 bytes | DDCLZW: 4047 bytes | Memory reduction: 62.60% | De-/Compression speedup: 0.00/0.00 times
Size: 10000 | nUnique: 200 | Entropy: 83.39% | DDC: 11620 bytes | DDCLZW: 4847 bytes | Memory reduction: 58.29% | De-/Compression speedup: 0.00/0.00 times
Size: 10000 | nUnique: 500 | Entropy: 71.09% | DDC: 24020 bytes | DDCLZW: 7247 bytes | Memory reduction: 69.83% | De-/Compression speedup: 0.00/0.00 times
Size: 10000 | nUnique: 1000 | Entropy: 63.96% | DDC: 28020 bytes | DDCLZW: 11247 bytes | Memory reduction: 59.86% | De-/Compression speedup: 0.00/0.00 times
................................... Size: 40000 ...................................
Size: 40000 | nUnique: 2 | Entropy: 100.00% | DDC: 5044 bytes | DDCLZW: 2319 bytes | Memory reduction: 54.02% | De-/Compression speedup: 0.00/0.00 times
Size: 40000 | nUnique: 3 | Entropy: 100.00% | DDC: 40044 bytes | DDCLZW: 2811 bytes | Memory reduction: 92.98% | De-/Compression speedup: 0.00/0.00 times
Size: 40000 | nUnique: 5 | Entropy: 99.99% | DDC: 40060 bytes | DDCLZW: 3463 bytes | Memory reduction: 91.36% | De-/Compression speedup: 0.00/0.00 times
Size: 40000 | nUnique: 10 | Entropy: 99.98% | DDC: 40100 bytes | DDCLZW: 4227 bytes | Memory reduction: 89.46% | De-/Compression speedup: 0.00/0.00 times
Size: 40000 | nUnique: 20 | Entropy: 99.95% | DDC: 40180 bytes | DDCLZW: 5319 bytes | Memory reduction: 86.76% | De-/Compression speedup: 0.00/0.00 times
Size: 40000 | nUnique: 50 | Entropy: 99.74% | DDC: 40420 bytes | DDCLZW: 7307 bytes | Memory reduction: 81.92% | De-/Compression speedup: 0.00/0.00 times
Size: 40000 | nUnique: 100 | Entropy: 99.09% | DDC: 40820 bytes | DDCLZW: 8927 bytes | Memory reduction: 78.13% | De-/Compression speedup: 0.00/0.00 times
Size: 40000 | nUnique: 200 | Entropy: 96.36% | DDC: 41620 bytes | DDCLZW: 8367 bytes | Memory reduction: 79.90% | De-/Compression speedup: 0.00/0.00 times
Size: 40000 | nUnique: 500 | Entropy: 82.16% | DDC: 84020 bytes | DDCLZW: 10767 bytes | Memory reduction: 87.19% | De-/Compression speedup: 0.00/0.00 times
Size: 40000 | nUnique: 1000 | Entropy: 73.91% | DDC: 88020 bytes | DDCLZW: 14767 bytes | Memory reduction: 83.22% | De-/Compression speedup: 0.00/0.00 times
```

## Remarks

The main difficulty was judging which benchmarks are useful, since most of my entropy values were high or close to the maximum.

Also, `benchmarkGetIdx` doesn't make sense right now: the time signatures of DDC and DDCLZW don't match because of the "on-the-fly" sequential decompression. The method could be swapped out trivially, though, so I kept it.

I also commented out `benchmarkSlice` since it didn't look useful.
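For readers who want to reproduce the `Entropy` column, here is a minimal Python sketch of an empirical Shannon-entropy calculation. It is not the helper from the PR: the function names are hypothetical, and the normalization (dividing by `log2` of the number of distinct observed values to get a percentage-like fraction) is an assumption, since the comment does not say how the percentage is derived.

```python
import math
from collections import Counter


def shannon_entropy_bits(values):
    """Empirical Shannon entropy of a sequence, in bits per symbol."""
    n = len(values)
    if n == 0:
        return 0.0
    counts = Counter(values)
    # H = -sum(p_i * log2(p_i)) over the observed relative frequencies p_i
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def normalized_entropy(values):
    """Entropy as a fraction of its maximum log2(k), k = distinct observed values.

    ASSUMPTION: this normalization is illustrative only; the PR's helper
    may normalize differently.
    """
    k = len(set(values))
    if k <= 1:
        return 0.0  # a constant (or empty) sequence carries no information
    return shannon_entropy_bits(values) / math.log2(k)


if __name__ == "__main__":
    data = [0, 1, 2, 3] * 25  # perfectly balanced, 4 distinct values
    print(f"{normalized_entropy(data):.2%}")
```

A frequency-based entropy like this ignores ordering, so a strictly periodic sequence can still score close to the maximum even though LZW compresses it well. That may explain why high entropy values and large memory reductions appear side by side in the output.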
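To illustrate why a positional lookup behaves differently on an LZW-coded stream, here is a small textbook LZW sketch in Python. It is not the DDCLZW implementation from the PR; the names `lzw_compress`, `lzw_decode_iter`, and `get_idx` are hypothetical, and symbols are assumed to be integers in `range(n_unique)`.

```python
def lzw_compress(symbols, n_unique):
    """Textbook LZW over integer symbols in range(n_unique); returns a list of codes."""
    table = {(s,): s for s in range(n_unique)}
    next_code = n_unique
    codes = []
    w = ()
    for s in symbols:
        ws = w + (s,)
        if ws in table:
            w = ws  # keep extending the current phrase
        else:
            codes.append(table[w])
            table[ws] = next_code  # learn the new phrase
            next_code += 1
            w = (s,)
    if w:
        codes.append(table[w])
    return codes


def lzw_decode_iter(codes, n_unique):
    """Yield the original symbols one by one, rebuilding the table on the fly."""
    table = {s: (s,) for s in range(n_unique)}
    next_code = n_unique
    prev = None
    for code in codes:
        if code in table:
            entry = table[code]
        else:
            # Code not yet in the table: the phrase is prev + its own first symbol.
            entry = prev + (prev[0],)
        if prev is not None:
            table[next_code] = prev + (entry[0],)
            next_code += 1
        yield from entry
        prev = entry


def get_idx(codes, n_unique, i):
    """Positional lookup: has to decode sequentially until position i is reached."""
    for pos, s in enumerate(lzw_decode_iter(codes, n_unique)):
        if pos == i:
            return s
    raise IndexError(i)


if __name__ == "__main__":
    data = list(range(10)) * 100  # a length-10 pattern repeated 100 times
    codes = lzw_compress(data, 10)
    print(len(data), "symbols ->", len(codes), "codes")
    print(get_idx(codes, 10, 537) == data[537])
```

In this sketch `get_idx` costs time proportional to `i`, whereas an uncompressed index array can answer the same query in constant time, so a per-element timing comparison between the two is not like-for-like.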
