LukaDeka commented on PR #2398:
URL: https://github.com/apache/systemds/pull/2398#issuecomment-3813675888

   # Update for benchmarks
   
   ## Addressing the feedback
   
   > 1) What you are looking for is to control the entropy of your data.
   
   I wasn't able to generate data that matches a given target entropy (as a percentage), but I added a helper function that calculates the Shannon entropy of the given arrays. It is now displayed in the benchmarks.
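
For reference, a minimal sketch of how such an entropy helper could look. The class and method names are hypothetical, and the normalization (dividing by `log2` of the number of distinct values to get a percentage) is an assumption; the helper in this PR may normalize differently.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not the code from this PR.
final class EntropySketch {

    /** Shannon entropy in bits per symbol: H = -sum(p_i * log2(p_i)). */
    static double shannonEntropy(int[] data) {
        if (data.length == 0)
            return 0.0;
        // Count how often each distinct value occurs.
        Map<Integer, Integer> counts = new HashMap<>();
        for (int v : data)
            counts.merge(v, 1, Integer::sum);
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / data.length;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    /**
     * Entropy as a percentage of the maximum log2(k) for k distinct values.
     * ASSUMPTION: this normalization is a guess at what "Entropy: xx%" means.
     */
    static double entropyPercent(int[] data) {
        long k = Arrays.stream(data).distinct().count();
        if (k <= 1)
            return 0.0; // a constant (or empty) array carries no information
        double hMax = Math.log(k) / Math.log(2);
        return 100.0 * shannonEntropy(data) / hMax;
    }
}
```

Under this normalization, an array in which every distinct value occurs equally often reaches 100%, and skewed distributions fall below that.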
   
   > 2) You can generate data that has exploitable patterns specific to LZW.
   
   I added `genPatternLZWOptimal`, which generates repeating patterns. Right now, it just repeats the same pattern (length 10) twice, but based on my observations, any repeating pattern compresses very well.
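
To illustrate the idea, here is a rough sketch of a repeating-pattern generator. It is not the actual `genPatternLZWOptimal`; the class name, the signature, and the choice to tile a fixed base pattern across the whole array are assumptions made for illustration.

```java
// Hypothetical generator, not the code from this PR.
final class PatternSketch {

    /**
     * Fills an array of length `size` by tiling a base pattern of length
     * `patternLen` whose entries cycle through the ids 0..nUnique-1.
     * Note: if patternLen < nUnique, only patternLen distinct ids appear.
     */
    static int[] genRepeatingPattern(int size, int nUnique, int patternLen) {
        if (size < 0 || nUnique < 1 || patternLen < 1)
            throw new IllegalArgumentException("size >= 0, nUnique >= 1, patternLen >= 1 required");
        // Build the base pattern once.
        int[] pattern = new int[patternLen];
        for (int i = 0; i < patternLen; i++)
            pattern[i] = i % nUnique;
        // Repeat it until the output is full.
        int[] out = new int[size];
        for (int i = 0; i < size; i++)
            out[i] = pattern[i % patternLen];
        return out;
    }
}
```

Because every stretch of `patternLen` entries recurs verbatim, an LZW dictionary quickly learns long phrases, which matches the observation that repeating patterns compress well.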
   
   > 3) Do not worry about input data that is smaller than 100 elements for 
these experiments.
   
   I adjusted the sizes to `100, 1000, 10000, 40000`.
   
   > 4) ...explicitly mention the number of distinct items you have...
   
   `nUnique` is now displayed with the benchmarks.
   
   I also added another `for` loop so that both `nUnique` and `size` are 
incremented:
   ```text
   ================================================================================
   Benchmark: benchmarkUniquesLZWOptimal
   ================================================================================

   ................................... Size: 100 ...................................
   Size:     100 | nUnique:    2 | Entropy:  99.88% | DDC:      52 bytes | DDCLZW:     123 bytes | Memory reduction: -136.54% | De-/Compression speedup: 0.02/0.00 times
   Size:     100 | nUnique:    3 | Entropy:  99.66% | DDC:     144 bytes | DDCLZW:     151 bytes | Memory reduction:   -4.86% | De-/Compression speedup: 0.01/0.00 times
   Size:     100 | nUnique:    5 | Entropy:  99.41% | DDC:     160 bytes | DDCLZW:     187 bytes | Memory reduction:  -16.87% | De-/Compression speedup: 0.01/0.00 times
   Size:     100 | nUnique:   10 | Entropy:  99.03% | DDC:     200 bytes | DDCLZW:     263 bytes | Memory reduction:  -31.50% | De-/Compression speedup: 0.01/0.00 times
   Size:     100 | nUnique:   20 | Entropy:  83.91% | DDC:     280 bytes | DDCLZW:     367 bytes | Memory reduction:  -31.07% | De-/Compression speedup: 0.01/0.00 times
   Size:     100 | nUnique:   50 | Entropy:  64.25% | DDC:     520 bytes | DDCLZW:     607 bytes | Memory reduction:  -16.73% | De-/Compression speedup: 0.01/0.00 times
   Size:     100 | nUnique:  100 | Entropy:  54.58% | DDC:     920 bytes | DDCLZW:    1007 bytes | Memory reduction:   -9.46% | De-/Compression speedup: 0.01/0.00 times
   ................................... Size: 1000 ...................................
   Size:    1000 | nUnique:    2 | Entropy:  99.96% | DDC:     164 bytes | DDCLZW:     355 bytes | Memory reduction: -116.46% | De-/Compression speedup: 0.00/0.00 times
   Size:    1000 | nUnique:    3 | Entropy:  99.93% | DDC:    1044 bytes | DDCLZW:     439 bytes | Memory reduction:   57.95% | De-/Compression speedup: 0.00/0.00 times
   Size:    1000 | nUnique:    5 | Entropy:  99.86% | DDC:    1060 bytes | DDCLZW:     527 bytes | Memory reduction:   50.28% | De-/Compression speedup: 0.00/0.00 times
   Size:    1000 | nUnique:   10 | Entropy:  99.64% | DDC:    1100 bytes | DDCLZW:     659 bytes | Memory reduction:   40.09% | De-/Compression speedup: 0.00/0.00 times
   Size:    1000 | nUnique:   20 | Entropy:  98.53% | DDC:    1180 bytes | DDCLZW:     911 bytes | Memory reduction:   22.80% | De-/Compression speedup: 0.00/0.00 times
   Size:    1000 | nUnique:   50 | Entropy:  85.20% | DDC:    1420 bytes | DDCLZW:    1291 bytes | Memory reduction:    9.08% | De-/Compression speedup: 0.00/0.00 times
   Size:    1000 | nUnique:  100 | Entropy:  72.37% | DDC:    1820 bytes | DDCLZW:    1691 bytes | Memory reduction:    7.09% | De-/Compression speedup: 0.00/0.00 times
   Size:    1000 | nUnique:  200 | Entropy:  62.91% | DDC:    2620 bytes | DDCLZW:    2491 bytes | Memory reduction:    4.92% | De-/Compression speedup: 0.00/0.00 times
   Size:    1000 | nUnique:  500 | Entropy:  53.63% | DDC:    6020 bytes | DDCLZW:    4891 bytes | Memory reduction:   18.75% | De-/Compression speedup: 0.00/0.00 times
   Size:    1000 | nUnique: 1000 | Entropy:  48.25% | DDC:   10020 bytes | DDCLZW:    8891 bytes | Memory reduction:   11.27% | De-/Compression speedup: 0.00/0.00 times
   ................................... Size: 10000 ...................................
   Size:   10000 | nUnique:    2 | Entropy:  99.99% | DDC:    1292 bytes | DDCLZW:    1147 bytes | Memory reduction:   11.22% | De-/Compression speedup: 0.00/0.00 times
   Size:   10000 | nUnique:    3 | Entropy:  99.99% | DDC:   10044 bytes | DDCLZW:    1379 bytes | Memory reduction:   86.27% | De-/Compression speedup: 0.00/0.00 times
   Size:   10000 | nUnique:    5 | Entropy:  99.98% | DDC:   10060 bytes | DDCLZW:    1719 bytes | Memory reduction:   82.91% | De-/Compression speedup: 0.00/0.00 times
   Size:   10000 | nUnique:   10 | Entropy:  99.94% | DDC:   10100 bytes | DDCLZW:    2143 bytes | Memory reduction:   78.78% | De-/Compression speedup: 0.00/0.00 times
   Size:   10000 | nUnique:   20 | Entropy:  99.81% | DDC:   10180 bytes | DDCLZW:    2619 bytes | Memory reduction:   74.27% | De-/Compression speedup: 0.00/0.00 times
   Size:   10000 | nUnique:   50 | Entropy:  98.98% | DDC:   10420 bytes | DDCLZW:    3671 bytes | Memory reduction:   64.77% | De-/Compression speedup: 0.00/0.00 times
   Size:   10000 | nUnique:  100 | Entropy:  95.94% | DDC:   10820 bytes | DDCLZW:    4047 bytes | Memory reduction:   62.60% | De-/Compression speedup: 0.00/0.00 times
   Size:   10000 | nUnique:  200 | Entropy:  83.39% | DDC:   11620 bytes | DDCLZW:    4847 bytes | Memory reduction:   58.29% | De-/Compression speedup: 0.00/0.00 times
   Size:   10000 | nUnique:  500 | Entropy:  71.09% | DDC:   24020 bytes | DDCLZW:    7247 bytes | Memory reduction:   69.83% | De-/Compression speedup: 0.00/0.00 times
   Size:   10000 | nUnique: 1000 | Entropy:  63.96% | DDC:   28020 bytes | DDCLZW:   11247 bytes | Memory reduction:   59.86% | De-/Compression speedup: 0.00/0.00 times
   ................................... Size: 40000 ...................................
   Size:   40000 | nUnique:    2 | Entropy: 100.00% | DDC:    5044 bytes | DDCLZW:    2319 bytes | Memory reduction:   54.02% | De-/Compression speedup: 0.00/0.00 times
   Size:   40000 | nUnique:    3 | Entropy: 100.00% | DDC:   40044 bytes | DDCLZW:    2811 bytes | Memory reduction:   92.98% | De-/Compression speedup: 0.00/0.00 times
   Size:   40000 | nUnique:    5 | Entropy:  99.99% | DDC:   40060 bytes | DDCLZW:    3463 bytes | Memory reduction:   91.36% | De-/Compression speedup: 0.00/0.00 times
   Size:   40000 | nUnique:   10 | Entropy:  99.98% | DDC:   40100 bytes | DDCLZW:    4227 bytes | Memory reduction:   89.46% | De-/Compression speedup: 0.00/0.00 times
   Size:   40000 | nUnique:   20 | Entropy:  99.95% | DDC:   40180 bytes | DDCLZW:    5319 bytes | Memory reduction:   86.76% | De-/Compression speedup: 0.00/0.00 times
   Size:   40000 | nUnique:   50 | Entropy:  99.74% | DDC:   40420 bytes | DDCLZW:    7307 bytes | Memory reduction:   81.92% | De-/Compression speedup: 0.00/0.00 times
   Size:   40000 | nUnique:  100 | Entropy:  99.09% | DDC:   40820 bytes | DDCLZW:    8927 bytes | Memory reduction:   78.13% | De-/Compression speedup: 0.00/0.00 times
   Size:   40000 | nUnique:  200 | Entropy:  96.36% | DDC:   41620 bytes | DDCLZW:    8367 bytes | Memory reduction:   79.90% | De-/Compression speedup: 0.00/0.00 times
   Size:   40000 | nUnique:  500 | Entropy:  82.16% | DDC:   84020 bytes | DDCLZW:   10767 bytes | Memory reduction:   87.19% | De-/Compression speedup: 0.00/0.00 times
   Size:   40000 | nUnique: 1000 | Entropy:  73.91% | DDC:   88020 bytes | DDCLZW:   14767 bytes | Memory reduction:   83.22% | De-/Compression speedup: 0.00/0.00 times
   ```
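
As a reading aid for the output, here is a small sketch of the parameter sweep and of the "Memory reduction" column. The formula `(1 - DDCLZW / DDC) * 100` is inferred from the reported numbers (for example, 52 and 123 bytes give -136.54%), and the rule of skipping `nUnique > size` is inferred from the rows present. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the sweep, not the benchmark code from this PR.
final class SweepSketch {

    // Sizes as stated in the comment; nUnique values as they appear in the output.
    static final int[] SIZES = {100, 1000, 10000, 40000};
    static final int[] N_UNIQUE = {2, 3, 5, 10, 20, 50, 100, 200, 500, 1000};

    /** Percentage saved relative to DDC; negative means DDCLZW is larger. */
    static double memoryReductionPercent(long ddcBytes, long ddclzwBytes) {
        return 100.0 * (1.0 - (double) ddclzwBytes / ddcBytes);
    }

    /** All (size, nUnique) pairs, skipping combinations with nUnique > size. */
    static List<int[]> parameterGrid() {
        List<int[]> grid = new ArrayList<>();
        for (int size : SIZES) {
            for (int nUnique : N_UNIQUE) {
                if (nUnique > size)
                    continue; // cannot have more distinct values than entries
                grid.add(new int[] {size, nUnique});
            }
        }
        return grid;
    }
}
```

With these arrays the grid has 37 entries, matching the 7 rows for size 100 and 10 rows for each larger size.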
   
   ## Remarks
   The main difficulty was judging which benchmarks are useful, since most of my entropy values were very high, close to the maximum.
   
   Also, `benchmarkGetIdx` doesn't make sense right now: the runtime characteristics of DDC and DDCLZW lookups don't match because DDCLZW decompresses sequentially "on the fly". The method could be swapped out trivially, so I kept it.
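
To make the mismatch concrete, here is a sketch using textbook LZW over integer ids. It is not the DDCLZW implementation from this PR, and all names are hypothetical. The point is that reading index `i` from an LZW code stream requires decoding every code before it, whereas a plain DDC mapping is a direct array lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Textbook LZW sketch, NOT the DDCLZW code from this PR.
final class LzwGetIdxSketch {

    /** Compresses ids in [0, nUnique) into a list of LZW codes. */
    static int[] compress(int[] data, int nUnique) {
        Map<List<Integer>, Integer> dict = new HashMap<>();
        for (int s = 0; s < nUnique; s++)
            dict.put(List.of(s), s);
        List<Integer> codes = new ArrayList<>();
        List<Integer> w = new ArrayList<>();
        for (int sym : data) {
            List<Integer> wc = new ArrayList<>(w);
            wc.add(sym);
            if (dict.containsKey(wc)) {
                w = wc; // keep extending the current phrase
            } else {
                codes.add(dict.get(w));
                dict.put(wc, dict.size()); // learn the new phrase
                w = new ArrayList<>(List.of(sym));
            }
        }
        if (!w.isEmpty())
            codes.add(dict.get(w));
        return codes.stream().mapToInt(Integer::intValue).toArray();
    }

    /** Returns the id at `index` by decoding sequentially from the start. */
    static int getIdx(int[] codes, int nUnique, int index) {
        if (index < 0)
            throw new IndexOutOfBoundsException("index " + index);
        List<int[]> dict = new ArrayList<>();
        for (int s = 0; s < nUnique; s++)
            dict.add(new int[] {s});
        int pos = 0; // number of ids decoded so far
        int[] prev = null;
        for (int code : codes) {
            int[] entry;
            if (code < dict.size()) {
                entry = dict.get(code);
            } else if (code == dict.size() && prev != null) {
                // Code refers to the phrase that is about to be added.
                entry = Arrays.copyOf(prev, prev.length + 1);
                entry[prev.length] = prev[0];
            } else {
                throw new IllegalArgumentException("corrupt code stream");
            }
            if (prev != null) {
                // Rebuild the dictionary entry the encoder added one step earlier.
                int[] added = Arrays.copyOf(prev, prev.length + 1);
                added[prev.length] = entry[0];
                dict.add(added);
            }
            if (index < pos + entry.length)
                return entry[index - pos];
            pos += entry.length;
            prev = entry;
        }
        throw new IndexOutOfBoundsException("index " + index);
    }
}
```

In this sketch a single lookup costs time proportional to the index, so timing it against a constant-time DDC lookup compares two different kinds of operation.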
   
   I also commented out `benchmarkSlice` since it didn't look useful.

