Baunsgaard commented on PR #2398:
URL: https://github.com/apache/systemds/pull/2398#issuecomment-3781620971
@LukaDeka
Good to see some numbers. However, the ones you have reported are a bit
unfortunate. I have a few points you should consider:
1. Random data is not very compressible; in fact, truly random data would tend
to make DDC superior for your use case. What you want to control is the entropy
of your data: if the entropy is low, you should see more benefit from LZW; if
it is high, your compression ratio should tend towards DDC's.
2. As an additional experiment, you can generate data that has exploitable
patterns specific to LZW. Try to generate some data that is in the "best"
possible structure for LZW. This should ideally show compressed size scaling
close to O(sqrt(n)) in the input size with standard LZW, while DDC, being a
dense format, always scales as O(n).
3. Do not worry about input data that is smaller than 100 elements for these
experiments. For instance, experiments with 1 input row trivially show that
other encodings can perform better than DDC. It starts getting interesting at
larger sizes.
4. Control and explicitly mention the number of distinct items you have as a
parameter for your experiment. Additionally, calculate the entropy and use that
as an additional measure of compressibility of the data. These two changes will
improve the experiments.
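To make points 1, 2, and 4 concrete, here is a small self-contained sketch (plain Python, independent of the SystemDS code in this PR; `lzw_encode` is a minimal textbook LZW over symbol sequences, not the PR's implementation). It computes Shannon entropy as the compressibility measure and illustrates the best case for LZW: an all-identical sequence of length n compresses to on the order of sqrt(2n) codes, while high-entropy data produces a code count that stays proportional to n:

```python
import math
import random
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy of a symbol sequence, in bits per symbol."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def lzw_encode(values):
    """Minimal textbook LZW over a sequence of hashable symbols.

    Returns the list of emitted dictionary codes; its length is a
    proxy for compressed size.
    """
    # Initialize the dictionary with all single-symbol phrases.
    dictionary = {(v,): i for i, v in enumerate(sorted(set(values)))}
    next_code = len(dictionary)
    phrase = ()
    out = []
    for v in values:
        candidate = phrase + (v,)
        if candidate in dictionary:
            phrase = candidate          # extend the current phrase
        else:
            out.append(dictionary[phrase])
            dictionary[candidate] = next_code  # learn the new phrase
            next_code += 1
            phrase = (v,)
    if phrase:
        out.append(dictionary[phrase])  # flush the last phrase
    return out

random.seed(42)
n, distinct = 100_000, 8
low = [0] * n                                   # entropy 0: LZW's best case
high = [random.randrange(distinct) for _ in range(n)]  # entropy close to log2(8)

for name, data in [("low-entropy", low), ("high-entropy", high)]:
    codes = lzw_encode(data)
    print(f"{name}: H = {shannon_entropy(data):.3f} bits/symbol, "
          f"{len(codes)} LZW codes for {n} values")
```

On the all-zeros input, LZW emits phrases of length 1, 2, 3, ..., so roughly sqrt(2n) codes (a few hundred for n = 100,000), while the high-entropy input needs tens of thousands. Sweeping n, the number of distinct items, and the entropy, and reporting all three alongside the compression ratios, would make the experiment section much stronger.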
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]