Baunsgaard commented on PR #2398:
URL: https://github.com/apache/systemds/pull/2398#issuecomment-3781620971

   @LukaDeka  
   Good to see some numbers. However, the ones you have reported are a bit 
unfortunate. I have a few points you should consider:
   
   1. Random data is not very compressible; in fact, truly random data tends to 
favor DDC for your use case. What you want is to control the entropy of your 
data: if the entropy is low, LZW should give a clear benefit; if it is high, 
the compression ratio should tend towards DDC's.
   
   2. As an additional experiment, generate data that has exploitable patterns 
specific to LZW. Try to generate some data in the "best" possible structure 
for it. The compressed size should then scale close to $O(\sqrt{n})$ in the 
input size with standard LZW, while DDC, being a dense format, is always 
$O(n)$.
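   To make the scaling claim concrete, here is a minimal sketch (a textbook 
byte-level LZW encoder written for illustration, not SystemDS's 
implementation): on a maximally repetitive input, the learned phrases grow by 
one symbol per emitted code, so the code count grows roughly as the square 
root of the input length, while random input stays near-linear.

```python
import math
import random

def lzw_encode(data: bytes) -> list[int]:
    # Textbook LZW: start with all single-byte phrases in the dictionary.
    table = {bytes([i]): i for i in range(256)}
    w, out = b"", []
    for b in data:
        wc = w + bytes([b])
        if wc in table:
            w = wc
        else:
            out.append(table[w])
            table[wc] = len(table)  # learn the new, one-byte-longer phrase
            w = bytes([b])
    if w:
        out.append(table[w])
    return out

n = 100_000
best = b"a" * n  # "best" structure: one long run of a single symbol
random.seed(0)
worst = bytes(random.randrange(256) for _ in range(n))

# On the constant run, emitted phrase lengths go 1, 2, 3, ..., so the
# number of codes is about sqrt(2n); random input learns little.
print(len(lzw_encode(best)), math.isqrt(2 * n), len(lzw_encode(worst)))
```

   The same sweep, repeated over increasing `n`, is one way to produce the 
sqrt-versus-linear curve for the experiment.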
   
   3. Do not worry about input data that is smaller than 100 elements for these 
experiments. For instance, experiments with 1 input row trivially show that 
other encodings can perform better than DDC. It starts getting interesting at 
larger sizes.
   
   4. Control and explicitly mention the number of distinct items you have as a 
parameter for your experiment. Additionally, calculate the entropy and use that 
as an additional measure of compressibility of the data. These two changes will 
improve the experiments.
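   As a sketch of how the entropy measure could be computed over generated 
data (plain empirical Shannon entropy over value frequencies; the generator 
parameters below are illustrative, not tied to this PR's experiment setup):

```python
import math
import random
from collections import Counter

def shannon_entropy(values) -> float:
    """Empirical Shannon entropy in bits per element."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

random.seed(42)
k = 100_000
# Low entropy: few distinct values, heavily skewed -> should favor LZW.
low = random.choices(range(4), weights=[97, 1, 1, 1], k=k)
# High entropy: many distinct values, uniform -> should tend towards DDC.
high = [random.randrange(256) for _ in range(k)]

print(shannon_entropy(low), shannon_entropy(high))  # ~0.24 vs ~8.0 bits
```

   Reporting both the distinct count and the measured entropy alongside each 
compression ratio makes the two regimes directly comparable.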


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
