florian-jobs commented on PR #2398: URL: https://github.com/apache/systemds/pull/2398#issuecomment-3824021872
> Okay, cool progress on the results!
>
> However, I'm a bit skeptical about your byte estimates for the sizes. Do you do extra packing based on the number of bits in your implementation?
>
> The ideal values for the current DDC implementation are 2, 256, and 65,536 unique values to avoid bit manipulations on lookup (see `AMapToData` specializations). Please explicitly compare against these cases and double-check your memory calculations.
>
> I'd love to see some results with your idealized input to get a range of what to expect vs. what you get.
>
> A recipe for X unique values at length L could be:
>
> 1. Use all X unique values once in sequence
>    (e.g., for X=4: `1,2,3,4`)
> 2. Double repeatedly until you reach length L
>
>    * Round 1: `1,2,3,4` → `1,2,3,4,1,2,3,4` (length 8)
>    * Round 2: → `1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4` (length 16)
>    * Round 3: → length 32
>    * ...and so on
>
> I don't know if it's exactly optimal, but it should be pretty good.

Good question! At the moment the codes are still stored as `int` values by the LZW logic, but I’m in the process of changing the storage representation. Instead of storing one code per array element, I’m implementing a bit-packed `long` word stream, where codes are packed based on a fixed bit width (derived from the maximum emitted code), with the option to extend this to a growing bit-width policy later if needed.
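To make the two ideas in this thread concrete, here is a minimal sketch of the doubling recipe and of fixed-width packing into 64-bit words. It is written in Python for brevity and is not the SystemDS Java code. The function names (`idealized_input`, `bit_width`, `pack`, `unpack_at`) are hypothetical, and two details are assumptions: the sequence is truncated when L is not X times a power of two, and codes are laid out least-significant-bit first.

```python
MASK64 = (1 << 64) - 1  # emulate a 64-bit long word


def idealized_input(x, length):
    """Doubling recipe: 1..x once, then doubled until at least `length` long.

    Truncating to exactly `length` is an assumption; the recipe only
    describes lengths of the form x * 2^k.
    """
    seq = list(range(1, x + 1))
    while len(seq) < length:
        seq = seq + seq
    return seq[:length]


def bit_width(max_code):
    """Fixed bit width derived from the largest code (at least 1 bit)."""
    return max(1, max_code.bit_length())


def pack(codes, bits):
    """Pack non-negative codes (< 2**bits) into a list of 64-bit words."""
    words = [0] * ((len(codes) * bits + 63) // 64)
    for i, code in enumerate(codes):
        w, off = divmod(i * bits, 64)
        words[w] |= (code << off) & MASK64
        spill = off + bits - 64  # bits that do not fit in word w
        if spill > 0:
            words[w + 1] |= code >> (bits - spill)
    return words


def unpack_at(words, bits, i):
    """Read the i-th code back from the packed word list."""
    w, off = divmod(i * bits, 64)
    val = words[w] >> off
    spill = off + bits - 64
    if spill > 0:
        val |= words[w + 1] << (bits - spill)
    return val & ((1 << bits) - 1)


if __name__ == "__main__":
    # Stand-in codes: the idealized sequence itself, not real LZW output.
    codes = idealized_input(4, 32)
    bits = bit_width(max(codes))
    words = pack(codes, bits)
    assert all(unpack_at(words, bits, i) == c for i, c in enumerate(codes))
    print("int storage:", len(codes) * 4, "bytes")
    print("packed storage:", len(words) * 8, "bytes at", bits, "bits/code")
```

Comparing the two printed sizes for X = 2, 256, and 65,536 gives a rough range to check the memory estimates against. Payload bytes alone ignore object headers and dictionary costs, so they are only a lower bound.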
