On Sat, Aug 19, 2023 at 6:27 PM Matt Mahoney <[email protected]> wrote:
> Most of the compression comes from using the previous column as context.
> Similar columns are grouped, like 1990 population followed by 2000
> population, which is what makes these contexts useful. Obviously there
> are other things to try, like sorting rows and columns by mutual
> information, or predicting cells from previously coded cells and
> coding the difference. Stay tuned.

Thanks Matt. Using existing software with minor tweaks to squeeze out the low-hanging statistical fruit is an important first step toward a meaningful competition.

Ordering columns so they most closely correlate (as you are doing) is an indirect way of doing factor analysis as data compression, which goes back to 1973 at least
<https://ttu-ir.tdl.org/bitstream/handle/2346/15912/31295004619267.pdf?sequence=1>
although that linked paper is a dead end, since no one cited it in subsequent work.

There is some prior art involving DARPA cyberwar forensics for causal analysis of enterprise logs, like:

https://youtu.be/eK-E6242K-c?t=209
SEAL: Storage-efficient Causality Analysis on Enterprise Logs with Query-friendly Compression
<https://www.usenix.org/system/files/sec21fall-fei.pdf>

But I don't see open source for that or subsequently related work.

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/T30092c5d8380b42f-Mac20a76482ba8016301b6574
Delivery options: https://agi.topicbox.com/groups/agi/subscription
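
For anyone who wants to play with the "sort columns by mutual information" idea from the quoted text, here is a minimal sketch. It is not Matt's code. It assumes discrete cell values, estimates entropies from raw counts, and chains columns greedily so that each one follows the remaining column it shares the most information with. The names greedy_column_order and context_cost_bits and the toy table are hypothetical, made up for illustration. The cost figure is an idealized empirical conditional entropy; it ignores model cost, so a real compressor will do worse.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Empirical Shannon entropy in bits per symbol."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())


def mutual_information(a, b):
    """I(A;B) = H(A) + H(B) - H(A,B), estimated from counts."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


def greedy_column_order(columns):
    """Chain columns so each follows the one it is most informative about.

    `columns` maps a column name to a list of discrete cell values.
    Starting from the highest-entropy column is an arbitrary choice.
    """
    names = list(columns)
    order = [max(names, key=lambda k: entropy(columns[k]))]
    remaining = [k for k in names if k != order[0]]
    while remaining:
        prev = columns[order[-1]]
        nxt = max(remaining, key=lambda k: mutual_information(prev, columns[k]))
        order.append(nxt)
        remaining.remove(nxt)
    return order


def context_cost_bits(columns, order):
    """Idealized size: H(first) + sum of H(column | previous column), times rows."""
    n = len(columns[order[0]])
    bits = entropy(columns[order[0]]) * n
    for prev, cur in zip(order, order[1:]):
        joint = entropy(list(zip(columns[prev], columns[cur])))
        bits += (joint - entropy(columns[prev])) * n
    return bits


# Toy table (made-up values): two population columns that track each other,
# separated by an unrelated column.
toy = {
    "pop1990": [1, 1, 2, 2, 3, 3, 4, 4],
    "area":    [0, 1, 0, 1, 0, 1, 0, 1],
    "pop2000": [1, 1, 2, 2, 3, 3, 4, 4],
}

order = greedy_column_order(toy)
print(order)
print(context_cost_bits(toy, list(toy)), context_cost_bits(toy, order))
```

On this toy table the greedy pass moves pop2000 next to pop1990, and the idealized cost drops from 40 bits to 24 bits. Greedy chaining is only a heuristic; it uses just the single previous column as context and does not search over all orderings.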
