An approach to path analysis of the Laboratory of the Counties data that I took a few years ago was to increase the column count* by a factor of 3, adding two derived columns for each existing column:
1. sqrt(column)
2. log(column)

The point is to linearize columns that may have nonlinear characteristics (a rough sketch of this expansion follows below). This matters because the vast majority of off-the-shelf statistical techniques are based on the age-old linear-algebra factor analysis technique developed by Charles Spearman** to study IQ and its infamous "g-factor". One way to compress using factor analysis is to pick off the top eigenvalues of a system of linear equations that predicts the column values, leaving error-residual columns of smaller magnitude that can therefore be arithmetically coded in fewer bits.

Another approach, which I started on yesterday, is to parameter-distill <https://arxiv.org/abs/2205.15308> a Restricted Boltzmann Machine <https://arxiv.org/abs/2212.04692> to, in effect, memorize the unit-scaled rows in a minimum number of "features" that are the counterpart of Spearman's "factors" but which permit activation functions. This requires applying the inverse scaling to the RBM's "retrieved" unit-scaled rows to calculate the error residuals that are then arithmetically coded.
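To make the column expansion concrete, here is a minimal numpy sketch. It also folds in the per-capita "rate" columns described in the first footnote below; the use of population as the denominator, the epsilon offset to keep log(0) finite, and the toy numbers are just illustrative assumptions on my part, not part of the actual Laboratory of the Counties pipeline.

    import numpy as np

    def expand_columns(X, population, eps=1e-9):
        """For each raw count column and its per-capita 'rate' version, emit
        the values plus their sqrt and log (roughly a 6x expansion overall:
        3x from the transforms, doubled by the rate columns)."""
        blocks = []
        for j in range(X.shape[1]):
            col = X[:, j].astype(float)
            rate = col / population               # e.g. dollars / population
            for v in (col, rate):
                blocks.extend([v, np.sqrt(v), np.log(v + eps)])
        return np.column_stack(blocks)

    # 3 counties, 2 count columns -> 2 columns * 2 (count, rate) * 3 = 12 columns
    X = np.array([[1_000_000, 250], [5_000_000, 900], [750_000, 60]])
    pop = np.array([10_000, 42_000, 3_500])
    print(expand_columns(X, pop).shape)           # (3, 12)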
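For the factor-analysis route, here is a rough sketch of what "pick off the top eigenvalues and arithmetically code the residuals" might look like, using plain SVD/PCA on the standardized columns as a stand-in for classical factor analysis. The choice of k and the standardization details are my assumptions, and the arithmetic coder itself isn't shown.

    import numpy as np

    def factor_compress(X, k):
        """Keep the top-k linear factors of the unit-scaled columns and return
        the pieces a coder would store: a small factor model plus residuals of
        much smaller magnitude than the original values."""
        mu, sd = X.mean(axis=0), X.std(axis=0) + 1e-12
        Z = (X - mu) / sd                         # unit-scale each column
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        scores = U[:, :k] * s[:k]                 # per-row factor scores
        loadings = Vt[:k]                         # per-column factor loadings
        residual = Z - scores @ loadings          # what gets arithmetically coded
        return (mu, sd, scores, loadings), residual

    # Decompression: Z_hat = scores @ loadings + residual, then Z_hat * sd + mu.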
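And for the RBM route, a sketch of the encode / retrieve / inverse-scale / residual loop, using scikit-learn's BernoulliRBM. The parameter-distillation step from the first linked paper is omitted here, and the min-max unit scaling, feature count, and learning rate are placeholder choices rather than anything settled.

    import numpy as np
    from sklearn.neural_network import BernoulliRBM

    def rbm_residuals(X, n_features=32, n_iter=50):
        """Fit a small RBM on unit-scaled rows, 'retrieve' each row from its
        hidden feature activations, undo the scaling, and return the error
        residuals that would then be arithmetically coded."""
        lo, hi = X.min(axis=0), X.max(axis=0)
        V = (X - lo) / (hi - lo + 1e-12)          # unit-scaled rows in [0, 1]

        rbm = BernoulliRBM(n_components=n_features, learning_rate=0.05,
                           n_iter=n_iter, random_state=0).fit(V)

        H = rbm.transform(V)                      # hidden "feature" activations
        # Down-pass: sigmoid over the visible units reconstructs the scaled rows.
        V_hat = 1.0 / (1.0 + np.exp(-(H @ rbm.components_ + rbm.intercept_visible_)))

        X_hat = V_hat * (hi - lo) + lo            # inverse scaling back to raw units
        return X - X_hat                          # residuals to arithmetic-code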
* Something else that's really quite necessary is to give each "count" column an associated "rate" column (e.g., dollars spent on education in each county gets "dollars/student" or, more typically, something like "dollars/population"). This is because most statistical techniques assume normalized data, to "keep things in perspective" so to speak. These "rate" columns would then have to be given their own sqrt and log columns. This expands the table a lot, but fortunately we have lots of transistors.

** It's interesting that Wikipedia doesn't have a specific "History" section for factor analysis describing its origin. Instead it has a bunch of sections on various applications of factor analysis that discuss the history of its application to each specialty. Spearman committed the crime of discovering the g-factor of intelligence, the denial of which has caused catastrophic damage to humanity, because such denial is a globally imposed religious belief that racial group differences in outcome are not significantly influenced by heritable aspects of intelligence.

PS: This passage from a 2019 thesis by Yihan Gao <https://etd.ideals.illinois.edu/advisor/6UnwNMoQry8/file/65526/GAO-DISSERTATION-2019.pdf> documents the abysmal state of data table compression:

> While the task of generic data compression [7, 8] has been extensively
> studied in the past, compression of tabular datasets has received limited
> attention. Generic algorithms (such as Lempel-Ziv [7]) do not exploit the
> relational structure of the datasets at all. Some paper provide partial
> solutions: Spartan [9] uses the relational structure by identifying
> functional dependencies between attributes; ItCompress [10] finds the
> clustering structure of tuples in the dataset to achieve better
> compression; column-based approaches [11] exploit the skewness of attribute
> values instead.

On Sun, Aug 20, 2023 at 4:21 PM James Bowery <[email protected]> wrote:
>
> On Sat, Aug 19, 2023 at 6:27 PM Matt Mahoney <[email protected]> wrote:
>> Most of the compression comes from using the previous column as context.
>> Similar columns are grouped, like 1990 population followed by 2000
>> population, which is what makes these contexts useful. Obviously there
>> are other things to try, like sorting rows and columns by mutual
>> information, or predicting cells from previously coded cells and
>> coding the difference. Stay tuned.
>
> Thanks Matt. Using existing software with minor tweaks to squeeze out the
> low-hanging statistical fruit is an important first step toward a
> meaningful competition.
>
> Ordering columns so they most closely correlate (as you are doing) is an
> indirect way of doing factor analysis as data compression, which goes back
> to at least 1973
> <https://ttu-ir.tdl.org/bitstream/handle/2346/15912/31295004619267.pdf?sequence=1>,
> although that linked paper is a dead end since no one cited it in
> subsequent work.
>
> There is some prior art involving DARPA cyberwar forensics for causal
> analysis of enterprise logs, like:
>
> https://youtu.be/eK-E6242K-c?t=209
>
> SEAL: Storage-efficient Causality Analysis on Enterprise Logs with
> Query-friendly Compression
> <https://www.usenix.org/system/files/sec21fall-fei.pdf>
>
> But I don't see open source for that or subsequently related work.

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/T30092c5d8380b42f-M75e0fb1bc3ae57b5db5f9101
Delivery options: https://agi.topicbox.com/groups/agi/subscription
