I decided to revisit factor analysis. Singular Value Decomposition is necessary with these tables since the old-timey matrix operations are so persnickety. But I did write a Python SVD script today and got some interesting results by approximating the number of bits required for each error term as log1p of its absolute value, normalized to the standard deviation of its column's errors, plus a fixed number of overhead bits per error. That, combined with a search for the optimal number of singular values (at a given number of bits per retained coefficient), yields a ham-handed estimate of the lossless compression ratio, sans any estimate of the Python code length.
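Roughly, the calculation looks like this (a minimal sketch, not the actual script). The 32 bits per retained coefficient and 4 overhead bits per error match the figures I give below; the 32-bit-per-cell baseline for the uncompressed table, the natural-log base of log1p, the small guard on the column standard deviations, and the helper names (expand_columns, estimate_ratio, best_rank) are assumptions/illustrations on my part:

import numpy as np

COEFF_BITS = 32      # bits per retained SVD coefficient (stated below)
ERROR_OVERHEAD = 4   # overhead bits per error residual (stated below)
BASELINE_BITS = 32   # assumed bits per raw table cell for the uncompressed size

def expand_columns(X):
    """Triple the column count with sqrt and log transforms (zeros/negatives
    would need handling in real data, e.g. log1p or an offset)."""
    return np.hstack([X, np.sqrt(X), np.log(X)])

def estimate_ratio(X, k):
    """Estimated lossless compression ratio for a rank-k SVD model of X."""
    m, n = X.shape
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    approx = (U[:, :k] * s[:k]) @ Vt[:k, :]
    E = X - approx
    col_std = E.std(axis=0) + 1e-12          # per-column residual scale
    # Rough magnitude cost per residual: log1p of |error| / column std,
    # plus a fixed overhead per residual.
    error_bits = np.log1p(np.abs(E) / col_std).sum() + ERROR_OVERHEAD * E.size
    coeff_bits = COEFF_BITS * k * (m + n + 1)  # U[:, :k], s[:k], Vt[:k, :]
    return BASELINE_BITS * X.size / (coeff_bits + error_bits)

def best_rank(X):
    """Search over the number of retained singular values."""
    return max(range(1, min(X.shape) + 1), key=lambda k: estimate_ratio(X, k))

The rank-k model stores k*(m + n + 1) coefficients (the retained columns of U, the singular values, and the rows of Vt), which is what the search over k trades off against the residual coding cost.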
As I suspected, even after dividing the compression ratio by 3 (to account for the tripled column count), the estimate benefits from the inclusion of the sqrt and log columns. Interestingly, in neither case does it pay to go beyond the top singular value, but that could be an artifact of the way I'm estimating the bit costs. Under the previously described calculation, I used 32 bits per retained SVD matrix coefficient and 4 bits of overhead per error, in addition to the bits required to encode the error's magnitude. Like I said, all pretty ham-handed at this stage.

On Tue, Aug 22, 2023 at 1:10 PM James Bowery <[email protected]> wrote:

> An approach to path analysis of the Laboratory of the Counties data I took a few years ago was to increase the column count* by a factor of 3, adding two columns for each column:
>
> 1. sqrt(column)
> 2. log(column)
>
> The point is to linearize columns that may have nonlinear characteristics. This is because the vast majority of off-the-shelf statistical techniques are based on the age-old linear-algebra factor analysis technique developed by Charles Spearman** to study IQ and its infamous "g-factor". One way to compress using factor analysis is to pick off the top eigenvalues for a system of linear equations that predicts the column values, leaving error-residual columns of smaller magnitude that can therefore be arithmetically coded in fewer bits.
>
> Another approach that I started on yesterday is to parameter distill <https://arxiv.org/abs/2205.15308> a Restricted Boltzmann Machine <https://arxiv.org/abs/2212.04692> to, in effect, memorize the unit-scaled rows in a minimum number of "features" that are the counterpart of Spearman's "factors", but which permit activation functions. This requires doing the inverse scaling on the RBM's "retrieved" unit-scaled rows to calculate the error residuals that are arithmetically coded.
>
> * Something else that's really quite necessary is to give each "count" column an associated "rate" column (e.g. dollars spent on education in each county gets a "dollars/student" or, more typically, a "dollars/population" column). This is because most statistical techniques assume normalized data, to "keep things in perspective" so to speak. These "rate" columns would have to be given their own sqrt and log columns. This expands the table a lot, but fortunately we have lots of transistors.
>
> ** It's interesting that Wikipedia doesn't have a specific "History" section for factor analysis describing its origin. Instead it has a bunch of sections on various applications of factor analysis that discuss the history of its application to each specialty. Spearman committed the crime of discovering the g-factor of intelligence, the denial of which has caused catastrophic damage to humanity, because such denial is a globally imposed religious belief that racial group differences in outcome are not significantly influenced by heritable aspects of intelligence.
>
> PS: This passage from a 2019 thesis by Yihan Gao <https://etd.ideals.illinois.edu/advisor/6UnwNMoQry8/file/65526/GAO-DISSERTATION-2019.pdf> documents the abysmal state of data table compression:
>
>> While the task of generic data compression [7, 8] has been extensively studied in the past, compression of tabular datasets has received limited attention. Generic algorithms (such as Lempel-Ziv [7]) do not exploit the relational structure of the datasets at all. Some papers provide partial solutions: Spartan [9] uses the relational structure by identifying functional dependencies between attributes; ItCompress [10] finds the clustering structure of tuples in the dataset to achieve better compression; column-based approaches [11] exploit the skewness of attribute values instead.
>
> On Sun, Aug 20, 2023 at 4:21 PM James Bowery <[email protected]> wrote:
>
>> On Sat, Aug 19, 2023 at 6:27 PM Matt Mahoney <[email protected]> wrote:
>>
>>> Most of the compression comes from using the previous column as context. Similar columns are grouped, like 1990 population followed by 2000 population, which is what makes these contexts useful. Obviously there are other things to try, like sorting rows and columns by mutual information, or predicting cells from previously coded cells and coding the difference. Stay tuned.
>>
>> Thanks Matt. Using existing software with minor tweaks to squeeze out the low-hanging statistical fruit is an important first step toward a meaningful competition.
>>
>> Ordering columns so they most closely correlate (as you are doing) is an indirect way of doing factor analysis as data compression, which goes back to at least 1973 <https://ttu-ir.tdl.org/bitstream/handle/2346/15912/31295004619267.pdf?sequence=1>, although that linked paper is a dead end since no one cited it in subsequent work.
>>
>> There is some prior art involving DARPA cyberwar forensics for causal analysis of enterprise logs, such as:
>>
>> https://youtu.be/eK-E6242K-c?t=209
>>
>> SEAL: Storage-efficient Causality Analysis on Enterprise Logs with Query-friendly Compression <https://www.usenix.org/system/files/sec21fall-fei.pdf>
>>
>> But I don't see open source for that or subsequently related work.
