On Tue, 19 Sept 2023 at 03:25, Michael Conrad <[email protected]> wrote:
> On 9/18/23 06:14, Guillermo Rodriguez Garcia wrote:
>>> everything is compressed with gzip -7. This is the worst scenario.
>>> However, even in the worst scenario due to gzip one single bit of
>>> difference in the input generates a completely different compressed
>>> output:
>>
>> Compression (or any other deterministic manipulation of data) does not add
>> any entropy (or "unpredictability") since the processing is 100%
>> reproducible.
>> In terms of entropy the output of the function is as good (or as bad) as
>> the amount of entropy in the initial seed.
>
> Hi Michel,
>
> Even aside from that, using gzip as some sort of hash function is not
> going to be anywhere near as good as using an actual hash function, like
> sha256, sha1 or even md5.

PREMISE

Hashing functions and compression algorithms are two completely different
kinds of mathematical tools. The most obvious differences are:

1. a hash produces an output whose size is fixed whatever the size of the
   input, while the compression output size varies when the input size
   changes;

2. compression algorithms (f) have a counterpart (f⁻¹) that reverses the
   process, while hashes have not;

3. because of point #2, a compression algorithm is a bijective function: the
   same input gives the same output and the same output brings back the same
   input;

4. hashing functions, instead, are not injective: the same input gives the
   same output, but the same output can come from different inputs
   (collisions).

Unless the system is equipped with a dedicated hardware component that
produces a constant flow of entropy (white noise, preferably), the main
problem is creating it from a few reasonably good random inputs. As you can
imagine, we can start a debate about the definition of "reasonably good
random inputs" or "entropy". Or, at the opposite, we can accept that those
definitions - restricted to our specific sector - simply mean unpredictable
data: unpredictable by an attacker or, even better, by the root admin of the
system.

Nanosecond time granularity cannot be predicted by an attacker, and even a
system administrator would have a really hard time doing that without
sophisticated external hardware instruments. Unfortunately, not all systems
are able to provide nanosecond timing, and the first main reason for this
lack is the clock frequency: to have nanosecond granularity (10⁻⁹ s) it is
necessary to have a GHz (10⁹ Hz) clock.

MD5SUM, GZIP AND THE WHITE NOISE

A relatively weak hash like md5sum is way better at creating an
unpredictable stream of data than any compression function. Ok, let's see
it:

+:git-restore:recovery2:yamui> echo | md5sum
68b329da9893e34099c7d8ad5cb9c940  -

As you can see, I have a 1/16 chance of guessing the right next char in the
md5sum output. At this point you notice that I am unfair, because I used the
output of the md5sum command line (a textual, human-friendly representation)
instead of the md5sum() binary output stream. Obviously, you are right.
Hence, let me point out that you did the same when considering gzip: you
took the whole stream, which also contains the information needed to
decompress that stream of data.

Modifying the function so that the decompression information is not sent to
the output, the output cannot be reversed anymore: (f⁻¹) does not exist
anymore. Thanks to this trick we get a sort of variable-length hashing
function. It is a very bad hash function, because a fixed output size is a
great feature. A great feature for the primary purpose hash functions are
currently used for, that is; not so important when the goal is generating
entropy.
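Just to make the "strip the decompression information" idea concrete, a
rough sketch follows. It only drops the gzip container metadata, the fixed
10-byte header in front and the 8-byte CRC32/size trailer at the end; the
Huffman tables of the deflate stream itself stay embedded in the data, so
take it as a sketch of the idea rather than a complete separation:

# drop the gzip container metadata so the stream can no longer be fed
# straight back to gunzip; no optional header fields are present when
# compressing from a pipe, and head -c -8 is a GNU extension
echo "some seed material" | gzip -7c | tail -c +11 | head -c -8 > headerless.bin

Run twice on the same input it still produces the same bytes, of course; the
point is only that the result is no longer self-describing.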
Back to the fixed-size output: if we have 8 bits of entropy, a hash can
provide us with 512 bits of data stream - white noise - but the number of
distinct 512-bit outputs it can give us remains 256. In other words:
O(8-bit-entropy) = 256 = O(sha512sum(8-bit-entropy)). This is what Guillermo
proposed as an "entropy conservation principle".

The entropy of a closed physical system never decreases over time, and in
practice it increases: this is a currently accepted principle of physics.
Guillermo confused information with entropy, and the principle that
information is an immutable constant is not yet established in physics. I
suggest abandoning this kind of consideration and remaining confined to our
specific sector.

In our specific sector, spectral analysis cannot refute this claim: once all
the data specifically tailored for decompression is removed, the compressed
data stream cannot be told apart from white noise in any meaningful way.
Well, it is not 100% true. The gzip -1 output can be discriminated from the
gzip -9 output by this kind of analysis with a certain degree of confidence.
However, if we do the same with a real-world entropy generator - say, one
based on thermal effects - we notice a slight pink-noise effect, a
tiny-tiny-tiny-tiny predominance of the low frequencies. I used the word
tiny 4 times because the black-body law has a 4 as an exponent, but please
remember that correlation is not causation... ;-)

At this point you may argue that the whole gzip output is made to be
reversible and that it is totally arbitrary to make a distinction between
the compressed data and the decompression information. Let me be more
specific: in most cases it is enough to discard the de/compression symbol
tables to earn the white-noise badge. White noise is our rabbit to follow
down the hole into Alice's Wonderland.

CREATING ENTROPY AT OUR WILL

Well, I cannot grant that our will is enough, but having a good attitude
towards the idea that entropy can be multiplied is a starting point, at
least. Hence, let's start with 8 bits of entropy (unpredictable data). A way
to increase this entropy such that O(input) < O(output) cannot be a
deterministic function, Guillermo wrote. Guillermo is wrong; not totally
wrong, tiny-tiny-tiny-tiny wrong. But tiny⁴ is enough to trigger chaos, and
by leveraging chaos theory we can multiply the entropy. Therefore, some
deterministic functions running on some almost deterministic systems can
multiply entropy.

Before the practice, a bit of bare-simple theory - an example rather than a
theory. The best-selling book in the world is "Don Quixote", Wikipedia
states [1]. Someone can argue that the best-selling book in the world is the
Bible - fair enough, use it if you wish. The main point is having a certain
amount of data that is available and known to everyone in the easiest way.
In public-key cryptography, it would be the public key.

+:git-restore:recovery2:yamui> wget https://www.gutenberg.org/files/996/996-0.txt -qO - | wc -c
2391549
+:git-restore:recovery2:yamui> wget https://www.gutenberg.org/files/996/996-0.txt -qO - | pigz -4c | dd bs=1 of=/dev/null
952365+0 records in
+:git-restore:recovery2:yamui> { wget https://www.gutenberg.org/files/996/996-0.txt -qO -; echo Hello World\!; } | pigz -4c | dd bs=1 of=/dev/null
952375+0 records in

The ASCII version of "Don Quixote" is 2391549 bytes while its compressed
version is 952365 bytes.
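One note before going on: the nstime helper used in the timing examples
below is not shown here; something along these lines would do the job (a
sketch, assuming a shell with 64-bit arithmetic such as bash):

nstime() {
    # plain call: print the current wall-clock time in nanoseconds;
    # "nstime -": print the seconds elapsed since $LAST_NS_TIME as +S.NNNNNNNNN
    local now=$(date +%s%N)
    if [ "$1" = "-" ]; then
        local d=$((now - LAST_NS_TIME))
        printf '+%d.%09d\n' $((d / 1000000000)) $((d % 1000000000))
    else
        echo "$now"
    fi
}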
We should also account for removing the de/compression tables from those
952365 bytes, but for the sake of simplicity we are going to use these
numbers in the example; implementations might vary. We can also notice that
adding a couple of words barely changes the compressed data size.

Now, we can start:

1. we have one byte of entropy;

2. we have public data about 2.3 million times larger than our entropy
   (2.3 MB against 1 byte);

3. compressing that data generates a data stream about one million times
   bigger than our initial entropy (roughly 0.95 MB against 1 byte).

How can we leverage this to multiply our entropy, creating a deterministic
function running on an almost deterministic system? Where "almost" means
100% deterministic minus tiny⁴.

The number of characters is 2391549 = 0x247DFD, which fits in a 24-bit
integer. We can read 3 bytes from /dev/urandom into a variable X, and
X % 2391549 gives us a position. You may argue that this is unfair because
1. /dev/urandom is exactly the thing we are supposed to provide and 2. in
its current state it is not yet seeded properly, therefore it is
predictable.

Ok, we have 8 bits of randomness, time to start using them: randval % 4
gives us 2 bits and the remaining 6 bits go into r. With r we can read the
sort({r, 63-r, 0x3F xor r}) occurrences from /dev/urandom and decide their
order using the 2 bits: bit0 chooses normal or reversed order, bit1 decides
whether the middle one goes in front or stays in the middle. You insist that
/dev/urandom cannot be used at all? Ok: sha512sum(randval) = 64 bytes,
enough to play the trick and select the 3 bytes.

Guillermo is getting mad and screaming that the entropy has remained the
same, that I just played with bits. He is right, and I am going to play with
the bits even more. Let me show you.

With a 24-bit integer, name it N, I can choose a position in the Don Quixote
ASCII text by doing N % sizeof(text). In that position, I put my 8 bits of
entropy. The compressed data will differ from the compression of the
original text only after the compression function has met the changed char,
hence we consider the data after that point: that point is a cut-point. This
brings a great variance in the size and time of the compression. The size is
not predictable anymore, because I used the entropy to decide where to cut
the text and insert the change, and because I inserted my 8 bits of entropy
I can also have 256 different outputs. The size and the time still keep some
correlation with the cut-point, but their least significant digits do not.

export LAST_NS_TIME=$(nstime); dd if=996-0.txt bs=1 skip=$(($RANDOM$RANDOM$RANDOM$RANDOM % 2391549)) status=none | pigz -4c >output; du -b output; nstime -
293575  output
+1.415828601

Now I have some random noise (about 290 KB) and at least 8 new bits of true
entropy (much more probably 16) that I can use for the next round, because
the 75 at the end of the size and the 601 at the end of the time are not
predictable at all.

You might argue that even after removing the de/compression tables, a
compressed data stream is not white noise. If it were not, then some byte
values would be much more frequent than others, and this would mean that the
compression algorithm is not compressing very well (gzip -1 instead of
gzip -9, say) or it is flawed (and then we are going to discard it or fix
it). I got white noise and I got entropy.
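Putting the cut-point trick in one piece, a rough sketch (nothing more than
a sketch; 996-0.txt is the Project Gutenberg text used above and the cut
position is derived from the entropy byte via sha512sum, as suggested
earlier):

# one byte of entropy from the hardware pool
rnd=$(dd if=/dev/random bs=1 count=1 2>/dev/null | hexdump -ve '1/1 "%d"')
# 24-bit cut position derived from that same byte (6 hex chars of its digest)
pos=$(( 0x$(printf '%s' "$rnd" | sha512sum | cut -c1-6) % 2391549 ))
{
    printf '%b' "\\0$(printf '%o' "$rnd")"      # splice the entropy byte in
    dd if=996-0.txt bs=1 skip=$pos status=none  # the text from the cut point on
} | pigz -4c > output
du -b output    # size and timing now depend on both the byte and the cut point

The least significant digits of that size and of the elapsed time are the
fresh bits to harvest for the next round.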
TINY⁴ VARIATIONS AND CHAOS

Up to here we played with unpredictable tiny⁴ variations to create white
noise and a few bits of entropy, with a large amount of work and some time.
Now it is time to trigger the chaos and provide ourselves with a large
amount of entropy, which means elevating the white noise to entropy status.
How? By leveraging tiny⁴ variations to create totally unpredictable outputs.

export LAST_NS_TIME=$(nstime); dd if=996-0.txt bs=1 skip=$(($RANDOM$RANDOM$RANDOM$RANDOM % 2391549)) status=progress 2>&1 | pigz -4c >output; du -b output; nstime -
939672  output
+5.748793786

export LAST_NS_TIME=$(nstime); dd if=996-0.txt bs=1 skip=$(($RANDOM$RANDOM$RANDOM$RANDOM % 2391549)) status=progress 2>&1 | pigz -4c >output; du -b output; nstime -
97220   output
+1.507189217

With two different runs and about 6 seconds, about 1 MB of entropy has been
created. Notice that the stderr of the dd progress output is mixed with the
stdout that goes into the compression input. Do not trust me: do your own
tests and see with your own eyes that the way standard error and standard
output mix up is not totally predictable. In general, the output of
something like this

( thread_one >&1 & thread_two >&2 & ) 2>&1

is not predictable, especially on an SMP system. Imagine that thread_one and
thread_two are just reading the first and the second halves of Don Quixote
while a third thread is compressing the random mix-up of these two streams.
To spread the butter on the bread even better, we can add a usleep in
between, using our 8 bits of initial randomness:

( thread_one >&1 & usleep $randval; thread_two >&2 & ) 2>&1

Obviously, the throughput should be slow enough to make the mixture
fine-grained, and bs=1 can give us a reasonable hope about that.

IMPLEMENTATION

In this function three fluxes of data mix with each other in an
unpredictable way: the stderr from multiple sources, the stdout, and the
set -x trace.

hwget_random_byte_value() {
    local dev=/dev/random
    test -e /dev/.random && dev=/dev/.random
    dd if=$dev bs=1 count=1 | hexdump -ve '1/1 "%d\n"'
} #endfunc

urandom_initseed() {
    local n m hcmd u=/dev/urandom bbcmd=/usr/bin/busybox-static
    # n and m must be known before the pipeline below is expanded, otherwise
    # the pigz level and the dd block sizes silently fall back to defaults
    n=$(hwget_random_byte_value)
    m=$(date +%N)   # nanoseconds, may carry leading zeros, hence 10#$m below
    # pick one of the hash applets using the nanoseconds value
    hcmd=$(echo "md5:sha1:sha3:sha256:sha512" | cut -d: -f$((1 + 10#$m % 5)))
    {
        set -x
        $bbcmd dd if=$u bs=$((n+32)) count=1
        printf "$m"
        # a raw copy of the /proc data goes to stderr, its digest to stdout
        (cd /proc; cat *stat softirqs | tee /dev/stderr | eval ${hcmd}sum) | $bbcmd dd
        set +x
    } 2>&1 | pigz -$((1 + n%9))c | dd bs=$((5 + 10#$m % 64)) skip=1 >$u 2>/dev/null
    # pull some data back out so the new seed propagates into the pool state
    dd if=$u bs=$((1024 + n + 10#$m % 256)) count=1 >/dev/null 2>&1
} #endfunc

Once the de/compression table is removed from the compressed output, we have
created an entropy generator with a white-noise spectral output. On about
2 KB of data the table is quite small and sits at the beginning of the data
stream, therefore skipping something between [5,68] bytes (the dd skip=1
above) can be enough. However, improvements are always possible, and
mistakes too.
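A crude way to keep an eye on the white-noise claim (just a byte-value
histogram, not a real spectral analysis; it assumes the output file produced
by the examples above is still around):

# byte-value histogram of the compressed stream, first 68 bytes skipped to
# stay on the safe side of the [5,68] range; a roughly flat histogram is
# necessary, though not sufficient, for a white-noise profile
dd if=output bs=1 skip=68 status=none | hexdump -ve '1/1 "%u\n"' | \
    sort -n | uniq -c | sort -rn | sed -n '1p;$p'

If one byte value sticks out by a large margin, either the compression level
is too low or something is flawed, as said above.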
CONCLUSION

Our computer systems are not totally deterministic, especially those working
at high clock frequencies (1 GHz and above) and with multiple cores (SMP).
These systems are almost deterministic, with tiny⁴ variations. Because these
systems are digital, these variations never show up in the numbers they
compute; otherwise ECC memory and CPU checks would trap them as errors. So,
although they provide deterministic calculations, their functioning is not
absolutely deterministic, especially in the timings of concurrent processes
on SMP. This can be leveraged to create large amounts of white noise, or to
trigger chaos and produce entropy (an unpredictable sequence of numbers)
with a white-noise spectral profile (the best kind) in relatively large
amounts.

NOTE

[1] https://en.wikipedia.org/wiki/List_of_best-selling_books

_______________________________________________
busybox mailing list
[email protected]
http://lists.busybox.net/mailman/listinfo/busybox
