On Tue, 19 Sept 2023 at 03:25, Michael Conrad <[email protected]>
wrote:

> On 9/18/23 06:14, Guillermo Rodriguez Garcia wrote:
>
> everything is compressed with gzip -7. This is the worst scenario.
>> However, even in the worst scenario due to gzip one single bit of
>> difference in the input generates a completely different compressed
>> output:
>>
>
> Compression (or any other deterministic manipulation of data) does not add
> any entropy (or "unpredictability") since the processing is 100%
> reproducible.
> In terms of entropy the output of the function is as good (or as bad) as
> the amount of entropy in the initial seed.
>
Hi Michael,

> Even aside from that, using gzip as some sort of hash function is not
> going to be anywhere near as good as using an actual hash function, like
> sha256, sha1 or even md5.
>

PREMISE

Hashing functions and compression algorithms are two completely different
kinds of mathematical tools. The most obvious differences are:

1. a hash produces an output of fixed size whatever the size of the input,
while the size of a compressed output varies with the input (see the quick
demonstration after this list)
2. compression algorithms (f) have a counterpart (f⁻¹) that reverses the
process, while hash functions do not
3. because of point #2, a compression algorithm is a bijective function:
the same input gives the same output and the same output leads back to the
same input
4. hashing functions are not injective: the same input always gives the
same output, but different inputs can produce the same output
(collisions).
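
A quick demonstration of point #1, just as a sketch (inputs and sizes are
arbitrary examples): the hash output length never changes, while the
compressed output size tracks the input.

printf 'a' | md5sum | awk '{print length($1)}'                   # 32 hex chars
head -c 100000 /dev/urandom | md5sum | awk '{print length($1)}'  # still 32
printf 'a' | gzip -7c | wc -c                    # a couple dozen bytes
head -c 100000 /dev/urandom | gzip -7c | wc -c   # roughly the input size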

Unless a hardware system is provided with a specific hardware component
that produces constant entropy (white noise, preferably), the main problem
is to create it from a few reasonably good random inputs.

As you can imagine, we could start a debate about the definition of
"reasonably good random inputs" or "entropy". Or, at the opposite extreme,
we can accept that those definitions - restricted to our specific sector -
simply mean unpredictable data: unpredictable by an attacker or, even
better, by the root admin of the system. Nanosecond time granularity cannot
be predicted by an attacker, and even a system administrator would have a
really hard time doing so without sophisticated external hardware
instruments. Unfortunately, not all systems are able to provide nanosecond
timing, and the first reason for this is the clock frequency: to have
nanosecond granularity (10⁻⁹ s) it is necessary to have a GHz (10⁹ Hz)
clock.
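
A minimal sketch of how to check what granularity a given system really
offers (assuming date supports %N and a readable Linux /proc):

a=$(date +%s%N); b=$(date +%s%N)
echo "two back-to-back reads differ by $((b - a)) ns"
# the kernel's own idea of the clock resolution, when readable:
grep -m1 resolution /proc/timer_list 2>/dev/null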


MD5SUM, GZIP AND THE WHITE NOISE

A relatively weak hash like MD5 is far better at creating an unpredictable
stream of data than any compression function. OK, let's see it:

+:git-restore:recovery2:yamui> echo | md5sum
68b329da9893e34099c7d8ad5cb9c940  -

As you can see, I have a 1/16 chance of guessing the next character in the
md5sum output. At this point you may notice that I am being unfair, because
I used the output of the md5sum command line (a textual, human-friendly
representation) instead of the binary md5 output stream. Obviously, you are
right. But then notice that you did the same with gzip: you took the whole
stream, which also contains the information needed to decompress that
stream of data. If we modify the function so that the decompression
information is not sent to the output, the output can no longer be
reversed: f⁻¹ no longer exists.

Thanks to this trick we can have a sort of variable-length hashing
function. It is a very bad hash function, because the fixed size of the
output is a great feature - a great feature for the primary purpose for
which hash functions are currently used, but not so great for generating
entropy. In fact, if we have 8 bits of entropy, a hash can provide us with
a 512-bit data stream - white noise - but the number of distinct 512-bit
outputs it can produce remains 256. In other words: O(8-bit entropy) = 256
= O(sha512sum(8-bit entropy)).
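
A sanity check of this counting argument, as a sketch: hashing every
possible 8-bit input produces at most 256 distinct 512-bit digests,
however white each single digest may look.

for i in $(seq 0 255); do printf '%02x' "$i" | sha512sum; done \
    | sort -u | wc -l        # prints 256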

This is what Guillermo proposed as an "entropy conservation principle".
The entropy of a closed physical system never remains constant over time
but always increases, and this is a currently accepted principle of
physics. Guillermo confused information with entropy. However, the
principle that information is an immutable constant is not yet established
in physics. I suggest abandoning this kind of consideration and remaining
confined to our specific sector.

In our specific sector, spectral analysis cannot refute this claim: once
all the data specifically tailored for decompression is removed, the
compressed data stream cannot be statistically distinguished from white
noise in any significant way. Well, it is not 100% true: gzip -1 output can
be discriminated from gzip -9 output by this kind of analysis with a
certain degree of confidence. However, if we do the same with a real-world
entropy generator - say one based on thermal effects - we notice a slight
pink-noise effect, a tiny-tiny-tiny-tiny predominance of low frequencies. I
used the word tiny 4 times because the black-body law has a 4 as an
exponent, but please remember that correlation is not causation... ;-)
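
A crude statistical probe of that claim using only standard tools (the
Fourmilab "ent" utility, if installed, gives chi-square and entropy per
byte directly). Assuming any sizeable text file, here called sample.txt:
list the most frequent byte values in gzip -1 versus gzip -9 output; the
flatter the distribution, the closer the stream is to white noise.

for lvl in 1 9; do
    echo "== gzip -$lvl"
    gzip -${lvl}c sample.txt | od -An -v -tu1 | tr -s ' ' '\n' \
        | grep -v '^$' | sort -n | uniq -c | sort -rn | head -3
done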

At this point you may argue that the whole gzip output is made to be
reversible and that it is totally arbitrary to draw a distinction between
compressed data and decompression information. Let me be more specific: in
most cases, it is enough to discard the de/compression symbol tables to
earn the white-noise badge. White noise is our rabbit to follow down
Alice's Wonderland hole.


CREATING ENTROPY AT OUR WILL

Well, I cannot guarantee that our will is enough, but having a good
attitude towards the idea that entropy can be multiplied is at least a
starting point. Hence, let's start with 8 bits of entropy (unpredictable
data). A way to increase this entropy so that O(input) < O(output) cannot
be a deterministic function, Guillermo wrote. Guillermo is wrong - not
totally wrong, tiny-tiny-tiny-tiny wrong. But tiny⁴ is enough to trigger
chaos, and by leveraging chaos theory we can multiply the entropy.
Therefore, some deterministic functions running on some deterministic
systems can multiply entropy.

Before the practice, a bit of bare-simple theory - an example rather than a
theory. The best-selling book in the world is "Don Quixote", Wikipedia
states [1] - someone can argue that the best-selling book in the world is
the Bible - fair, use it if you wish. The main point is: a certain amount
of data that is available and knowable to everyone in the easiest way. In
public-key cryptography, it would be the public key.

+:git-restore:recovery2:yamui>
wget https://www.gutenberg.org/files/996/996-0.txt -qO - | wc -c
2391549

+:git-restore:recovery2:yamui>
wget https://www.gutenberg.org/files/996/996-0.txt -qO - | pigz -4c | \
    dd bs=1 of=/dev/null
952365+0 records in

+:git-restore:recovery2:yamui>
{ wget https://www.gutenberg.org/files/996/996-0.txt -qO -; \
  echo Hello World\!; } | pigz -4c | dd bs=1 of=/dev/null
952375+0 records in

The ASCII version of "Don Quixote" is 2391549 bytes, while its compressed
version is 952365 bytes. We should account for removing the de/compression
tables from this size, but for the sake of simplicity we are going to use
these numbers in the example; implementations may vary. We can also notice
that adding a couple of words does not change the compressed data size very
much. Now, we can start.

1. we have one byte of entropy
2. we have 2.3 MB of public data, far larger than our entropy
3. compressing that data generates a data stream about 1M times bigger
than our initial entropy

How can we leverage this to multiply our entropy, creating a deterministic
function running on an almost deterministic system? Where "almost" means
100% deterministic minus tiny⁴.

The number of characters is 2391549 = 0x247DFD and can be expressed with a
24-bit integer. We can read 3 bytes from /dev/urandom, put them in a
variable X, and X % 2391549 gives us a position. You may argue that this is
unfair because 1. /dev/urandom is the very system that we need to provide
and 2. in its current state it is not yet seeded properly and therefore is
predictable. OK, we have 8 bits of randomness, time to start using them:
randval % 4. Now we have 6 bits + 2 bits. With r = the 6 bits, we can read
the sort( {r, 63-r, 0x3F xor r} ) occurrences from /dev/urandom and decide
their order using the 2 bits. bit0: order or reversed order; bit1: the
middle goes in front or remains in the middle. You insist that /dev/urandom
cannot be used? OK: sha512sum(randval) = 64 bytes, enough to play the trick
and select the 3 bytes.
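
A minimal sketch of that last trick (variable names are illustrative, the
one trusted byte is read from /dev/random only as a stand-in, and the 16#
base conversion assumes bash or busybox ash arithmetic): hash the 8-bit
value with sha512sum, take the first 3 bytes of the digest as a 24-bit
integer X, and derive the position with X % 2391549.

randval=$(dd if=/dev/random bs=1 count=1 2>/dev/null | hexdump -ve '1/1 "%d"')
digest3=$(printf '%d' "$randval" | sha512sum | cut -c1-6)  # 3 bytes as hex
X=$((16#$digest3))
pos=$((X % 2391549))
echo "randval=$randval X=$X pos=$pos"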

Guillermo is getting mad, screaming that the entropy has remained the same
and that I have just played with bits. He is right, and I am going to play
even more with the bits. Let me show you.

With a 24-bit integer, call it N, I can choose a position in the Don
Quixote ASCII text by computing N % sizeof(text). At that position, I put
my 8 bits of entropy. The compressed data will differ from the original
text only after the compression function has met the changed character;
hence we consider the data after that point. That point is a cut-point.
This brings a great variance in the size and time of compression. The size
is no longer predictable, because I used the entropy to determine where to
cut the text and insert the change, and because I have inserted my 8 bits
of entropy I can also have 256 different outputs. The size and the time
still have a kind of correlation with the cut-point, but their least
significant digits do not.
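
A sketch of this cut-point idea with the names used above (N, r and the
file 996-0.txt; tail -c stands in for dd only for brevity, and N is read
from /dev/urandom here just to have something runnable): splice the entropy
byte in at the chosen position, compress from there onward, and look at how
much the size moves.

N=$((16#$(head -c 3 /dev/urandom | od -An -tx1 | tr -d ' \n')))
pos=$((N % 2391549))
r=123                                   # our 8 bits of entropy, as decimal
{ printf "$(printf '\\%03o' "$r")"; tail -c +$((pos + 1)) 996-0.txt; } \
    | pigz -4c | wc -c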

export LAST_NS_TIME=$(nstime); dd if=996-0.txt bs=1 \
    skip=$(($RANDOM$RANDOM$RANDOM$RANDOM % 2391549)) \
    status=none | pigz -4c >output; du -b output; nstime -
293575 output
+1.415828601

Now, I have some random noise (~293 KB) and at least 8 new bits of true
entropy (most probably 16 bits) that I can use for the next round, because
the 75 in the size and the 601 in the time are not predictable at all. You
might argue that, even after removing the de/compression tables, a
compressed data stream is not white noise. If it were not, then some
characters (frequencies) would be much more frequent than others, which
would mean that the compression algorithm is not compressing very well (it
could be gzip -1 instead of gzip -9) or that it is flawed (in which case we
would discard or fix it). I got white noise and I got entropy.


TINY⁴ VARIATIONS AND CHAOS

Up to here we have played with unpredictable tiny⁴ variations to create
white noise and a few bits of entropy, with a large amount of work and some
time. Now it is time to trigger the chaos and provide ourselves with a
large amount of entropy, which means elevating the white noise to entropy
status. How? By leveraging tiny⁴ variations to create totally unpredictable
outputs.

export LAST_NS_TIME=$(nstime); dd if=996-0.txt bs=1 \
    skip=$(($RANDOM$RANDOM$RANDOM$RANDOM % 2391549)) \
    status=progress 2>&1 | pigz -4c >output; du -b output; nstime -
939672 output
+5.748793786

export LAST_NS_TIME=$(nstime); dd if=996-0.txt bs=1 \
    skip=$(($RANDOM$RANDOM$RANDOM$RANDOM % 2391549)) \
    status=progress 2>&1 | pigz -4c >output; du -b output; nstime -
97220 output
+1.507189217

With two different runs and 6 seconds, about 1 MB of entropy has been
created. Notice that the stderr of the dd progress output is mixed with the
stdout that goes into the compression input. Do not trust me: do your own
tests and see with your own eyes that the way in which standard error and
standard output mix up is not totally predictable. In general, the output
of something like ( thread_one >&1 & thread_two >&2 & ) 2>&1 is not
predictable, especially on an SMP system. Imagine that thread one and
thread two are just reading the first and the second halves of Don Quixote
and a third is compressing a random mix-up of these two streams. To spread
the butter on the bread even better, we can add a usleep in between, using
our 8 bits of initial randomness: ( thread_one >&1 & usleep $randval;
thread_two >&2 & ) 2>&1

Obviously, the throughput should be slow enough to make the mixture
fine-grained, and bs=1 gives us a reasonable hope of that.
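
A sketch of this race-based mixer, reusing the names from the previous
examples (996-0.txt, $randval, usleep from busybox): two readers emit the
two halves of the text on stdout and stderr concurrently, the merged 2>&1
stream - whose interleaving depends on scheduling - feeds the compressor,
and the initial 8 bits of entropy delay the second reader.

half=1195774                                   # ~ 2391549 / 2
( head -c "$half" 996-0.txt | dd bs=1 &
  usleep "$randval"
  tail -c +$((half + 1)) 996-0.txt | dd bs=1 >&2 &
  wait
) 2>&1 | pigz -4c | wc -c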

IMPLEMENTATION

In this function, three streams of data mix among themselves in an
unpredictable way: stderr from multiple sources, stdout, and the set -x
trace.

hwget_random_byte_value() {
    # read one byte from the hardware random source (if any) and print
    # it as a decimal value; dd's stderr is left alone so that it can
    # join the mix when the caller is traced with set -x
    local dev=/dev/random
    test -e /dev/.random && dev=/dev/.random
    dd if=$dev bs=1 count=1 | hexdump -ve '1/1 "%d\n"';
} #endfunc

urandom_initseed() {
    local n m hcmd u=/dev/urandom bbcmd=/usr/bin/busybox-static
    # n and m are computed before the pipeline: pipeline stages run in
    # subshells, so values set inside the braces would not reach the
    # $((...)) expansions placed after the pipe
    n=$(hwget_random_byte_value)
    m=$(date +%N)
    {   set -x
        $bbcmd dd if=$u bs=$((n+32)) count=1; printf "$m"
        # pick one of md5/sha1/sha3/sha256/sha512; the 1${m} prefix
        # avoids the leading-zero-means-octal trap of date +%N
        hcmd=$(echo "md5:sha1:sha3:sha256:sha512" | cut -d: -f$((1 + 1${m}%5)))
        (cd /proc; cat *stat softirqs | tee - >&2 | eval ${hcmd}sum;) \
            | $bbcmd dd
        set +x
    } 2>&1 | pigz -$((1 + n%9))c | dd bs=$((5 + 1${m}%64)) skip=1 >$u 2>/dev/null
    dd if=$u bs=$((1024 + n + 1${m}%256)) count=1 >/dev/null 2>&1
} #endfunc

Once the de/compression table is removed from the compressed output, we
have created an entropy generator with a white-noise spectral output. On a
roughly 2 KB stream the table is quite small and sits at the beginning of
the data stream; therefore skipping something between 5 and 68 bytes can be
enough. However, improvements are always possible, and so are mistakes.
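
A possible, purely illustrative way to use it at early boot: reseed a few
times and then spot-check what landed in the pool.

for i in 1 2 3 4; do urandom_initseed; done
dd if=/dev/urandom bs=512 count=1 2>/dev/null | od -An -tx1 | head -4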

CONCLUSION

Our computer systems are not totally deterministic, especially those
working at high clock frequencies (1 GHz and above) and with multiple cores
(SMP). These systems are almost deterministic, with tiny⁴ variations.
Because these systems are digital, these variations never show up in the
numbers; otherwise ECC memory and CPU checks would trap them as errors.
Although they provide deterministic calculations, their functioning is not
absolutely deterministic, especially regarding the timing of SMP concurrent
processes. This can be leveraged to create large amounts of white noise, or
to trigger chaos and produce entropy (an unpredictable sequence of numbers)
with a white-noise spectral profile (the best) in relatively large amounts.

NOTE

[1] https://en.wikipedia.org/wiki/List_of_best-selling_books