[gentoo-user] Re: OT Best way to compress files with digits
On 2014-11-02, Matti Nykyri matti.nyk...@iki.fi wrote: On Nov 1, 2014, at 23:56, David W Noon dwn...@ntlworld.com wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On Sat, 01 Nov 2014 22:47:15 +0200, Alan Mckinnon (alan.mckin...@gmail.com) wrote about Re: [gentoo-user] Re: OT Best way to compress files with digits (in 545546d3.3030...@gmail.com): On 01/11/2014 19:59, meino.cra...@gmx.de wrote: [snip] Ah! By the way...I was astonished to read, that the digits of PI are called random on the one hand and on the other hand there is a formula [1] to calculate a certain digit of PI without calculation of the previous digits... Calculated random? Are nature constants the purest form of PRNGs ??? ;) (Quantum physics is everywhere... ;;)) [1]: http://en.wikipedia.org/wiki/Bailey%E2%80%93Borwein%E2%80%93Plouffe_formula The sequence of digits that make up pi are a random sequence - you can analyze the order any way you want and you'll find no inherent pattern. Actually, the sequence of digits is most definitely *not* random. If the sequence of digits is written any other way then the value is not Pi. Hence the sequence is unique, not random. I think what you are grasping for is that the frequency of distinct digits tends to be uniform: 0's occur as often as 1's as often ... as 9's. Note that the as often as operator is really approximate for Well all the digit of pi can be compressed to the following: =pi(); Nah. Just switch to base-Pi, and then it compresses to: 1 -- Grant Edwards (grant.b.edwards at gmail.com) Yow! Are we THERE yet?
Re: [gentoo-user] Re: OT Best way to compress files with digits
On Sunday 02 Nov 2014 22:03:13 Peter Humphrey wrote: On Sunday 02 November 2014 21:55:31 Alan McKinnon wrote: English is a heavily overloaded language and there's always more than one way to communicate something Even the simplest cases usually have three words for the same thing: one from French, one from Latin and one from Anglo-Saxon. I won't even mention words that have come down from Old German and so on, but at least we don't have many words from Italian or Spanish. (Zucchini? What's that?) That's clearly baloney! -- Regards, Mick
Re: [gentoo-user] Re: OT Best way to compress files with digits
On Monday 03 November 2014 19:37:52 Mick wrote: Even the simplest cases usually have three words for the same thing: one from French, one from Latin and one from Anglo-Saxon. I won't even mention words that have come down from Old German and so on, but at least we don't have many words from Italian or Spanish. (Zucchini? What's that?) That's clearly baloney! Explain. -- Rgds Peter
Re: [gentoo-user] Re: OT Best way to compress files with digits
On Tuesday 04 Nov 2014 02:04:45 Peter Humphrey wrote: On Monday 03 November 2014 19:37:52 Mick wrote: Even the simplest cases usually have three words for the same thing: one from French, one from Latin and one from Anglo-Saxon. I won't even mention words that have come down from Old German and so on, but at least we don't have many words from Italian or Spanish. (Zucchini? What's that?) That's clearly baloney! Explain. http://en.wikipedia.org/wiki/Bologna_sausage :-) -- Regards, Mick
Re: [gentoo-user] Re: OT Best way to compress files with digits
On Nov 1, 2014, at 23:56, David W Noon dwn...@ntlworld.com wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On Sat, 01 Nov 2014 22:47:15 +0200, Alan Mckinnon (alan.mckin...@gmail.com) wrote about Re: [gentoo-user] Re: OT Best way to compress files with digits (in 545546d3.3030...@gmail.com): On 01/11/2014 19:59, meino.cra...@gmx.de wrote: [snip] Ah! By the way...I was astonished to read, that the digits of PI are called random on the one hand and on the other hand there is a formula [1] to calculate a certain digit of PI without calculation of the previous digits... Calculated random? Are nature constants the purest form of PRNGs ??? ;) (Quantum physics is everywhere... ;;)) [1]: http://en.wikipedia.org/wiki/Bailey%E2%80%93Borwein%E2%80%93Plouffe_formula The sequence of digits that make up pi are a random sequence - you can analyze the order any way you want and you'll find no inherent pattern. Actually, the sequence of digits is most definitely *not* random. If the sequence of digits is written any other way then the value is not Pi. Hence the sequence is unique, not random. I think what you are grasping for is that the frequency of distinct digits tends to be uniform: 0's occur as often as 1's as often ... as 9's. Note that the as often as operator is really approximate for finite sub-sequences, but is asymptotically accurate. Moreover, this is the same in any number base: the binary representation has 0's occurring as often as 1's; the ternary representation has 0's occurring as often as 1' and as often as 2's; etc., etc. Such numbers are called normal. It was a poor choice of name, but we are stuck with it. I would have called them digit soup numbers - -- an oblique reference to alphabet soup. 
Well all the digit of pi can be compressed to the following: =pi(); If you have the infinite series that calculates the digits :) However, any given digit in the sequence is 100% predictable, as you just showed :-) Randomness has got to be the second most mind-boggling thing out there, first being quantumness (that's not a waord, I just made it up. You you should get the meaning OK from context ;-) ) I would say that probability theory is more mind boggling, as it underpins much of quantum theory. But, as someone who majored in probability theory, I might be biased. [Incidentally, there is a small statistical joke in that last sentence.] Getting back to Meino's original request, one of the optimum compression algorithms for this would be custom Huffman encoding. To do this the algorithm requires that all the data (i.e. digits) be read and a frequency table built. The only problem is that to read all the digits of Pi could take rather a long time. ... :-) That would take infinite time :) - -- Regards, Dave [RLU #314465] *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* dwn...@ntlworld.com (David W Noon) *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* -BEGIN PGP SIGNATURE- Version: GnuPG v2 Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iEYEARECAAYFAlRVVyQACgkQRQ2Fs59Psv/9qwCeKwuLz/7RGEV06X+RdDQryDe+ /xwAoK1qMgb9RZXkQByBUMqB8eqs20bG =XUPB -END PGP SIGNATURE-
Re: [gentoo-user] Re: OT Best way to compress files with digits
On 01/11/2014 23:56, David W Noon wrote: The sequence of digits that make up pi are a random sequence - you can analyze the order any way you want and you'll find no inherent pattern. Actually, the sequence of digits is most definitely *not* random. If the sequence of digits is written any other way then the value is not Pi. Hence the sequence is unique, not random. I think what you are grasping for is that the frequency of distinct digits tends to be uniform: 0's occur as often as 1's as often ... as 9's. Note that the as often as operator is really approximate for finite sub-sequences, but is asymptotically accurate. Moreover, this is the same in any number base: the binary representation has 0's occurring as often as 1's; the ternary representation has 0's occurring as often as 1' and as often as 2's; etc., etc. Such numbers are called normal. It was a poor choice of name, but we are stuck with it. I would have called them digit soup numbers -- an oblique reference to alphabet soup. You grasp correctly what I was saying :-) I'm not formally trained in mathematics so I often get the terminology wrong or just don't know the accepted words for a concept. Lucky for me though, English is a heavily overloaded language and there's always more than one way to communicate something -- Alan McKinnon alan.mckin...@gmail.com
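The "digits occur about equally often" claim discussed above can be checked empirically on a finite prefix. One caveat: that pi is normal is a conjecture supported by counts like these, not a proven theorem. The following is only a sketch; it generates digits with a well-known unbounded spigot (usually credited to Gibbons) rather than reading Meino's files, and the function names are made up for illustration.

```python
from collections import Counter
from itertools import islice


def pi_digits():
    """Yield the decimal digits of pi (3, 1, 4, 1, 5, ...) one at a time.

    Unbounded spigot using exact integer arithmetic only.
    """
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            # The next digit is settled: emit it and rescale the state.
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            # Not enough information yet: consume one more series term.
            q, r, t, k, n, l = (
                q * k,
                (2 * q + r) * l,
                t * l,
                k + 1,
                (q * (7 * k + 2) + r * l) // (t * l),
                l + 2,
            )


def digit_counts(count):
    """Count how often each digit 0-9 occurs among the first `count` digits."""
    return Counter(islice(pi_digits(), count))


if __name__ == "__main__":
    total = 1000
    counts = digit_counts(total)
    for digit in range(10):
        print(digit, counts[digit], f"{counts[digit] / total:.3f}")
```

For a finite prefix the relative frequencies only hover around 0.1, which matches the remark that "as often as" is approximate for finite sub-sequences.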
Re: [gentoo-user] Re: OT Best way to compress files with digits
On Sunday 02 November 2014 21:55:31 Alan McKinnon wrote: English is a heavily overloaded language and there's always more than one way to communicate something Even the simplest cases usually have three words for the same thing: one from French, one from Latin and one from Anglo-Saxon. I won't even mention words that have come down from Old German and so on, but at least we don't have many words from Italian or Spanish. (Zucchini? What's that?) -- Rgds Peter
[gentoo-user] Re: OT Best way to compress files with digits
meino.cramer at gmx.de writes: I have a lot of files with digits of PI. The digits are the characters of 0-9. Currently they are ZIPped, which I think is not the best way to do that. Hello Meino, It's a bit of effort, but the world's recognized authority on algorithms is Don Knuth. [1] He's old now, but his pioneering attempt at categorizing most algorithms, The Art of Computer Programming, and his MMIX algorithm implementations (kinda like assembler) are certainly part of many first-step research efforts on algorithms and their implementations. It's not a cookbook; more of a scholarly (highbrow) reference, just to supplement all the good postings by your peers on gentoo-user. Alan may loan you his copy? (ha ha ha) hth, James [1] http://www-cs-faculty.stanford.edu/~uno/
Re: [gentoo-user] Re: OT Best way to compress files with digits
On 01/11/2014 19:15, James wrote: meino.cramer at gmx.de writes: I have a lot of files with digits of PI. The digits are the characters of 0-9. Currently they are ZIPped, which I think is not the best way to do that. Hello Meino, It's a bit of effort, but the world's recognized authority on algorithms is Don Knuth. [1] He's old now, but his pioneering attempt at categorizing most algorithms: The art of computer programming and his MMIX alogrithm implementations (kinda like assembler) are certainly part of many first-step research efforts on algorithms and their implementations. It's not a cookbook; more of a scholarly (high_brow) reference, just to supplement all the good postings by your peers on gentoo user. Alan may loan you his copy? (ha ha ha)? hth, James [1] http://www-cs-faculty.stanford.edu/~uno/ ha ha, fat chance :-) When Alan does eventually get his hands on his very own personal copy[1], it will be lent to nobody. There are just some things a man never lends out: his bike, his firearm, his wife. And Knuth :-) Back on topic: You're 100% right - to learn about algorithms in general, Knuth is the man. Essential reading for anyone taking CS seriously -- Alan McKinnon alan.mckin...@gmail.com
Re: [gentoo-user] Re: OT Best way to compress files with digits
James wirel...@tampabay.rr.com [14-11-01 18:16]: meino.cramer at gmx.de writes: I have a lot of files with digits of PI. The digits are the characters of 0-9. Currently they are ZIPped, which I think is not the best way to do that. Hello Meino, It's a bit of effort, but the world's recognized authority on algorithms is Don Knuth. [1] He's old now, but his pioneering attempt at categorizing most algorithms: The art of computer programming and his MMIX algorithm implementations (kinda like assembler) are certainly part of many first-step research efforts on algorithms and their implementations. It's not a cookbook; more of a scholarly (high_brow) reference, just to supplement all the good postings by your peers on gentoo user. Alan may loan you his copy? (ha ha ha)? hth, James [1] http://www-cs-faculty.stanford.edu/~uno/

Hello James,

Don Knuth ... oh YES! :) For a long time I have been using and preferring TeX over anything else (ok...for ASCII I use vim... ;). And besides his computer wisdom I also like his kind of humor a lot... for example this one: https://www.youtube.com/watch?v=eKaI78K_rgA&list=PLUu0XRts4lK6Ri7-xaCNYqTHx7We95Rk8&index=10

But my initial question was targeted more at practical computing than at groundshaking and fundamental research topics. More like "what tool to pick?"...

I did some compression tests myself and currently I have this: From http://piworld.calico.jp/ (http://piworld.calico.jp/estart.html) I got zipped package of 1000 million places of PI each (~57MB for one ZIP). I unpacked the first package and recompressed it with different methods of 7zip, gzip and bzip2. For gzip and bzip2 I used the highest compression mode (-9). When a file's name matches /.*ultra.*/, I used the highest compression mode (-mx=9); otherwise I only set the compression method and left the rest untouched (defaults).

    size (bytes)  date        time   file
           11996  2014-10-31  16:44  pi-0001.txt
        57105419  2014-10-31  16:47  pi-0001.txt.gz
        52632832  2014-10-31  16:48  pi-0001.txt.bz2
        52045827  2014-10-31  16:54  pi-0001.txt.ppmd.7z
        57110291  2014-10-31  17:23  pi-0001.zip
        51766683  2014-10-31  17:26  pi-0001.txt.lzma.7z
        51668838  2014-10-31  17:34  pi-0001.txt.lzma.ultra.7z
        52862115  2014-10-31  17:36  pi-0001.txt.ppmd.ultra.7z
        51668838  2014-10-31  17:39  pi-0001.txt.ultra.7z

7zip's lzma wins here, which is also the default method of 7zip. I set the ultra mode for this by hand.

From other sites that offer PI for download I know of methods that store the ASCII digits in binary and then compress them. It would be interesting to see whether this gives 7zip a handier input...

Ah! By the way...I was astonished to read that the digits of PI are called random on the one hand, while on the other hand there is a formula [1] to calculate a certain digit of PI without calculating the previous digits... Calculated random? Are nature constants the purest form of PRNGs ??? ;) (Quantum physics is everywhere... ;;))

[1]: http://en.wikipedia.org/wiki/Bailey%E2%80%93Borwein%E2%80%93Plouffe_formula

Best regards, Meino
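For readers curious about the formula in [1]: the Bailey-Borwein-Plouffe formula extracts hexadecimal (base-16) digits of pi, not decimal ones, so it does not directly reproduce the digits in these files. A minimal sketch of the digit extraction follows; it relies on ordinary double-precision floats, so it should only be trusted for modest positions, and the function name is invented here.

```python
def pi_hex_digit(n):
    """Return the hex digit of pi at position n after the point (n = 0 gives 2).

    Pi in hexadecimal starts 3.243F6A88..., so positions 0, 1, 2 are 2, 4, 3.
    """

    def partial_sum(j):
        # Fractional part of sum over k of 16**(n - k) / (8k + j).
        s = 0.0
        # Head of the series: modular exponentiation keeps numbers small.
        for k in range(n + 1):
            denom = 8 * k + j
            s = (s + pow(16, n - k, denom) / denom) % 1.0
        # Tail of the series: terms shrink by a factor of 16 each step.
        k = n + 1
        while True:
            term = 16.0 ** (n - k) / (8 * k + j)
            if term < 1e-17:
                break
            s += term
            k += 1
        return s % 1.0

    x = (4 * partial_sum(1) - 2 * partial_sum(4)
         - partial_sum(5) - partial_sum(6)) % 1.0
    return int(x * 16)


if __name__ == "__main__":
    print("".join(format(pi_hex_digit(n), "X") for n in range(8)))
```

That any single digit can be computed this way is exactly why the sequence is deterministic even though its statistics look random.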
Re: [gentoo-user] Re: OT Best way to compress files with digits
On Nov 1, 2014, at 19:26, Alan McKinnon alan.mckin...@gmail.com wrote: On 01/11/2014 19:15, James wrote: meino.cramer at gmx.de writes: I have a lot of files with digits of PI. The digits are the characters of 0-9. Currently they are ZIPped, which I think is not the best way to do that. Hello Meino, It's a bit of effort, but the world's recognized authority on algorithms is Don Knuth. [1] He's old now, but his pioneering attempt at categorizing most algorithms: The art of computer programming and his MMIX alogrithm implementations (kinda like assembler) are certainly part of many first-step research efforts on algorithms and their implementations. It's not a cookbook; more of a scholarly (high_brow) reference, just to supplement all the good postings by your peers on gentoo user. Alan may loan you his copy? (ha ha ha)? hth, James [1] http://www-cs-faculty.stanford.edu/~uno/ ha ha, fat chance :-) When Alan does eventually get his hands on his very own personal copy[1], it will be lent to nobody. There are just some things a man never lends out: his bike, his firearm, his wife. And Knuth :-) Why not lend your wife? ;) Back on topic: You're 100% right - to learn about algorithms in general, Knuth is the man. Essential reading for anyone taking CS seriously -- Alan McKinnon alan.mckin...@gmail.com
Re: [gentoo-user] Re: OT Best way to compress files with digits
On 01/11/2014 19:59, meino.cra...@gmx.de wrote: James wirel...@tampabay.rr.com [14-11-01 18:16]: meino.cramer at gmx.de writes: I have a lot of files with digits of PI. The digits are the characters of 0-9. Currently they are ZIPped, which I think is not the best way to do that. Hello Meino, It's a bit of effort, but the world's recognized authority on algorithms is Don Knuth. [1] He's old now, but his pioneering attempt at categorizing most algorithms: The art of computer programming and his MMIX alogrithm implementations (kinda like assembler) are certainly part of many first-step research efforts on algorithms and their implementations. It's not a cookbook; more of a scholarly (high_brow) reference, just to supplement all the good postings by your peers on gentoo user. Alan may loan you his copy? (ha ha ha)? hth, James [1] http://www-cs-faculty.stanford.edu/~uno/ Hello james, Don Knuth ... oh YES! :) For a long time I am using and prefering TeX over anything else (ok...for ASCII I use vim... ;). And beside his computer wisdom I also like his kind of humor a lot... for example this one: https://www.youtube.com/watch?v=eKaI78K_rgAlist=PLUu0XRts4lK6Ri7-xaCNYqTHx7We95Rk8index=10 But my initial question was more targeted to practical computing as to groundshakeing and fundamental research topics. More like what tool to pick?... I did some compression tests myself and currently I have this: From http://piworld.calico.jp/ (http://piworld.calico.jp/estart.html) I got zipped package of 1000 million places of PI each (~57MB for one ZIP). I unpacked the first package and recompressed it with different methods of 7zip, gzip and bzip2. For gzip and bzip2 I used the highest compression mode (-9). When a files name matches /.*ultra.*/, I used the highest compression mode (-mx=9), else I only set the compression method and leave the rest untouched (defaults). 
    size (bytes)  date        time   file
           11996  2014-10-31  16:44  pi-0001.txt
        57105419  2014-10-31  16:47  pi-0001.txt.gz
        52632832  2014-10-31  16:48  pi-0001.txt.bz2
        52045827  2014-10-31  16:54  pi-0001.txt.ppmd.7z
        57110291  2014-10-31  17:23  pi-0001.zip
        51766683  2014-10-31  17:26  pi-0001.txt.lzma.7z
        51668838  2014-10-31  17:34  pi-0001.txt.lzma.ultra.7z
        52862115  2014-10-31  17:36  pi-0001.txt.ppmd.ultra.7z
        51668838  2014-10-31  17:39  pi-0001.txt.ultra.7z

7zip's lzma wins here, which is also the default method of 7zip. I set the ultra mode for this by hand. From other sites which offer PI for download I know of methods, which store the ASCII-digits in binary and compresses then. Would be interesting, whether this creates a more handy input from 7zips point of view... Ah! By the way...I was astonished to read, that the digits of PI are called random on the one hand and on the other hand there is a formula [1] to calculate a certain digit of PI without calculation of the previous digits... Calculated random? Are nature constants the purest form of PRNGs ??? ;) (Quantum physics is everywhere... ;;)) [1]: http://en.wikipedia.org/wiki/Bailey%E2%80%93Borwein%E2%80%93Plouffe_formula

The sequence of digits that makes up pi is a random sequence - you can analyze the order any way you want and you'll find no inherent pattern. However, any given digit in the sequence is 100% predictable, as you just showed :-) Randomness has got to be the second most mind-boggling thing out there, first being quantumness (that's not a word, I just made it up. You should get the meaning OK from context ;-) ) -- Alan McKinnon alan.mckin...@gmail.com
[gentoo-user] Re: OT Best way to compress files with digits
On Sat, 01 Nov 2014 22:47:15 +0200, Alan Mckinnon (alan.mckin...@gmail.com) wrote about Re: [gentoo-user] Re: OT Best way to compress files with digits (in 545546d3.3030...@gmail.com): On 01/11/2014 19:59, meino.cra...@gmx.de wrote: [snip] Ah! By the way...I was astonished to read, that the digits of PI are called random on the one hand and on the other hand there is a formula [1] to calculate a certain digit of PI without calculation of the previous digits... Calculated random? Are nature constants the purest form of PRNGs ??? ;) (Quantum physics is everywhere... ;;)) [1]: http://en.wikipedia.org/wiki/Bailey%E2%80%93Borwein%E2%80%93Plouffe_formula The sequence of digits that make up pi are a random sequence - you can analyze the order any way you want and you'll find no inherent pattern. Actually, the sequence of digits is most definitely *not* random. If the sequence of digits is written any other way then the value is not Pi. Hence the sequence is unique, not random. I think what you are grasping for is that the frequency of distinct digits tends to be uniform: 0's occur as often as 1's as often ... as 9's. Note that the "as often as" operator is really approximate for finite sub-sequences, but is asymptotically accurate. Moreover, this is the same in any number base: the binary representation has 0's occurring as often as 1's; the ternary representation has 0's occurring as often as 1's and as often as 2's; etc., etc. Such numbers are called "normal". It was a poor choice of name, but we are stuck with it. I would have called them "digit soup" numbers -- an oblique reference to alphabet soup. However, any given digit in the sequence is 100% predictable, as you just showed :-) Randomness has got to be the second most mind-boggling thing out there, first being quantumness (that's not a word, I just made it up. You should get the meaning OK from context ;-) ) I would say that probability theory is more mind boggling, as it underpins much of quantum theory. But, as someone who majored in probability theory, I might be biased. [Incidentally, there is a small statistical joke in that last sentence.] Getting back to Meino's original request, one of the optimum compression algorithms for this would be custom Huffman encoding. To do this the algorithm requires that all the data (i.e. digits) be read and a frequency table built. The only problem is that to read all the digits of Pi could take rather a long time. ... :-) -- Regards, Dave [RLU #314465] dwn...@ntlworld.com (David W Noon)
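For a finite file, the frequency table David describes is easy to build, and the resulting Huffman code can be inspected directly. The sketch below is a hypothetical illustration (the helper names are not from the thread) and computes only code lengths, not an actual bit stream. With ten equally frequent symbols it ends up with six 3-bit codes and four 4-bit codes, i.e. 3.4 bits per digit on average, slightly above the log2(10) ≈ 3.32 bound.

```python
import heapq


def huffman_code_lengths(freq):
    """Return {symbol: code length in bits} for a frequency table.

    Standard Huffman construction: repeatedly merge the two lightest
    subtrees; every merge adds one bit to each symbol inside them.
    """
    # Heap entries are (weight, tie_breaker, symbols_in_subtree).
    heap = [(w, i, [sym]) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in freq}
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, next_id, syms1 + syms2))
        next_id += 1
    return lengths


def average_bits(freq, lengths):
    """Average code length per symbol, weighted by frequency."""
    total = sum(freq.values())
    return sum(freq[sym] * lengths[sym] for sym in freq) / total


if __name__ == "__main__":
    # Assumed, idealised table: every digit occurs equally often.
    uniform = {str(d): 1000 for d in range(10)}
    lengths = huffman_code_lengths(uniform)
    print(lengths)
    print(average_bits(uniform, lengths))
```

In practice the table would come from counting the digits of one of the pi-NNNN.txt files; near-uniform counts give essentially the same code lengths.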
[gentoo-user] Re: OT Best way to compress files with digits
On Fri, 31 Oct 2014, Rich Freeman wrote:
I can't imagine that any tool will do much better than something like lzo, gzip, xz, etc. You'll definitely benefit from compression though - your text files full of digits are encoding 3.3 bits of information in an 8-bit ascii character and even if the order of digits in pi can be treated as purely random just about any compression algorithm is going to get pretty close to that 3.3 bits per digit figure.

On Fri, Oct 31, 2014 at 2:55 PM, David Haller gen...@dhaller.de wrote:
Good estimate:
    $ calc '101000/(8/3.3)'
    41662.5
and I get from (lzip)
    $ calc 44543*8/101000
    3.528... (bits/digit)
to zip:
    $ calc 49696*8/101000
    ~3.93 (bits/digit)

On 2014-10-31, Rich Freeman ri...@gentoo.org wrote:
Actually, I'm surprised how far off of this the various methods are. I was expecting SOME overhead, but not this much. A fairly quick algorithm would be to encode every possible set of 96 digits into a 40 byte code (that is just a straight decimal-binary conversion). Then read a word at a time and translate it. This will only waste 0.011 bits per digit.

You're cheating. The algorithm you tested will compress strings of arbitrary 8-bit values. The algorithm you proposed will only compress strings of bytes where each byte can have only one of 10 values.

-- Grant Edwards (grant.b.edwards at gmail.com) Yow! I want another RE-WRITE on my CEASAR SALAD!!
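The 96-digits-into-40-bytes idea can be written down in a few lines. This is a sketch under stated assumptions: the input length must be a multiple of 96, the function names are invented for illustration, and nothing here handles a trailing partial block. It works because 10^96 is just below 2^320, so each block costs 320/96 ≈ 3.333 bits per digit against the log2(10) ≈ 3.322 minimum, which is the 0.011 bits of waste mentioned above.

```python
import math

DIGITS_PER_BLOCK = 96
BYTES_PER_BLOCK = 40  # 320 bits; 10**96 < 2**320, so every block fits


def pack_digits(text):
    """Pack a string of decimal digits into 40 bytes per 96 digits."""
    if len(text) % DIGITS_PER_BLOCK:
        raise ValueError("length must be a multiple of 96 digits")
    out = bytearray()
    for start in range(0, len(text), DIGITS_PER_BLOCK):
        block = text[start:start + DIGITS_PER_BLOCK]
        if not (block.isascii() and block.isdigit()):
            raise ValueError("input may only contain the characters 0-9")
        # Straight decimal-to-binary conversion of the whole block.
        out += int(block).to_bytes(BYTES_PER_BLOCK, "big")
    return bytes(out)


def unpack_digits(data):
    """Inverse of pack_digits; restores leading zeros in each block."""
    parts = []
    for start in range(0, len(data), BYTES_PER_BLOCK):
        value = int.from_bytes(data[start:start + BYTES_PER_BLOCK], "big")
        parts.append(str(value).zfill(DIGITS_PER_BLOCK))
    return "".join(parts)


if __name__ == "__main__":
    sample = ("0123456789" * 20)[:192]  # two blocks of made-up digits
    packed = pack_digits(sample)
    print(len(sample), "digits ->", len(packed), "bytes")
    print("waste per digit:",
          8 * BYTES_PER_BLOCK / DIGITS_PER_BLOCK - math.log2(10))
```

As Grant points out, this only works because the packer knows in advance that the input consists of nothing but the ten digit characters.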
Re: [gentoo-user] Re: OT Best way to compress files with digits
On Fri, Oct 31, 2014 at 4:25 PM, Grant Edwards grant.b.edwa...@gmail.com wrote: You're cheating. The algorithm you tested will compress strings of arbitrary 8-bit values. The algorithm you proposed will only compress strings of bytes where each byte can have only one of 10 values. Of course. I wasn't expecting the general-purpose algorithm to do as well. In some sense, part of the information that is being encoded is actually in the compression algorithm itself (the mapping), while in a general-purpose compression algorithm that information has to be part of the compressed data stream. I was just expecting gzip/etc to get much closer to the theoretical limit. I figured that it might be a few percent higher, but I wasn't expecting a 10+% difference. -- Rich
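The gap Rich describes can be reproduced without downloading anything. The sketch below makes a loud assumption: it uses pseudo-random digits from Python's `random` module as a stand-in for digits of pi, and it uses the standard-library `zlib`, `bz2` and `lzma` modules rather than the gzip, bzip2 and 7zip command-line tools used in the thread, so the exact numbers will differ from Meino's and David's.

```python
import bz2
import lzma
import math
import random
import zlib


def bits_per_digit(compress, data):
    """Compressed size in bits divided by the number of input digits."""
    return 8 * len(compress(data)) / len(data)


def compare(num_digits=200_000, seed=1):
    """Compress pseudo-random ASCII digits with three stdlib codecs."""
    rng = random.Random(seed)
    data = "".join(rng.choices("0123456789", k=num_digits)).encode("ascii")
    return {
        "zlib level 9": bits_per_digit(lambda b: zlib.compress(b, 9), data),
        "bz2 level 9": bits_per_digit(lambda b: bz2.compress(b, 9), data),
        "lzma default": bits_per_digit(lzma.compress, data),
    }


if __name__ == "__main__":
    print(f"theoretical limit: {math.log2(10):.3f} bits/digit")
    for name, bits in compare().items():
        print(f"{name}: {bits:.3f} bits/digit")
```

Every general-purpose result lands above log2(10), because the compressor has to discover and encode the ten-symbol alphabet itself, which is the point made above about the mapping living in the data stream.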