On Fri, Mar 04, 2022 at 08:08:03AM -0500, Robert Haas wrote:
> On Fri, Mar 4, 2022 at 6:44 AM Justin Pryzby <pry...@telsasoft.com> wrote:
>> In my 1-off test, it gets 610/633 = 96% of the benefit at 209/273 = 77%
>> of the cost.
Hmm, it may be good to start afresh and compile the numbers in a
single chart.  I did that here, with some numbers on user and system
CPU:
https://www.postgresql.org/message-id/YMmlvyVyAFlxZ+/h...@paquier.xyz

For this test, regarding ZSTD, the lowest level did not differ much
from the default level, and at the highest level the user CPU spiked
for little gain in compression.  All the ZSTD levels compressed more
than LZ4, using more CPU in each case, but my impression is that the
choice between the default and a level lower than the default does
not matter much in terms of compression gains and CPU usage.

> I agree with Michael. Your 1-off test is exactly that, and the results
> will have depended on the data you used for the test. I'm not saying
> we could never decide to default to a compression level other than the
> library's default, but I do not think we should do it casually or as
> the result of any small number of tests. There should be a strong
> presumption that the authors of the library have a good idea what is
> sensible in general unless we can explain compellingly why our use
> case is different from typical ones.
>
> There's an ease-of-use concern here too. It's not going to make things
> any easier for users to grok if zstd is available in different parts
> of the system but has different defaults in each place. It wouldn't be
> the end of the world if that happened, but neither would it be ideal.

I'd like to believe that anybody who writes their own compression
algorithm has a good idea of the default behavior they want to
expose, so we could keep things simple and trust them.

Now, I would not object to seeing some fresh numbers.  Assuming that
all FPIs have the same page size, we could design a couple of test
cases that produce a fixed number of FPIs and measure their
compressibility in a single session.  Both repeatability and the
randomness of the data count.  We could have, for example, one case
with a set of 5~7 int attributes, a second with text values made of
random data, up to 10~12 bytes each, counting on the tuple headers to
still be compressible, and a third with more repeatable data, like a
single int attribute populated with generate_series().  Just to give
an idea, see the sketch below.
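To make that a bit more concrete, here is a rough sketch of the three
data sets I have in mind.  The table names are only illustrative, and
this assumes full_page_writes is on and wal_compression is set to the
algorithm being tested; the CHECKPOINT followed by an update pass is
there to force one FPI the first time each page is dirtied:

-- Case 1: a set of 5~7 plain int attributes.
CREATE TABLE fpi_int (a int, b int, c int, d int, e int, f int);
INSERT INTO fpi_int
  SELECT i, i, i, i, i, i FROM generate_series(1, 100000) i;

-- Case 2: random text values of 10~12 bytes, counting on the tuple
-- headers to remain compressible.
CREATE TABLE fpi_random (t text);
INSERT INTO fpi_random
  SELECT substr(md5(random()::text), 1, 12)
  FROM generate_series(1, 100000);

-- Case 3: repeatable data, one int attribute from generate_series().
CREATE TABLE fpi_series (a int);
INSERT INTO fpi_series SELECT generate_series(1, 100000);

-- Force FPIs: checkpoint, then dirty each page once.
CHECKPOINT;
UPDATE fpi_int SET a = a;
UPDATE fpi_random SET t = t;
UPDATE fpi_series SET a = a;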
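The resulting WAL could then be compared across compression settings,
say with pg_waldump --stats=record on the segments generated by each
run, or by watching the wal_fpi and wal_bytes counters of pg_stat_wal.
That workflow is only a suggestion, of course.
--
Michael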