On Fri, Jan 27, 2023 at 12:34 PM Jeff Davis <pg...@j-davis.com> wrote: > It is non-deterministic, but I tried with two generated files, and got > similar results.
Jeff and I coordinated off-list. It turned out that the nondeterministic nature of the program to generate test data was behind my initial inability to recreate Jeff's results. Once Jeff provided me with the exact data that he saw the problem with, I was able to recreate the problematic case for abbreviated keys. It turns out that this was due to aborting abbreviation way too late in the process. It would happen relatively late in the process, when more than 50% of all tuples had already had abbreviations generated by ICU. This was a marginal case for abbreviated keys, which is precisely why it only happened this long into the process. That factor is also likely why I couldn't recreate the problem at first, even though I had test data that was substantially the same as the data required to show the problem. Attached patch fixes the issue. It teaches varstr_abbrev_abort to do something similar to every other abbreviated keys abort function: stop estimating cardinality entirely (give up on giving up) once there are a certain number of distinct abbreviated keys, regardless of any other factor. This is very closely based on existing code from numeric_abbrev_abort, though I use a cutoff of 10k rather than a cutoff of 100k. This difference is justified by the special considerations for text, where we authoritative comparisons have further optimizations such as strcoll caching and the memcmp equality fast path. It's also required to actually fix the test case at hand -- 100k isn't enough to avoid the performance issue Jeff reported. I think that this should be committed to HEAD only. -- Peter Geoghegan
v1-0001-Abort-abbreviated-keys-for-text-less-eagerly.patch
Description: Binary data