On Sat, Jul 25, 2020 at 4:56 PM Jeff Davis <pg...@j-davis.com> wrote:
> I wrote a quick patch to use HyperLogLog to estimate the number of
> groups contained in a spill file. It seems to reduce the
> overpartitioning effect, and is a more principled approach than what I
> was doing before.
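For anyone following along who hasn't looked at HyperLogLog before, here is a rough, self-contained sketch of the general technique in Python. It is not the code from Jeff's patch and not PostgreSQL's implementation. The class name, the SHA-1-based hash, and the precision parameter p are all assumptions made for illustration. The idea is to feed in one key per spilled tuple and get back an approximate count of distinct groups using only a small, fixed amount of memory.

```python
import hashlib
import math


class HyperLogLogSketch:
    """Illustrative HyperLogLog counter (hypothetical, not the patch's code)."""

    def __init__(self, p=12):
        self.p = p                    # number of index bits (assumed default)
        self.m = 1 << p               # number of registers
        self.registers = [0] * self.m

    def add(self, key: bytes) -> None:
        # Derive a 64-bit hash; SHA-1 is used here only for convenience.
        h = int.from_bytes(hashlib.sha1(key).digest()[:8], "big")
        idx = h >> (64 - self.p)                   # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)           # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw


def estimate_groups(keys) -> float:
    """Estimate the number of distinct group keys in an iterable of bytes."""
    hll = HyperLogLogSketch()
    for key in keys:
        hll.add(key)
    return hll.estimate()
```

With p = 12 the sketch keeps 4096 small registers, which gives a typical relative error of roughly 1.6% however many tuples pass through it. Duplicate keys never raise the estimate, which is what makes the structure a fit for counting groups rather than tuples.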

The patch pretty much fixes the issue that I observed with
overpartitioning, at least in the sense that the number of partitions
grows more predictably. Even when the number of partitions planned is
reduced, the change in the number of batches seems smooth-ish. It
"looks nice".

> It does seem to hurt the runtime slightly when spilling to disk in some
> cases. I haven't narrowed down whether this is because we end up
> recursing multiple times, or if it's just more efficient to
> overpartition, or if the cost of doing the HLL itself is significant.

I'm glad that this more principled approach is possible. It's hard to
judge how much of a problem the slowdown really is, though. We'll need
to think about this aspect some more.

Thanks
--
Peter Geoghegan