On Mon, Jun 11, 2018 at 11:29 AM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote: > I think the underlying question is whether we want to treat this as an > emergency safety against OOM (which should be a rare thing in most cases) or > something allowing us to pick hash aggregate more often. > > It would be great to get something that performs better than just falling > back to sort (and I was advocating for that), but I'm worried we might be > moving the goalposts way too far.
Well, the scope of the first patch is always negotiable, but I think we should try to think beyond the first patch in design discussions, so that we don't paint ourselves into a corner. I also think it's worth noting that an emergency safety which also causes a 10x performance degradation is not necessarily much better than OOM. I suspect that part of the reason why both Andres and David proposed algorithms that combined hashing and sorting is out of a desire to cater somewhat to both few-tuples-per-group and many-tuples-per-group. AFAICS, Andres's algorithm will be better for few-tuples-per-group (maybe with a few large groups mixed in) and David's algorithm will be better for many-tuples-per-group (maybe with some small groups mixed in), but in each case the inclusion of a sorting component seems like it will ease the pain in the opposite use case. However, I wonder whether it will be better still to just acknowledge that those two cases really need separate algorithms and then think about the best approach for each one individually. One big advantage to just partitioning the input tuples by hash code and then proceeding batch by batch is that it works for any aggregate that can support hash aggregation in the first place. You do not need a btree opclass. You do not need support for serialize/deserialize or combine. If you have serialize/deserialize, it's better, because now you can spill growing transition values out of the hash table on the fly. But the basic contours of the algorithm don't change even if such support is lacking. That's a significant advantage in my view, because with some of these more complex strategies, we'll end up with code paths for data types lacking the necessary support which are almost never exercised in practice. That's likely to lead to bugs. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company