On Fri, Jun 24, 2016 at 4:29 PM, Andres Freund <and...@anarazel.de> wrote:
> As a motivation, here's a somewhat juicy example of the benefits that
> can be gained (disabled parallelism, results vary too much):
> SELECT l_linestatus, SUM(l_quantity) FROM lineitem GROUP BY 1 ORDER BY 2 DESC
> LIMIT 10 ;
> non-optimized: 9075.835 ms
> optimized: 5194.024 ms
> and that's far from doing all the possible work for that query;
> especially the batching is far from optimal.
That's pretty great. Let me first just say that I think you
broadly have the right idea here, and that it will likely be a big
area to work on in the years ahead. This may become a big, general
area with many sub-projects, kind of like parallelism. Also, I think
it's very much complementary to parallelism.
> So far I've mostly looked at performance for tpc-h, not because it's a
> particularly good workload, but because it's available. If somebody has
> good suggestions for an analytics heavy workload...
I think that it's the only thing available that isn't a pain to set up.
> My observations about the performance bottlenecks I found, and partially
> addressed, are in rough order of importance (there's interdependence
> between most of them):
> (List of various bottlenecks)
> I'm planning to start individual subthreads for some of these, with
> in-detail discussions and/or patches. But I wanted to present this on a
> higher level once, before spending even more time.
> Have I missed concrete issues others noticed? Missed significant
> improvements (please, please, only realistic stuff)?
I'll add one: We should have a way to make routines like
heap_copytuple() "allocate" into the caller's own batch memory. We may
be able to come up with a reasonably generic abstraction that
minimizes the special case handling of things like exhausting caller's
batch allocation. Failure to do something like this wastes an awful
lot of memory, but more importantly causes a lot of unnecessary cache
misses in tuplestore.c, and even with the 9.6 work in tuplesort.c.
Someone should also work on palloc() fragmentation, but I tend to
think that that's a less efficient way of improving matters. I'd leave
that until later. Having an excessive number of retail palloc() calls
for cases where the caller really does process tuples in a batch
fashion is inherently inefficient. I'm all for the simplicity of the
memory context abstraction, but ISTM that a little specialization gets
you very far.
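To make the batch-allocation idea concrete, here is a minimal sketch in C of what a caller-supplied batch arena might look like. The names (BatchArena, batch_alloc, copy_tuple_batched) are purely illustrative and not PostgreSQL APIs; the point is that a heap_copytuple()-style routine copies into the caller's batch instead of doing a retail palloc() per tuple, with a fallback path covering the exhausted-batch special case:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical batch arena -- names are illustrative, not PostgreSQL APIs. */
typedef struct BatchArena
{
	char	   *base;			/* caller-provided buffer */
	size_t		used;			/* bytes consumed so far */
	size_t		capacity;		/* total bytes available */
} BatchArena;

/*
 * Bump-allocate from the arena, falling back to malloc() when the batch
 * is exhausted -- the "special case handling" a generic abstraction
 * would need to minimize.
 */
static void *
batch_alloc(BatchArena *arena, size_t size)
{
	size_t		aligned = (size + 7) & ~(size_t) 7;	/* 8-byte alignment */

	if (arena->used + aligned <= arena->capacity)
	{
		void	   *p = arena->base + arena->used;

		arena->used += aligned;
		return p;
	}
	return malloc(size);		/* caller's batch allocation exhausted */
}

/*
 * Sketch of a heap_copytuple()-style routine that "allocates" into the
 * caller's batch memory rather than issuing one retail allocation per
 * tuple.
 */
static char *
copy_tuple_batched(BatchArena *arena, const char *tuple, size_t len)
{
	char	   *dst = batch_alloc(arena, len);

	memcpy(dst, tuple, len);
	return dst;
}
```

Because consecutive copies land contiguously in one buffer, a consumer like tuplestore.c that later walks the tuples in order would take far fewer cache misses than with individually palloc'd copies scattered across the heap.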
> Unfortunately these items are heavily interdependent. For some queries
> you'll need issue n) addressed before n+2) shows a benefit, for some
> it's the other way, and such. Given the size of the overall task, the
> only way I can see this being reasonably addressable, is trying to get
> smaller steps in individually, even if not a huge win on their own. And
> then focus on the larger architectural changes (primarily 3))
I could not agree more. This should be strongly emphasized.
I have parallel CREATE INDEX working pretty well now. I don't think I
could have done that without the 9.6 work on external sorting's cache
efficiency, so in some sense that earlier work has enabled parallelism
in a non-obvious way. I think that abbreviated keys will prove really
important for getting the best out of parallel sort (and that a cost
model will have to consider stuff like leading attribute cardinality
-- in other words, it must consider CPU cache efficiency). I also
suspect that solving the problems with heap_deform_tuple() first would
make my pending parallel CREATE INDEX patch look a lot better relative
to a serial CREATE INDEX. That's the kind of intuitive speculation
that's really hard to prove. It may be that it doesn't matter there,
because working on fixing that yields obvious benefits. But, as you
suggest, we should consider that that might not always be true. That's
really tricky, and has historically been the kind of thing we've
managed very badly.
Sent via pgsql-hackers mailing list (firstname.lastname@example.org)