Re: [HACKERS] Progress on fast path sorting, btree index creation time

Peter Geoghegan Wed, 15 Feb 2012 05:30:04 -0800

On 15 February 2012 06:16, Robert Haas <[email protected]> wrote:
> On Fri, Feb 10, 2012 at 10:30 AM, Peter Geoghegan <[email protected]> 
> wrote:
>> [ new patch ]
>
> I spent quite a bit of time looking at this today - the patch
> specifically, and the issue of making quicksort fast more generally.
> It seemed to me that if we're going to have separate copies of the
> quicksort code for tuple sorting, we might as well go whole hog and
> specialize those copies to the particular needs of tuplesort.c as much
> as possible.  Accordingly, I whacked the code around so that it knows
> that it is sorting SortTuple objects and need not conditionalize at
> runtime on the size of the objects being swapped.  You suggested
> upthread that this might be worthwhile, and it seems that it is, so I
> think we should do it.


Cool. I agree that we should do this. It doesn't need to be justified
as a performance optimisation - it makes sense to refactor in this
way. If that makes things faster, then so much the better.

> Your patch removes the CHECK_FOR_INTERRUPTS() call from
> comparetup_heap, which is no good.  However, since I'd already decided
> to specialize the copies of quicksort intended for sorting as much as
> possible, it made sense to me to refactor things so that the qsort
> routine itself, rather than the comparator, is responsible for calling
> CHECK_FOR_INTERRUPTS().  This slightly reduces the number of times we
> CHECK_FOR_INTERRUPTS(), but never allows more than a few comparisons
> before doing it.

Well, the idea of that was to have one and only CHECK_FOR_INTERRUPTS()
call in the leading comparator. I should perhaps have further stressed
that that patch was only intended to establish the idea of having that
function pointer call within an inlined "leading-key's comparator
calling" outer comparator.

> I find that your pg_always_inline macro is equivalent to just plain
> inline on my system (MacOS X v10.6.8, gcc 4.2.1).  It seems to need
> something like this:
>
> +#elif __GNUC__
> +#define pg_always_inline inline __attribute__((always_inline))
>
> ...but I'm not very happy about relying on that, because I don't know
> that it will work on every gcc version (never mind non-gcc compilers),
> and I'm not convinced it's actually improving performance even on this
> one.  The documentation seems to indicate that this is intended to
> force inlining even when not optimizing, which may have something to
> do with the lack of effect: that's not really the point here anyway.

There is a scant reference that suggests this, yes, but that's
contradicted by other scanty sources. The macro was observed to be
helpful on the multi-key case, and the effect was quite apparent and
reproducible. At any rate its appearance in my most recenty patch was
vestigial - I think that there ought to be no need for it, now that
the compiler cost/benefit analysis probably has things right by
inlining.

> What I did instead is to replace template_qsort_arg.h with a script
> called gen_qsort_tuple.pl, which generates a file called qsort_tuple.c
> that tuplesort.c then #includes.  This seems more flexible to me than
> the macro-based approach.  In particular, it allows me to generate
> versions of qsort with different call signatures.  The attached patch
> generates two:
>
> static void qsort_tuple(SortTuple *a, size_t n, SortTupleComparator
> cmp_tuple, Tuplesortstate *state);
> static void qsort_ssup(SortTuple *a, size_t n, SortSupport ssup);
>
> The first of these is a drop-in replacement for qsort_arg() - any
> tuplesort can use it, not just heap sorts.  But it is faster than
> qsort_arg() because of the specializations for the SortTuple data
> type.  The second handles the special case where we are sorting by a
> single key that has an associated SortSupport object.  In this case we
> don't need to carry the overhead of passing around the Tuplesortstate
> and dereferencing it, nor do we need the SortTupleComparator: we can
> just pass the SortSupport itself.  Maybe there's a way to get this
> effect using macros, but I couldn't figure it out.

I've seen the pattern where generic programming is aped with the
preprocessor in in the style of my patch in, of all things, wxWidgets,
where it continues to be used as part of pgAdmin, or was when I worked
on wxWidgets 3 support exactly one year ago. One thing that is
particularly bizarre about that code is that clients actually #include
a cpp file. This is, I'd guess, a legacy of having to target pre C++98
compilers without decent template support. MSVC 6, in particular, was
completely inadequate in this area.

I am inclined to agree that given that we already use Perl to generate
source code like this, it seems natural that we should prefer to do
that, if only to avoid paranoia about the inclusion of a dial-a-bloat
knob. I am at a disadvantage here, since I've never written a line of
Perl.

> With this patch, I get the following results, as compared with your
> 2012-02-10 version and master, using the same test cases I tested
> before.
>
> select * from nodups order by g offset 10001;
> tps on master: 289.471274, 289.967984, 289.595958
> tps on 2012-02-10 version: 359.150280, 356.284723, 356.888900
> tps on attached version: 388.212793, 386.085083, 386.867478
>
> select * from twocol order by a, b offset 10000;
> tps on master: 261.676611, 260.440886, 259.529362
> tps on 2012-02-10 version: 283.941312, 279.981723, 283.140208
> tps on attached version: 283.146463, 278.344827, 280.727798
>
> select * from f8 order by g offset 10000;
> tps on master: 228.299924, 222.650355, 227.408506
> tps on 2012-02-10 version: 260.289273, 257.181035, 256.377456
> tps on attached version: 276.985299, 275.341575, 274.428095
>
> There's some variation (which I can't account for) between the results
> on master now and the results on master before - possibly just code
> shifting around between cache lines due to unrelated changes, or maybe
> some inadvertent change in my test setup.  But it looks to me like
> your 2012-02-10 version, without any type-specific optimizations, does
> pretty much just as well on multi-key sorting as your previous
> version, which had them - or if there is a difference, it's pretty
> small.

I'd say that that's a fair assessment. It does just as well in that
case, but with far fewer additional instructions.

> Overall, I think the numbers for the version I'm attaching here look
> pretty good: the single-key performance is clearly better than your
> last version, and the multi-key performance is very slightly worse.  I
> think that slight worsening is a good trade-off, though, because this
> version can use qsort_tuple() for all kinds of tuplesorts, not just
> heap tuplesorts.  Still, it seems like we ought to be able to do even
> better: the multi-key specialization that you had in your patch can be
> coded in this framework, too, and in theory those are ndependent of
> the swapcode improvements.  I tried coding up a multi-key
> specialization which I believe to be quite similar to what you did,
> but it didn't seem to do much.  I'm attaching it here; maybe there's
> some way to improve it (or a different test case where it pays off).

Hmm. I'll run some tests myself when I get a chance, and attempt to
determine what's going on here. I don't have immediate access to a
good benchmarking server, which would be preferable, and I'd like to
post a revision of pg_stat_statements normalisation today, as it's
been too long. Nice work though.

> It strikes me that if we wanted to take this further, we could look at
> squeezing out ApplySortComparator.  For example, suppose that, upon
> discovering that we can do an in-memory quicksort on a single sort
> key, we make an initial pass over the data where we check whether it's
> sorted and, as we go, swap all the entries with isnull1 = true to the
> end of the memtuples array.  We then sort the isnull1 = true entries
> with the standard comparator, and the isnull1 = false entries with an
> optimized comparator that elides most of ApplySortComparator and
> instead just calls the comparison function directly.  We then decide
> on whether to emit the isnull1 = true entries first or last based on
> NULLS FIRST/LAST, and decide whether to emit the remaining entries in
> forward or reverse order based on ASC/DESC.  Or maybe not exactly that
> thing, but something like that, so that we sort the null and non-null
> entries separately.  The additional complexity in the read-out logic
> would probably be more than offset by being able to use a simpler
> comparator.

While it's really easy to be wrong about these things, I'd venture to
guess that branch prediction makes this a less than compelling win in
practice. That said, if you want to know how good a win it might be,
you need only knock out a quick-and-dirty prototype and see how fast a
sort is - however well this does will be pretty close to your best
case, but not quite, since presumably such a prototype wouldn't worry
about the applicability of the optimisation.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Progress on fast path sorting, btree index creation time

Reply via email to