On Mon, Jul 7, 2014 at 5:37 PM, Peter Geoghegan <p...@heroku.com> wrote:
> On Mon, Jul 7, 2014 at 12:57 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> 5. It's tempting to look at other ways of solving the parallel sort
>> problem that don't need an allocator - perhaps by simply packing all
>> the tuples into a DSM one after the next.  But this is not easy to do,
>> or at least it's not easy to do efficiently.  Tuples enter tuplesort.c
>> through one of the tuplesort_put*() functions, most of which end up
>> calling one of the copytup_*() functions.  But you can't easily just
>> change those functions to stick the tuple someplace else, because they
>> don't necessarily get the address of the space to be used from palloc
>> directly.  In particular, copytup_heap() calls
>> ExecCopySlotMinimalTuple(), and copytup_cluster() calls
>> heap_copytuple().  heap_copytuple() is simple enough that you might be
>> able to finagle a new API that would make it work, but I'm not sure
>> what we could really do about ExecCopySlotMinimalTuple() except call
>> it and then copy the result.  Perhaps that'd be OK for a first
>> version.
> Perhaps. If my experience with sort support for text is anything to go
> on, the cost of copying over the tuples isn't really that high
> relative to the cost of performing the sort, particularly when there
> are many tuples to sort.

The testing I did showed about a 5% overhead on REINDEX INDEX
pgbench_accounts_pkey from one extra tuple copy (cf.
9f03ca915196dfc871804a1f8aad26207f601fd6).  Of course that could vary
by circumstance for a variety of reasons.

> OTOH, I'd be concerned about the cost of main
> memory accesses for pass-by-reference types during parallel tuple
> sorts in particular. The same concern applies to cases where the tuple
> proper must be accessed, although presumably that occurs less
> frequently.

I do think that's a problem with our sort implementation, but it's not
clear to me whether it's *more* of a problem for parallel sort than it
is for single-backend sort.

> The LLVM project's new libc++ library passes remark on a similar issue
> in their rationale for yet another C++ standard library: "For example,
> it is generally accepted that building std::string using the "short
> string optimization" instead of using Copy On Write (COW) is a
> superior approach for multicore machines...". It seems likely that
> they're mostly referencing the apparent trend of memory bandwidth per
> core stagnation [1], [2]. To get the real benefit of a parallel sort,
> we must do anything we can to avoid having cores/workers remain idle
> during the sort while waiting on memory access. I believe that that's
> the bigger problem.

Yes, I agree that's a problem.  There are algorithms for parallel
quicksort in the literature which seem to offer promising solutions,
but unfortunately these infrastructure issues have prevented me from
reaching them.

Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to