On Mon, Jul 7, 2014 at 5:37 PM, Peter Geoghegan <p...@heroku.com> wrote: > On Mon, Jul 7, 2014 at 12:57 PM, Robert Haas <robertmh...@gmail.com> wrote: >> 5. It's tempting to look at other ways of solving the parallel sort >> problem that don't need an allocator - perhaps by simply packing all >> the tuples into a DSM one after the next. But this is not easy to do, >> or at least it's not easy to do efficiently. Tuples enter tuplesort.c >> through one of the tuplesort_put*() functions, most of which end up >> calling one of the copytup_*() functions. But you can't easily just >> change those functions to stick the tuple someplace else, because they >> don't necessarily get the address of the space to be used from palloc >> directly. In particular, copytup_heap() calls >> ExecCopySlotMinimalTuple(), and copytup_cluster() calls >> heap_copytuple(). heap_copytuple() is simple enough that you might be >> able to finagle a new API that would make it work, but I'm not sure >> what we could really do about ExecCopySlotMinimalTuple() except call >> it and then copy the result. Perhaps that'd be OK for a first >> version. > > Perhaps. If my experience with sort support for text is anything to go > on, the cost of copying over the tuples isn't really that high > relative to the cost of performing the sort, particularly when there > are many tuples to sort.
The testing I did showed about a 5% overhead on REINDEX INDEX pgbench_accounts_pkey from one extra tuple copy (cf. 9f03ca915196dfc871804a1f8aad26207f601fd6). Of course that could vary by circumstance for a variety of reasons. > OTOH, I'd be concerned about the cost of main > memory accesses for pass-by-reference types during parallel tuple > sorts in particular. The same concern applies to cases where the tuple > proper must be accessed, although presumably that occurs less > frequently. I do think that's a problem with our sort implementation, but it's not clear to me whether it's *more* of a problem for parallel sort than it is for single-backend sort. > The LLVM project's new libc++ library passes remark on a similar issue > in their rationale for yet another C++ standard library: "For example, > it is generally accepted that building std::string using the "short > string optimization" instead of using Copy On Write (COW) is a > superior approach for multicore machines...". It seems likely that > they're mostly referencing the apparent trend of memory bandwidth per > core stagnation , . To get the real benefit of a parallel sort, > we must do anything we can to avoid having cores/workers remain idle > during the sort while waiting on memory access. I believe that that's > the bigger problem. Yes, I agree that's a problem. There are algorithms for parallel quicksort in the literature which seem to offer promising solutions, but unfortunately these infrastructure issues have prevented me from reaching them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (firstname.lastname@example.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers