On 27 July 2012 16:39, Jeff Janes <jeff.ja...@gmail.com> wrote:
>> Can you suggest a benchmark that will usefully exercise this patch?
>
> I think the given sizes below work on most 64 bit machines.
My apologies for not getting around to taking a look at this sooner.

I tested this patch on my x86_64 laptop:

[peter@peterlaptop tests]$ uname -r
3.4.4-4.fc16.x86_64

I have been able to recreate your results with the work_mem setting
you supplied, 16384 (both queries that you suggested are executed
together, for a less sympathetic workload):

[peter@peterlaptop tests]$ cat sortmemgrow_sort.sql
select count(distinct foo) from (select random() as foo from generate_series(1,524200)) asdf;
select count(distinct foo) from (select random() as foo from generate_series(1,524300)) asdf;

[peter@peterlaptop tests]$ # For master:
[peter@peterlaptop tests]$ pgbench -f sortmemgrow_sort.sql -T 45 -n
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 45 s
number of transactions actually processed: 45
tps = 0.992526 (including connections establishing)
tps = 0.992633 (excluding connections establishing)

[peter@peterlaptop tests]$ # For patch:
[peter@peterlaptop tests]$ pgbench -f sortmemgrow_sort.sql -T 45 -n
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 45 s
number of transactions actually processed: 66
tps = 1.461739 (including connections establishing)
tps = 1.461911 (excluding connections establishing)

I didn't find trace_sort all that useful for my earlier work on
optimising in-memory sorting (at least when running benchmarks): its
granularity is too coarse for simple, relatively inexpensive queries
with sort nodes (or for some tuplesort usage, like the datum sort of
your example), and the instrumentation can have a Heisenberg effect.
I've avoided using it here. The less expensive query's execution
costs are essentially the same with and without the patch.

This patch works by not simply doubling state->memtupsize when
available memory is less than half of allowed memory (i.e. when we
are one grow_memtuples() call away from exceeding allowedMem).
Rather, in that situation, the array is grown to a size that
extrapolates the total amount of memory likely to be needed, in a way
that is quite flawed, but likely to work well for the pass-by-value
Datum case.

Now, on the face of it, this may appear to be a re-hashing of the
question of "how paranoid do we need to be about wasting memory in
memory-constrained situations - should we consider anything other
than a geometric growth rate, to be parsimonious with memory at the
risk of thrashing?". However, this is not really the case, because
this is only a single, last-ditch effort to avoid a particularly
undesirable eventuality. We have little to lose and quite a bit to
gain. A cheap guesstimate seems reasonable.
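To make the shape of that decision concrete, here is a minimal,
self-contained sketch of the growth heuristic as I read it. The
function name and plain types are mine for illustration only; the
real code works in terms of state->memtupsize, availMem and
allowedMem inside grow_memtuples():

#include <math.h>

/*
 * Illustration only -- not the patch itself.  Decide the next size of
 * the SortTuple array, given the current array size and the bytes
 * already used against the allowed budget.
 */
static int
next_memtupsize(int memtupsize, double memNowUsed, double allowedMem)
{
    double  availMem = allowedMem - memNowUsed;

    if (availMem >= memNowUsed)
    {
        /*
         * Less than half of the budget is used, so doubling cannot
         * overrun allowedMem (assuming per-tuple usage stays constant).
         */
        return memtupsize * 2;
    }

    /*
     * Last-ditch growth: extrapolate the array size that would just
     * fit within allowedMem, assuming memory per tuple stays constant
     * (strictly true only for pass-by-value datum sorts).  floor()
     * rather than ceil() leaves a little slack for palloc chunk-header
     * overhead.
     */
    return (int) floor(memtupsize * allowedMem / memNowUsed);
}

This is also where the floor()-versus-ceil() question discussed below
comes in: the extrapolated size lands right at the boundary, so any
unaccounted-for overhead tips us into LACKMEM.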
I have attached a revision for your consideration, with a few
editorialisations, mostly style-related.

I removed this entirely:

+ * XXXX is the new method still follow this? The last allocation is no
+ * longer necessarily a power of 2, but that is not freed.

I did so because, according to aset.c:

 * Further improvement 12/00: as the code stood, request sizes in the
 * midrange between "small" and "large" were handled very inefficiently,
 * because any sufficiently large free chunk would be used to satisfy a
 * request, even if it was much larger than necessary.  This led to more
 * and more wasted space in allocated chunks over time.  To fix, get rid
 * of the midrange behavior: we now handle only "small" power-of-2-size
 * chunks as chunks.  Anything "large" is passed off to malloc().  Change
 * the number of freelists to change the small/large boundary.

So, at the top of grow_memtuples, this remark still holds:

 * and so there will be no unexpected addition to what we ask for.  (The
 * minimum array size established in tuplesort_begin_common is large
 * enough to force palloc to treat it as a separate chunk, so this
 * assumption should be good.  But let's check it.)

It continues to hold because we're still invariably passing off this
request to malloc() (or, in this particular case, realloc()),
regardless of the alignment of the request size.

Certainly, the extant code verifies !LACKMEM, and the regression
tests pass when this patch is applied. However, there is still a
danger of LACKMEM. This revision has the following logic for
extrapolating newmemtupsize (this is almost the same as the original
patch):

+ newmemtupsize = (int) floor(oldmemtupsize * allowedMem / memNowUsed);

Suppose that the code was:

+ newmemtupsize = (int) ceil(oldmemtupsize * allowedMem / memNowUsed);

We'd now have an error (a failure of that !LACKMEM check) with your
latter example query, due to the inclusion of a single additional
SortTuple. This is because GetMemoryChunkSpace (and, ultimately,
AllocSetGetChunkSpace) returns the chunk header size too. We were
getting away with that before with the doubling strategy, but now we
have to factor that header's size into allowedMem. I don't think my
revision is satisfactory in what it does here. It isn't obvious what
a better-principled implementation would do, though. Does anyone have
any other ideas?

I think this patch (or at least your observation about I/O waits
within vmstat) may point to a more fundamental issue with our sort
code: why are we not using asynchronous I/O in our implementation?
There are anecdotal reports of other RDBMS implementations doing far
better than we do here, and I believe asynchronous I/O, pipelining,
and other such optimisations have a lot to do with that. It's
something I'd hoped to find the time to look at in detail, but
probably won't in the 9.3 cycle.

One of the more obvious ways of optimising an external sort is to use
asynchronous I/O, so that one run of data can be sorted or merged
while other runs are being read from or written to disk. Our current
implementation seems naive about this. There are some interesting
details about how this is exposed by POSIX here:

http://www.gnu.org/software/libc/manual/html_node/Asynchronous-I_002fO.html

It's already anticipated that we might take advantage of libaio for
the benefit of FilePrefetch() (see its accompanying comments - it
uses posix_fadvise itself - effective_io_concurrency must be > 0 for
this to ever be called). Asynchronous I/O could perhaps be considered
parallel "low-hanging fruit", in that it allows us to offer limited
though useful backend parallelism without first resolving thorny
issues around what abstraction we might use, or how we might
eventually make backends thread-safe. AIO supports registering signal
callbacks (a SIGPOLL handler can be called), which seems relatively
uncontroversial. Platform support for AIO might be a bit lacking, but
then you can say the same about posix_fadvise. We don't assume that
poll(2) is available, but we already use it where it is within the
latch code. Besides, in-kernel support can be emulated where POSIX
threads are available, which I believe would make this broadly useful
on unix-like platforms.
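For what it's worth, here is a minimal, hypothetical sketch of the
POSIX AIO pattern described at that link - kick off a read of the
next block of a run, do merge work while it is in flight, then reap
the result. Nothing here is from our sort code; the file descriptor,
block size and merge_step() stub are made up, and on Linux this needs
-lrt:

#include <aio.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

#define RUN_BLOCK_SIZE 8192

/* Stand-in for "merge/sort whatever is already in memory". */
static void
merge_step(void)
{
    /* CPU-bound work overlaps with the read issued below. */
}

/*
 * Read the next block of a run asynchronously, overlap it with merge
 * work, then wait for completion.  Returns bytes read, or -1 on error.
 */
static ssize_t
overlapped_read(int fd, off_t offset, char *buf)
{
    struct aiocb        cb;
    const struct aiocb *list[1];

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_offset = offset;
    cb.aio_buf = buf;
    cb.aio_nbytes = RUN_BLOCK_SIZE;

    if (aio_read(&cb) != 0)
    {
        perror("aio_read");
        return -1;
    }

    merge_step();               /* useful work while the read is in flight */

    list[0] = &cb;
    while (aio_error(&cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);     /* block until the request settles */

    return aio_return(&cb);
}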
--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

sortmem_grow-v2.patch
Description: Binary data