On Tue, Feb 14, 2012 at 1:44 AM, Hitoshi Harada <umi.tan...@gmail.com> wrote:
> On Sat, Feb 11, 2012 at 11:34 AM, Jeff Janes <jeff.ja...@gmail.com> wrote:
>> On Wed, Feb 8, 2012 at 1:01 AM, Hitoshi Harada <umi.tan...@gmail.com> wrote:
>>> On Sun, Jan 15, 2012 at 4:59 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
>>>>
>>>> The attached patch allows it to reuse that memory.  On my meager
>>>> system it reduced the time to build an index on an integer column in
>>>> a skinny, totally randomly ordered 200-million-row table by about 3%
>>>> from a baseline of 25 minutes.
>>>>
>>>
>>> Just to give a standard review: this patch is a one-line change,
>>> applies cleanly, and builds OK.
>>>
>>> I'm not quite sure what exactly you're trying to accomplish, but it
>>> seems to me that it avoids the first dumptuples cycle by forcing
>>> availMem to 0 even if it's negative.
>>
>> Yes.  Currently when it switches to the TSS_BUILDRUNS part of a
>> tape-sort, it starts by calling WRITETUP a large number of times
>> consecutively, to work off the memory deficit incurred by the 3 blocks
>> per tape of tape overhead, and only after that calls WRITETUP about
>> once per puttuple.  Under my patch, it would call WRITETUP only about
>> once per puttuple, right from the beginning.
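>>
>> To make the arithmetic concrete, here is a toy model of that
>> accounting.  It is just an illustration I put together; the constants
>> are made up and the names (availMem, WRITETUP, TSS_BUILDRUNS,
>> TAPE_BUFFER_OVERHEAD) only echo tuplesort.c.  It is not the real code
>> and not the actual patch:
>>
>>     #include <stdio.h>
>>
>>     #define MAXTAPES             7
>>     #define TAPE_BUFFER_OVERHEAD (3 * 8192L)  /* ~3 blocks per tape */
>>     #define TUPLE_SIZE           100L         /* pretend per-tuple cost */
>>
>>     int main(void)
>>     {
>>         /* memory is essentially full of tuples at the switch */
>>         long availMem = 0;
>>
>>         /* entering TSS_BUILDRUNS charges the tape buffers to availMem */
>>         availMem -= MAXTAPES * TAPE_BUFFER_OVERHEAD;
>>
>>         /* unpatched: dump tuples until the deficit is worked off */
>>         long deficit = -availMem;
>>         long burst = (deficit + TUPLE_SIZE - 1) / TUPLE_SIZE;
>>         printf("unpatched: %ld consecutive WRITETUP calls before any\n"
>>                "incoming tuple is merged into the heap again\n", burst);
>>
>>         /* patched: forgive the deficit, so from here on each incoming
>>          * tuple displaces roughly one written tuple */
>>         if (availMem < 0)
>>             availMem = 0;
>>         printf("patched: availMem clamped to %ld, so ~1 WRITETUP per\n"
>>                "puttuple from the start\n", availMem);
>>         return 0;
>>     }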
>>
>>> I read your comments as saying it avoids
>>> alternating reads and writes back and forth over scattered memory,
>>> which would thrash the memory cache during the merge phase, but
>>> actually it doesn't affect the merge phase, only the initial dump
>>> phase, does it?
>>
>> It affects the building of the runs.  But this building of the runs is
>> not a simple dump, it is itself a mini merge phase, in which it merges
>> the existing in-memory priority queue against the still-incoming
>> tuples from the node which invoked the sort.  Because it uses less
>> memory than it could, the resulting runs are smaller than
>> they could be, and so will sometimes necessitate an additional layer
>> of merging later on.   (This effect is particularly large for the very
>> first run being built.  Generally by merging incoming tuples into the
>> memory-tuples, you can create runs that are 1.7 times the size of what
>> fits in memory.  By wasting some memory, we get 1.7 times the size of a
>> smaller starting point.  But for the first run, it is worse than that.
>>  Most of the benefit that leads to that 1.7 multiplier comes at the
>> very early stage of each run-build.  But by initially using the full
>> memory, then writing out a bunch of tuples without doing any merge of
>> the incoming, we have truncated the part that gives the most benefit.)
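>>
>> (To see that effect in isolation, here is a small self-contained toy
>> simulation of replacement selection I put together.  It is not
>> tuplesort code, just a plain binary heap over random integers, but it
>> shows how merging incoming tuples against the heap stretches each run
>> well past the heap size; this idealized form comes out close to the
>> textbook 2x on random input, while the 1.7 above is what I see in
>> practice.)
>>
>>     #include <stdio.h>
>>
>>     #define HEAPSZ 1000      /* tuples that "fit in memory" */
>>     #define NINPUT 1000000   /* random input tuples to form into runs */
>>
>>     typedef struct { int run; int key; } Item;  /* run tag + sort key */
>>
>>     static Item heap[HEAPSZ];
>>     static int  heapcount = 0;
>>
>>     /* heap order: lower run number first, then lower key */
>>     static int lt(Item a, Item b)
>>     {
>>         return a.run != b.run ? a.run < b.run : a.key < b.key;
>>     }
>>
>>     static void push(Item it)
>>     {
>>         int i = heapcount++;
>>         heap[i] = it;
>>         while (i > 0 && lt(heap[i], heap[(i - 1) / 2]))
>>         {
>>             Item t = heap[i];
>>             heap[i] = heap[(i - 1) / 2]; heap[(i - 1) / 2] = t;
>>             i = (i - 1) / 2;
>>         }
>>     }
>>
>>     static Item pop(void)
>>     {
>>         Item top = heap[0];
>>         heap[0] = heap[--heapcount];
>>         for (int i = 0;;)
>>         {
>>             int l = 2 * i + 1, r = l + 1, s = i;
>>             if (l < heapcount && lt(heap[l], heap[s])) s = l;
>>             if (r < heapcount && lt(heap[r], heap[s])) s = r;
>>             if (s == i) break;
>>             Item t = heap[i]; heap[i] = heap[s]; heap[s] = t;
>>             i = s;
>>         }
>>         return top;
>>     }
>>
>>     /* small 64-bit LCG so the key range doesn't depend on RAND_MAX */
>>     static unsigned long long rng = 42;
>>     static int next_key(void)
>>     {
>>         rng = rng * 6364136223846793005ULL + 1442695040888963407ULL;
>>         return (int) (rng >> 33);
>>     }
>>
>>     int main(void)
>>     {
>>         int  consumed = 0, currun = 0, nruns = 0;
>>         long runlen = 0, firstrun = 0, total = 0;
>>
>>         /* fill memory first, as the pre-tape phase does */
>>         while (heapcount < HEAPSZ && consumed < NINPUT)
>>         {
>>             push((Item){ 0, next_key() });
>>             consumed++;
>>         }
>>
>>         while (heapcount > 0)
>>         {
>>             Item out = pop();
>>             if (out.run != currun)       /* finished a run */
>>             {
>>                 if (currun == 0)
>>                     firstrun = runlen;
>>                 total += runlen; nruns++;
>>                 currun = out.run; runlen = 0;
>>             }
>>             runlen++;                    /* "write" to the current run */
>>
>>             /* merge in the next incoming tuple; a key lower than the
>>              * one just written has to wait for the next run */
>>             if (consumed < NINPUT)
>>             {
>>                 int key = next_key();
>>                 consumed++;
>>                 push((Item){ key < out.key ? currun + 1 : currun, key });
>>             }
>>         }
>>         total += runlen; nruns++;
>>
>>         printf("%d runs, first run %.2fx memory, average %.2fx memory\n",
>>                nruns, (double) firstrun / HEAPSZ,
>>                (double) total / nruns / HEAPSZ);
>>         return 0;
>>     }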
>>
>> My analysis that the "freed" memory is never reused (because we refuse
>> to reuse it ourselves and it is too fragmented to be reused by anyone
>> else, like the palloc or VM system) only applies to the run-building
>> phase.  So "never" was a bit of an overstatement.  By the time the last
>> initial run is completely written out to tape, the heap used for the
>> priority queue should be totally empty.  So at this point the
>> allocator would have the chance to congeal all of the fragmented
>> memory back into larger chunks, or maybe it parcels the allocations
>> back out again in an order so that the unused space is contiguous and
>> could be meaningfully paged out.
>>
>> But is it worth worrying about how much we fragment memory and whether
>> we overshoot our promises by 10 or 20%?
>>
>>> If so,
>>> I'm not convinced your benchmark got a 3% gain from this change.
>>> Correct me, as I'm probably wrong.
>>
>> I've now done more complete testing: building an index on a
>> 200,000,000-row table with an integer column populated in random order
>> with integers from 1..500,000,000, non-unique, on a machine with 2GB
>> of RAM and 600MB of shared_buffers.
>>
>> It improves things by 6-7 percent at the low end of the work_mem sizes
>> I tested; the rest are in the noise, except at 77936 KB, where it
>> reproducibly makes things 4% worse, for reasons I haven't figured out.
>> So maybe the best thing to do is, rather than micromanaging memory
>> usage, simply not to set maintenance_work_mem way too low.  (But that
>> is the default.)
>
> I've tested here with only a million rows, a mix of integer/text (table
> size is 80MB or so), with default settings, running CREATE INDEX
> repeatedly, but couldn't find a performance improvement.  The numbers
> vary from -2% to +2%, which I think is just measurement error.
>
> While I agree with your point that avoiding the first dump would make
> sense, I guess it depends on the situation; if the dump covers lots of
> tuples (which should happen when availMem is big), writing many tuples
> at a time will be faster than writing them little by little later.
>
> I'm not sure about the conclusion, but given this discussion, I'm
> inclined to mark this Returned with Feedback.

OK, thanks.  Does anyone have additional feedback on how tightly we
wish to manage memory usage?  Is trying to make us use as much memory
as we are allowed to, without going over, a worthwhile endeavor at all,
or is it just academic nitpicking?

Also, since the default value of work_mem is quite low, should the
docs be more aggressive in suggesting that people on any realistically
modern hardware tune it upwards?  If you have a few
thousand large sorts running concurrently, I think that having all of
them spill to disk intentionally is probably going to destroy your
server's performance about as much as having them spill to disk via
swapping will.  If you find yourself in this situation, the answer is
probably to use a connection pooler with admission control, not a
work_mem of 1MB.

Cheers,

Jeff
