On 20 June 2012 17:41, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Peter Geoghegan <pe...@2ndquadrant.com> writes:
>> No, I'm suggesting it would probably be at least a bit of a win here
>> to cache the constant, and only have to do a strxfrm() + strcmp() per
>> comparison.
> Um, have you got any hard evidence to support that notion?  The
> traditional advice is that strcoll is faster than using strxfrm unless
> the same strings are to be compared repeatedly.  I'm not convinced that
> saving the strxfrm on just one side will swing the balance.

Fair enough,  but I'd suggest that that traditional advice assumes
that strcoll()'s space-efficiency and general avoidance of dynamic
allocation is an important factor, which, for us, it clearly isn't,
since we can re-use a single buffer in the manner of Robert's text
sortsupport patch for each and every per-tuple strxfrm() blob (when
traversing an index). This is a direct quote from glibc's strcoll_l.c

     Perform the first pass over the string and while doing this find
     and store the weights for each character.  Since we want this to
     be as fast as possible we are using `alloca' to store the temporary
     values.  But since there is no limit on the length of the string
     we have to use `malloc' if the string is too long.  We should be
     very conservative here.

Here, alloca is used to allocate space in a stack frame. I believe
that this is an entirely inappropriate trade-off for Postgres to be
making. strxfrm(), in constrast, leaves buffer sizing and management
up to the caller. That has to be a big part of the problem here.

>> The fact is that this is likely to be a fairly significant
>> performance win, because strxfrm() is quite simply the way you're
>> supposed to do collation-aware sorting, and is documented as such. For
>> that reason, C standard library implementations should not be expected
>> to emphasize its performance - they assume that you're using strxfrm()
>> + their highly optimised strcmp()
> Have you got any evidence in support of this claim, or is it just
> wishful thinking about what's likely to be inside libc?

According to the single-unix specification's strcoll() documentation,
"The strxfrm() and strcmp() functions should be used for sorting large
lists". If that isn't convincing enough for you, there is the fact
that glibc's strcmp() is clearly highly optimised for each and every
architecture, and that we are currently throwing away an extra
strcmp() in the event of strcoll() equality.

> I'd also note that any comparisons you may have seen about this are certainly 
> not
> accounting for the effects of data bloat from strxfrm (ie, possible
> spill to disk, more merge passes, etc).

What about the fact that strcoll() may be repeatedly allocating and
freeing memory per comparison? The blobs really aren't that much
larger than the strings to be sorted, which are typically quite short.

> In any case, if you have to redefine the meaning of equality in order
> to justify a performance patch, I'm prepared to walk away at the start.

The advantage of my proposed implementation is precisely that I won't
have to redefine the meaning of equality, and that only the text
datatype will have to care about equivalency, so you can just skip
over an explanation of equivalency for most audiences.

If you feel that strongly about it, and I have no possible hope of
getting this accepted, I'm glad that I know now rather than after
completing a significant amount of work on this. I would like to hear
other people's opinions before I drop it though.

> The range of likely performance costs/benefits across different locales
> and different implementations is so wide that if you can't show it to be
> a win even with the strcmp tiebreaker, it's not likely to be a reliable
> win without that.

glibc is the implementation that really matters. My test-case used
en_US.UTF-8 as its locale, which has to be one of the least stressful
to strcoll() - I probably could have shown a larger improvement just
by selecting a locale that was known to have to make more passes,
like, say, hu_HU.UTF-8.

I expect to have access to a 16 core server next week, which Bull have
made available to me. Maybe I'll get some interesting performance
numbers from it.

The reason that I don't want to use the blob with original string hack
is because it's ugly, space-inefficient, unnecessary and objectively
incorrect, since it forces us to violate the conformance requirement
C9 of Unicode 3.0, marginal though that may be.

Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to