On Thu, Aug 7, 2014 at 1:16 PM, Peter Geoghegan <p...@heroku.com> wrote:
> On Thu, Aug 7, 2014 at 8:07 AM, Robert Haas <robertmh...@gmail.com> wrote:
>> So here.  You may not agree that the mitigation strategies for which
>> others are asking are worthwhile, but you can't expect everyone
>> else to agree with your assessment of which cases are likely to occur
>> in practice.  The case of a cohort of strings to be sorted which share
>> a long fixed prefix and have different stuff at the end does not seem
>> particularly pathological to me.  It doesn't, in other words, require
>> an adversary: some real-world data sets will look like that.  I will
>> forbear repeating examples I've given before, but I've seen that kind
>> of thing more than once in real data sets that people (well, me,
>> anyway) actually wanted to put into a PostgreSQL database.  So I'm
>> personally of the opinion that the time you've put into trying to
>> protect against those cases is worthwhile.  I understand that you may
>> disagree with that, and that's fine: we're not all going to agree on
>> everything.
>
> I actually agree with you here.

/me pops cork.

> Sorting text is important, so we
> should spend a lot of time avoiding regressions. Your example is
> reasonable - I believe that people do that to a non-trivial degree.
> The fact is, I probably can greatly ameliorate that case. However, to
> give an example of a case I have less sympathy for, I'll name the case
> you mention, *plus* the strings are already in logical order, and so
> our qsort() can (dubiously) take advantage of that, so that there is a
> 1:1 ratio of wasted strxfrm() calls and strcoll() tie-breaker calls.
> There might be other cases like that that crop up.

I don't think that's an unrealistic case at all.  In
general, I think that if a particular data distribution is a
reasonable scenario, that data distribution plus "it's already sorted"
is also reasonable.  Data gets presorted in all kinds of ways:
sometimes it gets loaded in sorted (or nearly-sorted) order from
another data source; sometimes we do an index scan that produces data
in order by column A and then later sort by A, B; sometimes somebody
clusters the table.  Respecting the right of other people to have
different opinions, I don't think I'll ever be prepared to concede the
presorted case as either uncommon or unimportant.

That's not to prejudge anything that may or may not be in your patch,
which I have not studied in enormous detail.  It's just what I think
about the subject in general.
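
To make the scenario discussed above concrete (a long shared prefix,
plus input that is already in order), here is a rough sketch.  It is
not taken from the patch.  Assumptions: Python's list.sort() stands in
for our qsort(), the abbreviated key is simply the first ABBREV_LEN
characters of the strxfrm() result, and the "C" collation is used so
the run is deterministic.  ABBREV_LEN, count_tie_breaks() and the
sample strings are hypothetical names and data chosen for illustration.

```python
import functools
import locale

# Assumption: "C" collation keeps the result deterministic. Real locales
# produce longer strxfrm() blobs and make strcoll() more expensive.
locale.setlocale(locale.LC_COLLATE, "C")

ABBREV_LEN = 8  # hypothetical abbreviated-key width, in characters


def count_tie_breaks(strings, abbrev_len=ABBREV_LEN):
    """Sort by abbreviated key; return (sorted, comparisons, tie_breaks)."""
    stats = {"comparisons": 0, "tie_breaks": 0}
    # One strxfrm() call per input string, truncated to the key width.
    keyed = [(locale.strxfrm(s)[:abbrev_len], s) for s in strings]

    def compare(a, b):
        stats["comparisons"] += 1
        if a[0] != b[0]:
            # The abbreviated keys alone decide the order.
            return -1 if a[0] < b[0] else 1
        # Keys tie, so the strxfrm() work bought nothing for this pair
        # and the full strings have to be compared with strcoll().
        stats["tie_breaks"] += 1
        return locale.strcoll(a[1], b[1])

    keyed.sort(key=functools.cmp_to_key(compare))
    ordered = [s for _, s in keyed]
    return ordered, stats["comparisons"], stats["tie_breaks"]


if __name__ == "__main__":
    # Presorted strings that differ only after a long shared prefix.
    shared = [f"http://www.example.com/item/{i:04d}" for i in range(1000)]
    # Presorted strings that already differ within the first few characters.
    distinct = [f"{i:04d}/item" for i in range(1000)]
    for name, data in (("shared prefix", shared), ("distinct prefix", distinct)):
        _, comparisons, tie_breaks = count_tie_breaks(data)
        print(name, "comparisons:", comparisons, "tie-breaks:", tie_breaks)
```

With the shared-prefix input every comparison falls through to
strcoll(), so the per-string strxfrm() calls are pure overhead.  With
the distinct-prefix input no tie-breaker is needed at all.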

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)