On Jun21, 2012, at 11:55 , Peter Geoghegan wrote: > On 21 June 2012 10:24, Florian Pflug <f...@phlo.org> wrote: >> On Jun21, 2012, at 02:22 , Peter Geoghegan wrote: >>> I've written a very small C++ program, which I've attached, that >>> basically proves that this can still make a fairly large difference - >>> I hope it's okay that that it's C++, but that allowed me to write the >>> program quickly, with no dependencies for anyone who cares to try this >>> out, other than a C++ compiler and the standard library. >> >> Uh, are you sure the results aren't swapped? string_wrapper takes a >> parameter called use_strcoll, and it indeed uses strcoll() if that >> parameter is true, and strxfm() otherwise. But then it passes *false* >> to determine the speed of strcoll(), and *true* to determine the speed >> of strxfm()... > > Egads, you're right. Apparently in my haste I gave the wrong boolean > argument to the class constructor in each case. I am completely wrong > about this. I apologise for the careless mistake. At least I advanced > the debate, though - we don't want to do any ad-hoc generation of > strxfrm() output within comparators, even when one side can have a > cached value.
FWIW, I've played around a bit with a fixed version of your set_test.cpp. I made it use the cached sort key of the RHS also, made it preserve sort keys during copy construction, even used a boost::shared_ptr to avoid actually copying the cached sort key. The strxfrm() version still performed about 30% worse than the strcoll() version. >From that, I figured that tree inserts don't trigger enough comparisons to make strxfrm() worthwhile. So I mangled set_test.cpp a bit more and instead of a set::set, I created an (C-style) array of string_wrapper instances, and sorted them with std::sort(). Which made strcoll() twice as fast as strxfrm()... At this point, my theory is that your choice of "random" strings prevents strxfrm() from ever winning over strcoll(). The reason being that you pick each letter uniformly distributed from a-z, resulting in a probability of two string differing in the first of 1 - 1/26 =~ 96%. Thus, even for extremely complex collation rules, strcoll() probably only needs to compare a few characters to determine the order. Whereas strxfrm() has to compute the whole sort key, no matter what. The question is thus, how good a model are your "random" strings for the input of a typical sorting step in postgres? My guess is, a quite good one actually, since people probably don't deal with a lot of very similar strings very often. Which makes we wonder if using strxfrm() during sorting wouldn't be a net loss, all things considered… My modified set_test.cpp (which should now be called array_test.cpp) is attached. best regards, Florian Pflug
Description: Binary data
-- Sent via pgsql-hackers mailing list (email@example.com) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers