subject:"avoid toLower in std.algorithm.sort compare alias"

Re: avoid toLower in std.algorithm.sort compare alias

2012-04-25 Thread Marco Leise

Am Sun, 22 Apr 2012 09:23:45 +0200
schrieb "Jay Norwood" :

> On Sunday, 22 April 2012 at 06:26:42 UTC, Jonathan M Davis wrote:
> >
> > You can look at the code. It checks each of the characters in 
> > place. Unlike
> > toLower, it doesn't need to generate a new string. But as far 
> > as the
> > comparison goes, they're the same - hence that line in the docs.
> >
> > - Jonathan M Davis
> 
> ok, I did look at the code just now, and I'll sleep better
> knowing that it doesn't do the whole string conversion.  I
> misunderstood your pseudo-code to mean that two lower case
> strings were being created prior to the compare.
> 
> However,  icmp code does appear to call the toLower conversion on
> both characters without first comparing the characters for
> equality, which misses the chance to do a simple compare that
> would avoid the two calls.

/- check for equality :)
v
cmp!"a != b && std.uni.toLower(a) < std.uni.toLower(b)"(r1, r2)

-- 
Marco

Re: avoid toLower in std.algorithm.sort compare alias

2012-04-24 Thread Jonathan M Davis

On Tuesday, April 24, 2012 12:24:44 Regan Heath wrote:
> On Mon, 23 Apr 2012 16:43:20 +0100, Steven Schveighoffer
> 
>  wrote:
> > While dealing with unicode in my std.stream rewrite, I've found that
> > hand-decoding dchars is way faster than using library calls.
> 
> After watching Andrei's talk on generic and generative programming I have
> to ask, which routines are you avoiding .. it seems we need to make them
> as good as the hand coded code you've written...

In general, when operating on strings generically, you up having to treat them 
as ranges of dchar and decode everything, but there are a lot of cases where 
you can special-case algorithms for narrow strings and avoid decoding them. 
Phobos does this a lot (though it can probably do a better job of it in a 
number of places), so by using functions from there rather than rolling your 
own, the problem is reduced, but any time that you're doing a lot of generic 
string processing, there's a decent chance that you're going to have to 
special case some stuff for arrays of char, wchar, and dchar in order to fully 
optimize it. And I don't think that there's really a way out of that beyond 
having a lot of functions already available (and already optimized) to do a 
lot of the string processing for you. There's a definite tension between 
genericity and effciency in the case of string processing - due primarily to 
variable length encodings.

- Jonathan M Davis

Re: avoid toLower in std.algorithm.sort compare alias

2012-04-24 Thread Steven Schveighoffer

On Tuesday, 24 April 2012 at 14:54:48 UTC, Steven Schveighoffer 
wrote:

On Tuesday, 24 April 2012 at 11:24:44 UTC, Regan Heath wrote:
After watching Andrei's talk on generic and generative 
programming I have to ask, which routines are you avoiding .. 
it seems we need to make them as good as the hand coded code 
you've written...


from memory (don't have the code in front of me right now), it 
was std.uni.decode, and using foreach(dchar d; str) (which 
cannot be inlined currently).


IIRC, std.uni.decode was not being inlined.  So I tried 
hand-inlining it (I also discovered some optimizations it was 
not using), and it made a huge difference.


BTW, you can check out my github branch of phobos named new-io2, 
look at the textstream struct to see what I've inlined.


-Steve

Re: avoid toLower in std.algorithm.sort compare alias

2012-04-24 Thread Steven Schveighoffer


On Tuesday, 24 April 2012 at 11:24:44 UTC, Regan Heath wrote:
On Mon, 23 Apr 2012 16:43:20 +0100, Steven Schveighoffer 
 wrote:


While dealing with unicode in my std.stream rewrite, I've 
found that hand-decoding dchars is way faster than using 
library calls.


After watching Andrei's talk on generic and generative 
programming I have to ask, which routines are you avoiding .. 
it seems we need to make them as good as the hand coded code 
you've written...


from memory (don't have the code in front of me right now), it 
was std.uni.decode, and using foreach(dchar d; str) (which cannot 
be inlined currently).


IIRC, std.uni.decode was not being inlined.  So I tried 
hand-inlining it (I also discovered some optimizations it was not 
using), and it made a huge difference.


In regards to this discussion, I think icmp can also be improved 
when run on a char array, by doing a byte comparison (no dchar 
decoding) until it finds a difference.  That might be a huge 
speedup.  Right now, all dchars are being decoded, and translated 
to the toLower counterpart.  It may have an opposite effect, 
however, if there are a lot of strings that are equivalent when 
ignoring case, but not exactly the same.


-Steve

Re: avoid toLower in std.algorithm.sort compare alias

2012-04-24 Thread Regan Heath

On Mon, 23 Apr 2012 16:43:20 +0100, Steven Schveighoffer  
 wrote:


While dealing with unicode in my std.stream rewrite, I've found that  
hand-decoding dchars is way faster than using library calls.


After watching Andrei's talk on generic and generative programming I have  
to ask, which routines are you avoiding .. it seems we need to make them  
as good as the hand coded code you've written...


R

--
Using Opera's revolutionary email client: http://www.opera.com/mail/

Re: avoid toLower in std.algorithm.sort compare alias

2012-04-23 Thread Steven Schveighoffer


On Mon, 23 Apr 2012 09:49:50 -0400, Jay Norwood  wrote:


On Monday, 23 April 2012 at 11:27:40 UTC, Steven Schveighoffer
wrote:

I think using std.string.icmp is the best solution.  I would expect it  
to outperform even schwartz sort.


-Steve


icmp took longer... added about 1 sec vs  0.3 sec (for
schwartzSort ) to the program execution time.

bool myComp(string x, string y) { return std.string.icmp(x,y)<0; }
std.algorithm.sort!(myComp)(dirs);

finished! time:1396 ms


Well, that's surprising :)  Perhaps there's some room for improvement in  
icmp or std.uni.toLower.  There may be some constructs that are preventing  
inlining (enforce is the worst offender).


While dealing with unicode in my std.stream rewrite, I've found that  
hand-decoding dchars is way faster than using library calls.


-Steve

Re: avoid toLower in std.algorithm.sort compare alias

2012-04-23 Thread Jay Norwood


On Monday, 23 April 2012 at 11:27:40 UTC, Steven Schveighoffer
wrote:

I think using std.string.icmp is the best solution.  I would 
expect it to outperform even schwartz sort.


-Steve


icmp took longer... added about 1 sec vs  0.3 sec (for
schwartzSort ) to the program execution time.

bool myComp(string x, string y) { return std.string.icmp(x,y)<0; }
std.algorithm.sort!(myComp)(dirs);

finished! time:1396 ms

Re: avoid toLower in std.algorithm.sort compare alias

2012-04-23 Thread Steven Schveighoffer


On Sat, 21 Apr 2012 19:24:56 -0400, Jay Norwood  wrote:

While playing with sorting the unzip archive entries I tried use of the  
last example in http://dlang.org/phobos/std_algorithm.html#sort


std.algorithm.sort!("toLower(a.name) <  
toLower(b.name)",std.algorithm.SwapStrategy.stable)(entries);


It was terribly slow  for sorting the 34k entries in my test case.  I'd  
guess it is doing the toLower call on both strings for every compare.


It was much, much faster to add an entries.lowerCaseName =  
std.string.toLower(entries.name) foreach entry prior to the sort  
execution and then use


std.algorithm.sort!("a.lowerCaseName < b.lowerCaseName  
",std.algorithm.SwapStrategy.stable)(entries);


The difference was on the order of 10 secs vs no noticeable delay when  
executing the sort operation in the debugger.


I'll point out what I haven't seen yet:

the issue is not so much toLower being called on every comparison, but  
more that toLower allocates (and then throws away!).


I think using std.string.icmp is the best solution.  I would expect it to  
outperform even schwartz sort.


Note, to answer your question elsewhere, the comment is accurate,  
std.uni.toLower(a) is a function that accepts a dchar, not a string.  What  
the comment is saying is that for the "ranges" (i.e. strings) given, it  
runs the given comparison on the std.uni.toLower() result for each element  
(i.e. dchar).


-Steve

Re: avoid toLower in std.algorithm.sort compare alias

2012-04-22 Thread Jay Norwood


On Sunday, 22 April 2012 at 00:36:19 UTC, bearophile wrote:
Performing the toLower every time the cmp function is called 
doesn't change the O complexity. In Phobos there is an 
alternative sorting (Schwartzian sorting routime) that applies 
a function to each item before sorting them, usually is much 
slower than the normal sorting, but maybe this time it's 
convenient.

Bye,
bearophile


The shwartzSort works fine.
Thanks,
Jay

std.algorithm.schwartzSort!(std.string.toLower, "a < b")(dirs);

G:\d\rmdir2\rmdir2\rmdir2\Debug>rmdir2
removing: g:\tz
finished! time:699 ms

Re: avoid toLower in std.algorithm.sort compare alias

2012-04-22 Thread SomeDude


On Saturday, 21 April 2012 at 23:24:57 UTC, Jay Norwood wrote:
While playing with sorting the unzip archive entries I tried 
use of the last example in 
http://dlang.org/phobos/std_algorithm.html#sort


std.algorithm.sort!("toLower(a.name) < 
toLower(b.name)",std.algorithm.SwapStrategy.stable)(entries);


It was terribly slow  for sorting the 34k entries in my test 
case.  I'd guess it is doing the toLower call on both strings 
for every compare.


It was much, much faster to add an entries.lowerCaseName = 
std.string.toLower(entries.name) foreach entry prior to the 
sort execution and then use


std.algorithm.sort!("a.lowerCaseName < b.lowerCaseName 
",std.algorithm.SwapStrategy.stable)(entries);


The difference was on the order of 10 secs vs no noticeable 
delay when executing the sort operation in the debugger.


Good point. Perhaps this should be added in the documentation of 
std.algorithm ? It's easy to get trapped.

Re: avoid toLower in std.algorithm.sort compare alias

2012-04-22 Thread Jacob Carlborg


On 2012-04-22 01:24, Jay Norwood wrote:

While playing with sorting the unzip archive entries I tried use of the
last example in http://dlang.org/phobos/std_algorithm.html#sort

std.algorithm.sort!("toLower(a.name) <
toLower(b.name)",std.algorithm.SwapStrategy.stable)(entries);

It was terribly slow for sorting the 34k entries in my test case. I'd
guess it is doing the toLower call on both strings for every compare.

It was much, much faster to add an entries.lowerCaseName =
std.string.toLower(entries.name) foreach entry prior to the sort
execution and then use

std.algorithm.sort!("a.lowerCaseName < b.lowerCaseName
",std.algorithm.SwapStrategy.stable)(entries);

The difference was on the order of 10 secs vs no noticeable delay when
executing the sort operation in the debugger.


Perhaps a function that does case folding would be better in this case. 
But as far as I know Phobos doesn't have a function for that.


--
/Jacob Carlborg

Re: toLower() and Unicode are incomplete was: Re: avoid toLower in std.algorithm.sort compare alias

2012-04-22 Thread Dmitry Olshansky

On 22.04.2012 5:43, Ali Çehreli wrote:

On 04/21/2012 04:24 PM, Jay Norwood wrote:
 > While playing with sorting the unzip archive entries I tried use of the
 > last example in http://dlang.org/phobos/std_algorithm.html#sort
 >
 > std.algorithm.sort!("toLower(a.name) <
 > toLower(b.name)",std.algorithm.SwapStrategy.stable)(entries);

Stealing this thread to point out that converting a letter to upper or
lower case cannot be done without knowing the writing system. Phobos's
toLower() documentation currently says: "Returns a string which is
identical to s except that all of its characters are lowercase (in
unicode, not just ASCII)."

Oh, come on. This function wasn't updated for ages. I bet this wording 
here is intact since unicode 4.0 ;)

Unicode cannot define the conversions of at least the following letters
without knowing the actual alphabet that the text is written in:

- Lowercase of I is ı in some alphabets[*] and i in many others.

- Uppercase of i is İ in some alphabets[*] and I in many others.

Fair point. The list however is not that long and a system may choose to 
support this or not (changing behavior based on writing system is called 
tailoring I believe).

Ali

[*] Turkish, Azeri, Chrimean Tatar, Gagauz, Celtic, etc.

--
Dmitry Olshansky

Re: avoid toLower in std.algorithm.sort compare alias

2012-04-22 Thread Jay Norwood


On Sunday, 22 April 2012 at 06:26:42 UTC, Jonathan M Davis wrote:


You can look at the code. It checks each of the characters in 
place. Unlike
toLower, it doesn't need to generate a new string. But as far 
as the

comparison goes, they're the same - hence that line in the docs.

- Jonathan M Davis


ok, I did look at the code just now, and I'll sleep better
knowing that it doesn't do the whole string conversion.  I
misunderstood your pseudo-code to mean that two lower case
strings were being created prior to the compare.

However,  icmp code does appear to call the toLower conversion on
both characters without first comparing the characters for
equality, which misses the chance to do a simple compare that
would avoid the two calls.

Re: avoid toLower in std.algorithm.sort compare alias

2012-04-21 Thread Jonathan M Davis

On Sunday, April 22, 2012 08:20:13 Jay Norwood wrote:
> The comment below worries me a little bit about std.string.icmp.
> What if they are two 1MB strings that differ in he first
> character?  Does it really convert both strings to lower case
> before comparing the first character?
> 
> http://dlang.org/phobos/std_string.html#icmp
> 
> "Technically, icmp(r1, r2) is equivalent to
> cmp!"std.uni.toLower(a) < std.uni.toLower(b)"(r1, r2). "

You can look at the code. It checks each of the characters in place. Unlike 
toLower, it doesn't need to generate a new string. But as far as the 
comparison goes, they're the same - hence that line in the docs.

- Jonathan M Davis

Re: avoid toLower in std.algorithm.sort compare alias

2012-04-21 Thread Jay Norwood


On Sunday, 22 April 2012 at 02:29:45 UTC, Jonathan M Davis wrote:

Regardless of whether it's the Big(O) complexity or the 
constant factor that's
the problem here, clearly there's enough additional overhead 
that it's causing
problems for Jay's particular case. It's also the sort of thing 
that can be
easy to miss and then end up wondering why your code is so slow 
(if it

actually matters in your particular situation).

- Jonathan M Davis


I haven't looked at strncmpi code, but I suspect it is a lot more 
efficient.  For example, in comparing


AbbbCdddEfffXabcdEfgh
AbbbCdddEfffYabcdEfgh

it is not necessary to do case conversion on anything except X 
and Y, and if isUpper(X)==isUpper(Y) then X and Y can be compared 
without conversion, and since X and Y are not equal the remaining 
characters don't have to be converted.


The comment below worries me a little bit about std.string.icmp.  
What if they are two 1MB strings that differ in he first 
character?  Does it really convert both strings to lower case 
before comparing the first character?


http://dlang.org/phobos/std_string.html#icmp

"Technically, icmp(r1, r2) is equivalent to 
cmp!"std.uni.toLower(a) < std.uni.toLower(b)"(r1, r2). "

Re: avoid toLower in std.algorithm.sort compare alias

2012-04-21 Thread Jonathan M Davis

On Saturday, April 21, 2012 18:26:42 H. S. Teoh wrote:
> Actually, I don't think the nested loops affect Big-O complexity at all.
> The O(l) complexity (where l = string length) is already inherent in the
> string comparison "str < otherStr". Adding more loops over the strings
> doesn't change the Big-O complexity (you just make O(l) into 3*O(l)
> which is the same as O(l)).

If the loop is really nested, then it will increase the Big(O) complexity, 
whereas if it's parallel to a similar loop, it will just increase the constant 
factor. Which it really is in this case, I don't know without sitting down and 
sketching out the exact algorithm, but certainly upon a first glance, it was my 
conclusion that the loops were nested. If they're not, then they're not.

Regardless of whether it's the Big(O) complexity or the constant factor that's 
the problem here, clearly there's enough additional overhead that it's causing 
problems for Jay's particular case. It's also the sort of thing that can be 
easy to miss and then end up wondering why your code is so slow (if it 
actually matters in your particular situation).

- Jonathan M Davis

Re: toLower() and Unicode are incomplete was: Re: avoid toLower in std.algorithm.sort compare alias

2012-04-21 Thread Jonathan M Davis

On Saturday, April 21, 2012 18:43:23 Ali Çehreli wrote:
> On 04/21/2012 04:24 PM, Jay Norwood wrote:
>  > While playing with sorting the unzip archive entries I tried use of the
>  > last example in http://dlang.org/phobos/std_algorithm.html#sort
>  > 
>  > std.algorithm.sort!("toLower(a.name) <
>  > toLower(b.name)",std.algorithm.SwapStrategy.stable)(entries);
> 
> Stealing this thread to point out that converting a letter to upper or
> lower case cannot be done without knowing the writing system. Phobos's
> toLower() documentation currently says: "Returns a string which is
> identical to s except that all of its characters are lowercase (in
> unicode, not just ASCII)."
> 
> Unicode cannot define the conversions of at least the following letters
> without knowing the actual alphabet that the text is written in:
> 
> - Lowercase of I is ı in some alphabets[*] and i in many others.
> 
> - Uppercase of i is İ in some alphabets[*] and I in many others.
> 
> Ali
> 
> [*] Turkish, Azeri, Chrimean Tatar, Gagauz, Celtic, etc.

toLower and toUpper get pretty screwing with unicode. I don't know enough 
about non-English alphabets to know what affects what, but at minimum, there 
are a number of cases where toLower does not reverse toUpper (and vice versa). 
Rather, it converts the character into yet another letter. So, toLower to 
toUpper with unicode and definitely a bit iffy. I suppose that they do the job 
if you call them enough on the string that it doesn't change anymore, but I 
don't know.

I also don't know how they act with regards to the various alphabets and how 
their implementation was decided upon. IIRC, Walter wrote them, and I'm sure 
that they're based on the unicode standard, but what that amounts to, I don't 
know.

- Jonathan M Davis

Re: avoid toLower in std.algorithm.sort compare alias

2012-04-21 Thread Jonathan M Davis

On Sunday, April 22, 2012 03:47:30 Jay Norwood wrote:
> On Saturday, 21 April 2012 at 23:54:26 UTC, Jonathan M Davis
> 
> wrote:
> > Yeah. toLower would be called on both strings on _every_
> > compare. And since
> > that involves a loop, that would make the overall call to sort
> > an order of
> > magnitude worse than if you didn't call toLower at all. I'm not
> > sure if it's
> > an order of magnitude worse than your solution though, since
> > it's been a while
> > since I studied Big(O) notation (doing the conversion only once
> > is still more
> > expensive than not converting, but I'm not sure how much more
> > expensive - it
> > might cost less than sort such that it actually doesn't matter
> > as for as
> > Big(O) goes though).
> > 
> > - Jonathan M Davis
> 
> use of toLower in the sort expression adds around 11.2 secs
> overhead to a 0.3 sec operation which reads and sorts the 34k
> directory entries in this 2GB layout on the ssd drive, so it
> isn't an option for me.
> 
> finished! time:312 ms
> 
> finished! time:11598 ms
> 
> std.algorithm.sort!("toLower(a) <
> toLower(b)",std.algorithm.SwapStrategy.stable)(dirs);
> //std.algorithm.sort!("a < b",
> std.algorithm.SwapStrategy.stable)(dirs);

It wasn't saying that it _was_ an option. I was just trying to point out the 
likely reason why it's so bad with toLower - algorithmically-speaking. This 
definitely appears to be a case where doing some extra computation ahead of 
time will save you a lot later.

- Jonathan M Davis

Re: avoid toLower in std.algorithm.sort compare alias

2012-04-21 Thread Jay Norwood

On Saturday, 21 April 2012 at 23:54:26 UTC, Jonathan M Davis 
wrote:
Yeah. toLower would be called on both strings on _every_ 
compare. And since
that involves a loop, that would make the overall call to sort 
an order of
magnitude worse than if you didn't call toLower at all. I'm not 
sure if it's
an order of magnitude worse than your solution though, since 
it's been a while
since I studied Big(O) notation (doing the conversion only once 
is still more
expensive than not converting, but I'm not sure how much more 
expensive - it
might cost less than sort such that it actually doesn't matter 
as for as

Big(O) goes though).

- Jonathan M Davis


use of toLower in the sort expression adds around 11.2 secs 
overhead to a 0.3 sec operation which reads and sorts the 34k 
directory entries in this 2GB layout on the ssd drive, so it 
isn't an option for me.


finished! time:312 ms

finished! time:11598 ms

std.algorithm.sort!("toLower(a) < 
toLower(b)",std.algorithm.SwapStrategy.stable)(dirs);
//std.algorithm.sort!("a < b", 
std.algorithm.SwapStrategy.stable)(dirs);

toLower() and Unicode are incomplete was: Re: avoid toLower in std.algorithm.sort compare alias

2012-04-21 Thread Ali Çehreli

On 04/21/2012 04:24 PM, Jay Norwood wrote:
> While playing with sorting the unzip archive entries I tried use of the
> last example in http://dlang.org/phobos/std_algorithm.html#sort
>
> std.algorithm.sort!("toLower(a.name) <
> toLower(b.name)",std.algorithm.SwapStrategy.stable)(entries);

Stealing this thread to point out that converting a letter to upper or 
lower case cannot be done without knowing the writing system. Phobos's 
toLower() documentation currently says: "Returns a string which is 
identical to s except that all of its characters are lowercase (in 
unicode, not just ASCII)."

Unicode cannot define the conversions of at least the following letters 
without knowing the actual alphabet that the text is written in:

- Lowercase of I is ı in some alphabets[*] and i in many others.

- Uppercase of i is İ in some alphabets[*] and I in many others.

Ali

[*] Turkish, Azeri, Chrimean Tatar, Gagauz, Celtic, etc.

Re: avoid toLower in std.algorithm.sort compare alias

2012-04-21 Thread H. S. Teoh

On Sat, Apr 21, 2012 at 05:45:35PM -0700, Jonathan M Davis wrote:
> On Saturday, April 21, 2012 20:36:18 bearophile wrote:
> > Jonathan M Davis:
> > > I'm not sure if it's an order of magnitude worse than your
> > > solution though, since it's been a while since I studied Big(O)
> > > notation (doing the conversion only once is still more expensive
> > > than not converting, but I'm not sure how much more expensive - it
> > > might cost less than sort such that it actually doesn't matter as
> > > for as Big(O) goes though).
> > 
> > Performing the toLower every time the cmp function is called doesn't change
> > the O complexity.
> 
> Yes it does. It adds a loop to each comparison (two loops actually,
> but since they're not nested, Big(O) only cares about the one), since
> toLower has to loop over all of the characters. As sort loops over
> each of the strings to compare them for moving them into a sorted
> position or not, it calls toLower, which adds a nested loop, so it
> increases the Big(O) complexity. Something like this
> 
> foreach(str; strings)
> str < otherStr;
> 
> becomes
> 
> foreach(str; strings)
> {
> foreach(dchar c; str)
> //lower characters
> 
> foreach(dchar c; otherStr)
> //lower characters
> 
> str < otherStr;
> }
> 
> though that's obviously very pseudo-code-ish and not exactly what sort
> does.  Regardless, those extra loops when the comparison happens are
> nested and therefore increase the Big(O) complexity of the overall
> algorithm.
[...]

Actually, I don't think the nested loops affect Big-O complexity at all.
The O(l) complexity (where l = string length) is already inherent in the
string comparison "str < otherStr". Adding more loops over the strings
doesn't change the Big-O complexity (you just make O(l) into 3*O(l)
which is the same as O(l)).

However, always remember that Big-O notation hides constant factors and
terms. These factors are negligible given a sufficiently large problem
size, but for small problem sizes, the hidden constants are very much
significant. An O(n log n) algorithm may actually take n*log(10*n) steps
or 1000*n*log(n) steps; for large enough n, they're approximately the
same, but when n is small, the 1000 has a very significant hit on
observed performance.

That's why optimizing the constant factors is sometimes necessary
(provided you've already minimized the big-O complexity to the full,
since otherwise you're just pinching pennies yet freely spending $100
bills). Inner loop optimization, like strength reduction, loop invariant
hoisting, etc., are examples of where constant factors get optimized. If
you have a loop:

real x;
for (i=0; i < n; i++) {
// bigComplexCalculation is independent of i
real n = bigComplexCalculation(x);

// make use of n in some way
}

Then moving the bigComplexCalculation() call outside the loop improves
the observed performance significantly, even though you're not changing
the big-O complexity: a small change from 10*n to 8*n means a 20%
improvement in observed performance, even though the algorithm still
degrades linearly with problem size just like before.

T

-- 
Microsoft is to operating systems & security ... what McDonalds is to gourmet 
cooking.

Re: avoid toLower in std.algorithm.sort compare alias

2012-04-21 Thread Jonathan M Davis

On Saturday, April 21, 2012 20:36:18 bearophile wrote:
> Jonathan M Davis:
> > I'm not sure if it's
> > an order of magnitude worse than your solution though, since it's been a
> > while since I studied Big(O) notation (doing the conversion only once is
> > still more expensive than not converting, but I'm not sure how much more
> > expensive - it might cost less than sort such that it actually doesn't
> > matter as for as Big(O) goes though).
> 
> Performing the toLower every time the cmp function is called doesn't change
> the O complexity.

Yes it does. It adds a loop to each comparison (two loops actually, but since 
they're not nested, Big(O) only cares about the one), since toLower has to 
loop over all of the characters. As sort loops over each of the strings to 
compare them for moving them into a sorted position or not, it calls toLower, 
which adds a nested loop, so it increases the Big(O) complexity. Something 
like this

foreach(str; strings)
str < otherStr;

becomes

foreach(str; strings)
{
foreach(dchar c; str)
//lower characters

foreach(dchar c; otherStr)
//lower characters

str < otherStr;
}

though that's obviously very pseudo-code-ish and not exactly what sort does. 
Regardless, those extra loops when the comparison happens are nested and 
therefore increase the Big(O) complexity of the overall algorithm.

- Jonathan M Davis

Re: avoid toLower in std.algorithm.sort compare alias

2012-04-21 Thread bearophile

Jonathan M Davis:

> I'm not sure if it's 
> an order of magnitude worse than your solution though, since it's been a 
> while 
> since I studied Big(O) notation (doing the conversion only once is still more 
> expensive than not converting, but I'm not sure how much more expensive - it 
> might cost less than sort such that it actually doesn't matter as for as 
> Big(O) goes though).

Performing the toLower every time the cmp function is called doesn't change the 
O complexity. In Phobos there is an alternative sorting (Schwartzian sorting 
routime) that applies a function to each item before sorting them, usually is 
much slower than the normal sorting, but maybe this time it's convenient.
The performance improvement in the OP message is large, maybe it's a problem of 
memory allocations of the converted strings... More info on the original code 
is needed to give better answers.

Bye,
bearophile

Re: avoid toLower in std.algorithm.sort compare alias

2012-04-21 Thread Jonathan M Davis

On Sunday, April 22, 2012 01:24:56 Jay Norwood wrote:
> While playing with sorting the unzip archive entries I tried use
> of the last example in
> http://dlang.org/phobos/std_algorithm.html#sort
> 
> std.algorithm.sort!("toLower(a.name) <
> toLower(b.name)",std.algorithm.SwapStrategy.stable)(entries);
> 
> It was terribly slow  for sorting the 34k entries in my test
> case.  I'd guess it is doing the toLower call on both strings for
> every compare.
> 
> It was much, much faster to add an entries.lowerCaseName =
> std.string.toLower(entries.name) foreach entry prior to the sort
> execution and then use
> 
> std.algorithm.sort!("a.lowerCaseName < b.lowerCaseName
> ",std.algorithm.SwapStrategy.stable)(entries);
> 
> The difference was on the order of 10 secs vs no noticeable delay
> when executing the sort operation in the debugger.

Yeah. toLower would be called on both strings on _every_ compare. And since 
that involves a loop, that would make the overall call to sort an order of 
magnitude worse than if you didn't call toLower at all. I'm not sure if it's 
an order of magnitude worse than your solution though, since it's been a while 
since I studied Big(O) notation (doing the conversion only once is still more 
expensive than not converting, but I'm not sure how much more expensive - it 
might cost less than sort such that it actually doesn't matter as for as 
Big(O) goes though).

- Jonathan M Davis

avoid toLower in std.algorithm.sort compare alias

2012-04-21 Thread Jay Norwood

While playing with sorting the unzip archive entries I tried use 
of the last example in 
http://dlang.org/phobos/std_algorithm.html#sort


std.algorithm.sort!("toLower(a.name) < 
toLower(b.name)",std.algorithm.SwapStrategy.stable)(entries);


It was terribly slow  for sorting the 34k entries in my test 
case.  I'd guess it is doing the toLower call on both strings for 
every compare.


It was much, much faster to add an entries.lowerCaseName = 
std.string.toLower(entries.name) foreach entry prior to the sort 
execution and then use


std.algorithm.sort!("a.lowerCaseName < b.lowerCaseName 
",std.algorithm.SwapStrategy.stable)(entries);


The difference was on the order of 10 secs vs no noticeable delay 
when executing the sort operation in the debugger.

Re: avoid toLower in std.algorithm.sort compare alias

Re: avoid toLower in std.algorithm.sort compare alias

Re: avoid toLower in std.algorithm.sort compare alias

Re: avoid toLower in std.algorithm.sort compare alias

Re: avoid toLower in std.algorithm.sort compare alias

Re: avoid toLower in std.algorithm.sort compare alias

Re: avoid toLower in std.algorithm.sort compare alias

Re: avoid toLower in std.algorithm.sort compare alias

Re: avoid toLower in std.algorithm.sort compare alias

Re: avoid toLower in std.algorithm.sort compare alias

Re: avoid toLower in std.algorithm.sort compare alias

Re: toLower() and Unicode are incomplete was: Re: avoid toLower in std.algorithm.sort compare alias

Re: avoid toLower in std.algorithm.sort compare alias

Re: avoid toLower in std.algorithm.sort compare alias

Re: avoid toLower in std.algorithm.sort compare alias

Re: avoid toLower in std.algorithm.sort compare alias

Re: toLower() and Unicode are incomplete was: Re: avoid toLower in std.algorithm.sort compare alias

Re: avoid toLower in std.algorithm.sort compare alias

Re: avoid toLower in std.algorithm.sort compare alias

toLower() and Unicode are incomplete was: Re: avoid toLower in std.algorithm.sort compare alias

Re: avoid toLower in std.algorithm.sort compare alias

Re: avoid toLower in std.algorithm.sort compare alias

Re: avoid toLower in std.algorithm.sort compare alias

Re: avoid toLower in std.algorithm.sort compare alias

avoid toLower in std.algorithm.sort compare alias

25 matches

Site Navigation

Mail list logo

Footer information