Re: avoid toLower in std.algorithm.sort compare alias

2012-04-26 Thread Marco Leise
Am Sun, 22 Apr 2012 09:23:45 +0200
schrieb Jay Norwood j...@prismnet.com:

 On Sunday, 22 April 2012 at 06:26:42 UTC, Jonathan M Davis wrote:
 
  You can look at the code. It checks each of the characters in 
  place. Unlike
  toLower, it doesn't need to generate a new string. But as far 
  as the
  comparison goes, they're the same - hence that line in the docs.
 
  - Jonathan M Davis
 
 ok, I did look at the code just now, and I'll sleep better
 knowing that it doesn't do the whole string conversion.  I
 misunderstood your pseudo-code to mean that two lower case
 strings were being created prior to the compare.
 
 However,  icmp code does appear to call the toLower conversion on
 both characters without first comparing the characters for
 equality, which misses the chance to do a simple compare that
 would avoid the two calls.

/- check for equality :)
v
cmp!a != b  std.uni.toLower(a)  std.uni.toLower(b)(r1, r2)

-- 
Marco



Re: avoid toLower in std.algorithm.sort compare alias

2012-04-24 Thread Regan Heath
On Mon, 23 Apr 2012 16:43:20 +0100, Steven Schveighoffer  
schvei...@yahoo.com wrote:


While dealing with unicode in my std.stream rewrite, I've found that  
hand-decoding dchars is way faster than using library calls.


After watching Andrei's talk on generic and generative programming I have  
to ask, which routines are you avoiding .. it seems we need to make them  
as good as the hand coded code you've written...


R

--
Using Opera's revolutionary email client: http://www.opera.com/mail/


Re: avoid toLower in std.algorithm.sort compare alias

2012-04-24 Thread Steven Schveighoffer

On Tuesday, 24 April 2012 at 11:24:44 UTC, Regan Heath wrote:
On Mon, 23 Apr 2012 16:43:20 +0100, Steven Schveighoffer 
schvei...@yahoo.com wrote:


While dealing with unicode in my std.stream rewrite, I've 
found that hand-decoding dchars is way faster than using 
library calls.


After watching Andrei's talk on generic and generative 
programming I have to ask, which routines are you avoiding .. 
it seems we need to make them as good as the hand coded code 
you've written...


from memory (don't have the code in front of me right now), it 
was std.uni.decode, and using foreach(dchar d; str) (which cannot 
be inlined currently).


IIRC, std.uni.decode was not being inlined.  So I tried 
hand-inlining it (I also discovered some optimizations it was not 
using), and it made a huge difference.


In regards to this discussion, I think icmp can also be improved 
when run on a char array, by doing a byte comparison (no dchar 
decoding) until it finds a difference.  That might be a huge 
speedup.  Right now, all dchars are being decoded, and translated 
to the toLower counterpart.  It may have an opposite effect, 
however, if there are a lot of strings that are equivalent when 
ignoring case, but not exactly the same.


-Steve


Re: avoid toLower in std.algorithm.sort compare alias

2012-04-24 Thread Steven Schveighoffer
On Tuesday, 24 April 2012 at 14:54:48 UTC, Steven Schveighoffer 
wrote:

On Tuesday, 24 April 2012 at 11:24:44 UTC, Regan Heath wrote:
After watching Andrei's talk on generic and generative 
programming I have to ask, which routines are you avoiding .. 
it seems we need to make them as good as the hand coded code 
you've written...


from memory (don't have the code in front of me right now), it 
was std.uni.decode, and using foreach(dchar d; str) (which 
cannot be inlined currently).


IIRC, std.uni.decode was not being inlined.  So I tried 
hand-inlining it (I also discovered some optimizations it was 
not using), and it made a huge difference.


BTW, you can check out my github branch of phobos named new-io2, 
look at the textstream struct to see what I've inlined.


-Steve


Re: avoid toLower in std.algorithm.sort compare alias

2012-04-24 Thread Jonathan M Davis
On Tuesday, April 24, 2012 12:24:44 Regan Heath wrote:
 On Mon, 23 Apr 2012 16:43:20 +0100, Steven Schveighoffer
 
 schvei...@yahoo.com wrote:
  While dealing with unicode in my std.stream rewrite, I've found that
  hand-decoding dchars is way faster than using library calls.
 
 After watching Andrei's talk on generic and generative programming I have
 to ask, which routines are you avoiding .. it seems we need to make them
 as good as the hand coded code you've written...

In general, when operating on strings generically, you up having to treat them 
as ranges of dchar and decode everything, but there are a lot of cases where 
you can special-case algorithms for narrow strings and avoid decoding them. 
Phobos does this a lot (though it can probably do a better job of it in a 
number of places), so by using functions from there rather than rolling your 
own, the problem is reduced, but any time that you're doing a lot of generic 
string processing, there's a decent chance that you're going to have to 
special case some stuff for arrays of char, wchar, and dchar in order to fully 
optimize it. And I don't think that there's really a way out of that beyond 
having a lot of functions already available (and already optimized) to do a 
lot of the string processing for you. There's a definite tension between 
genericity and effciency in the case of string processing - due primarily to 
variable length encodings.

- Jonathan M Davis


Re: avoid toLower in std.algorithm.sort compare alias

2012-04-23 Thread Steven Schveighoffer

On Sat, 21 Apr 2012 19:24:56 -0400, Jay Norwood j...@prismnet.com wrote:

While playing with sorting the unzip archive entries I tried use of the  
last example in http://dlang.org/phobos/std_algorithm.html#sort


std.algorithm.sort!(toLower(a.name)   
toLower(b.name),std.algorithm.SwapStrategy.stable)(entries);


It was terribly slow  for sorting the 34k entries in my test case.  I'd  
guess it is doing the toLower call on both strings for every compare.


It was much, much faster to add an entries.lowerCaseName =  
std.string.toLower(entries.name) foreach entry prior to the sort  
execution and then use


std.algorithm.sort!(a.lowerCaseName  b.lowerCaseName  
,std.algorithm.SwapStrategy.stable)(entries);


The difference was on the order of 10 secs vs no noticeable delay when  
executing the sort operation in the debugger.


I'll point out what I haven't seen yet:

the issue is not so much toLower being called on every comparison, but  
more that toLower allocates (and then throws away!).


I think using std.string.icmp is the best solution.  I would expect it to  
outperform even schwartz sort.


Note, to answer your question elsewhere, the comment is accurate,  
std.uni.toLower(a) is a function that accepts a dchar, not a string.  What  
the comment is saying is that for the ranges (i.e. strings) given, it  
runs the given comparison on the std.uni.toLower() result for each element  
(i.e. dchar).


-Steve


Re: avoid toLower in std.algorithm.sort compare alias

2012-04-23 Thread Jay Norwood

On Monday, 23 April 2012 at 11:27:40 UTC, Steven Schveighoffer
wrote:

I think using std.string.icmp is the best solution.  I would 
expect it to outperform even schwartz sort.


-Steve


icmp took longer... added about 1 sec vs  0.3 sec (for
schwartzSort ) to the program execution time.

bool myComp(string x, string y) { return std.string.icmp(x,y)0; }
std.algorithm.sort!(myComp)(dirs);

finished! time:1396 ms






Re: avoid toLower in std.algorithm.sort compare alias

2012-04-23 Thread Steven Schveighoffer

On Mon, 23 Apr 2012 09:49:50 -0400, Jay Norwood j...@prismnet.com wrote:


On Monday, 23 April 2012 at 11:27:40 UTC, Steven Schveighoffer
wrote:

I think using std.string.icmp is the best solution.  I would expect it  
to outperform even schwartz sort.


-Steve


icmp took longer... added about 1 sec vs  0.3 sec (for
schwartzSort ) to the program execution time.

bool myComp(string x, string y) { return std.string.icmp(x,y)0; }
std.algorithm.sort!(myComp)(dirs);

finished! time:1396 ms


Well, that's surprising :)  Perhaps there's some room for improvement in  
icmp or std.uni.toLower.  There may be some constructs that are preventing  
inlining (enforce is the worst offender).


While dealing with unicode in my std.stream rewrite, I've found that  
hand-decoding dchars is way faster than using library calls.


-Steve


Re: avoid toLower in std.algorithm.sort compare alias

2012-04-22 Thread Jay Norwood

On Sunday, 22 April 2012 at 02:29:45 UTC, Jonathan M Davis wrote:

Regardless of whether it's the Big(O) complexity or the 
constant factor that's
the problem here, clearly there's enough additional overhead 
that it's causing
problems for Jay's particular case. It's also the sort of thing 
that can be
easy to miss and then end up wondering why your code is so slow 
(if it

actually matters in your particular situation).

- Jonathan M Davis


I haven't looked at strncmpi code, but I suspect it is a lot more 
efficient.  For example, in comparing


AbbbCdddEfffXabcdEfgh
AbbbCdddEfffYabcdEfgh

it is not necessary to do case conversion on anything except X 
and Y, and if isUpper(X)==isUpper(Y) then X and Y can be compared 
without conversion, and since X and Y are not equal the remaining 
characters don't have to be converted.


The comment below worries me a little bit about std.string.icmp.  
What if they are two 1MB strings that differ in he first 
character?  Does it really convert both strings to lower case 
before comparing the first character?


http://dlang.org/phobos/std_string.html#icmp

Technically, icmp(r1, r2) is equivalent to 
cmp!std.uni.toLower(a)  std.uni.toLower(b)(r1, r2). 





Re: avoid toLower in std.algorithm.sort compare alias

2012-04-22 Thread Jonathan M Davis
On Sunday, April 22, 2012 08:20:13 Jay Norwood wrote:
 The comment below worries me a little bit about std.string.icmp.
 What if they are two 1MB strings that differ in he first
 character?  Does it really convert both strings to lower case
 before comparing the first character?
 
 http://dlang.org/phobos/std_string.html#icmp
 
 Technically, icmp(r1, r2) is equivalent to
 cmp!std.uni.toLower(a)  std.uni.toLower(b)(r1, r2). 

You can look at the code. It checks each of the characters in place. Unlike 
toLower, it doesn't need to generate a new string. But as far as the 
comparison goes, they're the same - hence that line in the docs.

- Jonathan M Davis


Re: avoid toLower in std.algorithm.sort compare alias

2012-04-22 Thread Jay Norwood

On Sunday, 22 April 2012 at 06:26:42 UTC, Jonathan M Davis wrote:


You can look at the code. It checks each of the characters in 
place. Unlike
toLower, it doesn't need to generate a new string. But as far 
as the

comparison goes, they're the same - hence that line in the docs.

- Jonathan M Davis


ok, I did look at the code just now, and I'll sleep better
knowing that it doesn't do the whole string conversion.  I
misunderstood your pseudo-code to mean that two lower case
strings were being created prior to the compare.

However,  icmp code does appear to call the toLower conversion on
both characters without first comparing the characters for
equality, which misses the chance to do a simple compare that
would avoid the two calls.





Re: toLower() and Unicode are incomplete was: Re: avoid toLower in std.algorithm.sort compare alias

2012-04-22 Thread Dmitry Olshansky

On 22.04.2012 5:43, Ali Çehreli wrote:

On 04/21/2012 04:24 PM, Jay Norwood wrote:
  While playing with sorting the unzip archive entries I tried use of the
  last example in http://dlang.org/phobos/std_algorithm.html#sort
 
  std.algorithm.sort!(toLower(a.name) 
  toLower(b.name),std.algorithm.SwapStrategy.stable)(entries);

Stealing this thread to point out that converting a letter to upper or
lower case cannot be done without knowing the writing system. Phobos's
toLower() documentation currently says: Returns a string which is
identical to s except that all of its characters are lowercase (in
unicode, not just ASCII).


Oh, come on. This function wasn't updated for ages. I bet this wording 
here is intact since unicode 4.0 ;)




Unicode cannot define the conversions of at least the following letters
without knowing the actual alphabet that the text is written in:

- Lowercase of I is ı in some alphabets[*] and i in many others.

- Uppercase of i is İ in some alphabets[*] and I in many others.



Fair point. The list however is not that long and a system may choose to 
support this or not (changing behavior based on writing system is called 
tailoring I believe).



Ali

[*] Turkish, Azeri, Chrimean Tatar, Gagauz, Celtic, etc.




--
Dmitry Olshansky


Re: avoid toLower in std.algorithm.sort compare alias

2012-04-22 Thread Jacob Carlborg

On 2012-04-22 01:24, Jay Norwood wrote:

While playing with sorting the unzip archive entries I tried use of the
last example in http://dlang.org/phobos/std_algorithm.html#sort

std.algorithm.sort!(toLower(a.name) 
toLower(b.name),std.algorithm.SwapStrategy.stable)(entries);

It was terribly slow for sorting the 34k entries in my test case. I'd
guess it is doing the toLower call on both strings for every compare.

It was much, much faster to add an entries.lowerCaseName =
std.string.toLower(entries.name) foreach entry prior to the sort
execution and then use

std.algorithm.sort!(a.lowerCaseName  b.lowerCaseName
,std.algorithm.SwapStrategy.stable)(entries);

The difference was on the order of 10 secs vs no noticeable delay when
executing the sort operation in the debugger.


Perhaps a function that does case folding would be better in this case. 
But as far as I know Phobos doesn't have a function for that.


--
/Jacob Carlborg


Re: avoid toLower in std.algorithm.sort compare alias

2012-04-22 Thread SomeDude

On Saturday, 21 April 2012 at 23:24:57 UTC, Jay Norwood wrote:
While playing with sorting the unzip archive entries I tried 
use of the last example in 
http://dlang.org/phobos/std_algorithm.html#sort


std.algorithm.sort!(toLower(a.name)  
toLower(b.name),std.algorithm.SwapStrategy.stable)(entries);


It was terribly slow  for sorting the 34k entries in my test 
case.  I'd guess it is doing the toLower call on both strings 
for every compare.


It was much, much faster to add an entries.lowerCaseName = 
std.string.toLower(entries.name) foreach entry prior to the 
sort execution and then use


std.algorithm.sort!(a.lowerCaseName  b.lowerCaseName 
,std.algorithm.SwapStrategy.stable)(entries);


The difference was on the order of 10 secs vs no noticeable 
delay when executing the sort operation in the debugger.


Good point. Perhaps this should be added in the documentation of 
std.algorithm ? It's easy to get trapped.


Re: avoid toLower in std.algorithm.sort compare alias

2012-04-22 Thread Jay Norwood

On Sunday, 22 April 2012 at 00:36:19 UTC, bearophile wrote:
Performing the toLower every time the cmp function is called 
doesn't change the O complexity. In Phobos there is an 
alternative sorting (Schwartzian sorting routime) that applies 
a function to each item before sorting them, usually is much 
slower than the normal sorting, but maybe this time it's 
convenient.

Bye,
bearophile


The shwartzSort works fine.
Thanks,
Jay

std.algorithm.schwartzSort!(std.string.toLower, a  b)(dirs);

G:\d\rmdir2\rmdir2\rmdir2\Debugrmdir2
removing: g:\tz
finished! time:699 ms



Re: avoid toLower in std.algorithm.sort compare alias

2012-04-21 Thread Jonathan M Davis
On Sunday, April 22, 2012 01:24:56 Jay Norwood wrote:
 While playing with sorting the unzip archive entries I tried use
 of the last example in
 http://dlang.org/phobos/std_algorithm.html#sort
 
 std.algorithm.sort!(toLower(a.name) 
 toLower(b.name),std.algorithm.SwapStrategy.stable)(entries);
 
 It was terribly slow  for sorting the 34k entries in my test
 case.  I'd guess it is doing the toLower call on both strings for
 every compare.
 
 It was much, much faster to add an entries.lowerCaseName =
 std.string.toLower(entries.name) foreach entry prior to the sort
 execution and then use
 
 std.algorithm.sort!(a.lowerCaseName  b.lowerCaseName
 ,std.algorithm.SwapStrategy.stable)(entries);
 
 The difference was on the order of 10 secs vs no noticeable delay
 when executing the sort operation in the debugger.

Yeah. toLower would be called on both strings on _every_ compare. And since 
that involves a loop, that would make the overall call to sort an order of 
magnitude worse than if you didn't call toLower at all. I'm not sure if it's 
an order of magnitude worse than your solution though, since it's been a while 
since I studied Big(O) notation (doing the conversion only once is still more 
expensive than not converting, but I'm not sure how much more expensive - it 
might cost less than sort such that it actually doesn't matter as for as 
Big(O) goes though).

- Jonathan M Davis


Re: avoid toLower in std.algorithm.sort compare alias

2012-04-21 Thread Jonathan M Davis
On Saturday, April 21, 2012 20:36:18 bearophile wrote:
 Jonathan M Davis:
  I'm not sure if it's
  an order of magnitude worse than your solution though, since it's been a
  while since I studied Big(O) notation (doing the conversion only once is
  still more expensive than not converting, but I'm not sure how much more
  expensive - it might cost less than sort such that it actually doesn't
  matter as for as Big(O) goes though).
 
 Performing the toLower every time the cmp function is called doesn't change
 the O complexity.

Yes it does. It adds a loop to each comparison (two loops actually, but since 
they're not nested, Big(O) only cares about the one), since toLower has to 
loop over all of the characters. As sort loops over each of the strings to 
compare them for moving them into a sorted position or not, it calls toLower, 
which adds a nested loop, so it increases the Big(O) complexity. Something 
like this

foreach(str; strings)
str  otherStr;

becomes

foreach(str; strings)
{
foreach(dchar c; str)
//lower characters

foreach(dchar c; otherStr)
//lower characters

str  otherStr;
}

though that's obviously very pseudo-code-ish and not exactly what sort does. 
Regardless, those extra loops when the comparison happens are nested and 
therefore increase the Big(O) complexity of the overall algorithm.

- Jonathan M Davis


Re: avoid toLower in std.algorithm.sort compare alias

2012-04-21 Thread H. S. Teoh
On Sat, Apr 21, 2012 at 05:45:35PM -0700, Jonathan M Davis wrote:
 On Saturday, April 21, 2012 20:36:18 bearophile wrote:
  Jonathan M Davis:
   I'm not sure if it's an order of magnitude worse than your
   solution though, since it's been a while since I studied Big(O)
   notation (doing the conversion only once is still more expensive
   than not converting, but I'm not sure how much more expensive - it
   might cost less than sort such that it actually doesn't matter as
   for as Big(O) goes though).
  
  Performing the toLower every time the cmp function is called doesn't change
  the O complexity.
 
 Yes it does. It adds a loop to each comparison (two loops actually,
 but since they're not nested, Big(O) only cares about the one), since
 toLower has to loop over all of the characters. As sort loops over
 each of the strings to compare them for moving them into a sorted
 position or not, it calls toLower, which adds a nested loop, so it
 increases the Big(O) complexity. Something like this
 
 foreach(str; strings)
 str  otherStr;
 
 becomes
 
 foreach(str; strings)
 {
 foreach(dchar c; str)
 //lower characters
 
 foreach(dchar c; otherStr)
 //lower characters
 
 str  otherStr;
 }
 
 though that's obviously very pseudo-code-ish and not exactly what sort
 does.  Regardless, those extra loops when the comparison happens are
 nested and therefore increase the Big(O) complexity of the overall
 algorithm.
[...]

Actually, I don't think the nested loops affect Big-O complexity at all.
The O(l) complexity (where l = string length) is already inherent in the
string comparison str  otherStr. Adding more loops over the strings
doesn't change the Big-O complexity (you just make O(l) into 3*O(l)
which is the same as O(l)).

However, always remember that Big-O notation hides constant factors and
terms. These factors are negligible given a sufficiently large problem
size, but for small problem sizes, the hidden constants are very much
significant. An O(n log n) algorithm may actually take n*log(10*n) steps
or 1000*n*log(n) steps; for large enough n, they're approximately the
same, but when n is small, the 1000 has a very significant hit on
observed performance.

That's why optimizing the constant factors is sometimes necessary
(provided you've already minimized the big-O complexity to the full,
since otherwise you're just pinching pennies yet freely spending $100
bills). Inner loop optimization, like strength reduction, loop invariant
hoisting, etc., are examples of where constant factors get optimized. If
you have a loop:

real x;
for (i=0; i  n; i++) {
// bigComplexCalculation is independent of i
real n = bigComplexCalculation(x);

// make use of n in some way
}

Then moving the bigComplexCalculation() call outside the loop improves
the observed performance significantly, even though you're not changing
the big-O complexity: a small change from 10*n to 8*n means a 20%
improvement in observed performance, even though the algorithm still
degrades linearly with problem size just like before.


T

-- 
Microsoft is to operating systems  security ... what McDonalds is to gourmet 
cooking.


toLower() and Unicode are incomplete was: Re: avoid toLower in std.algorithm.sort compare alias

2012-04-21 Thread Ali Çehreli

On 04/21/2012 04:24 PM, Jay Norwood wrote:
 While playing with sorting the unzip archive entries I tried use of the
 last example in http://dlang.org/phobos/std_algorithm.html#sort

 std.algorithm.sort!(toLower(a.name) 
 toLower(b.name),std.algorithm.SwapStrategy.stable)(entries);

Stealing this thread to point out that converting a letter to upper or 
lower case cannot be done without knowing the writing system. Phobos's 
toLower() documentation currently says: Returns a string which is 
identical to s except that all of its characters are lowercase (in 
unicode, not just ASCII).


Unicode cannot define the conversions of at least the following letters 
without knowing the actual alphabet that the text is written in:


- Lowercase of I is ı in some alphabets[*] and i in many others.

- Uppercase of i is İ in some alphabets[*] and I in many others.

Ali

[*] Turkish, Azeri, Chrimean Tatar, Gagauz, Celtic, etc.



Re: avoid toLower in std.algorithm.sort compare alias

2012-04-21 Thread Jay Norwood
On Saturday, 21 April 2012 at 23:54:26 UTC, Jonathan M Davis 
wrote:
Yeah. toLower would be called on both strings on _every_ 
compare. And since
that involves a loop, that would make the overall call to sort 
an order of
magnitude worse than if you didn't call toLower at all. I'm not 
sure if it's
an order of magnitude worse than your solution though, since 
it's been a while
since I studied Big(O) notation (doing the conversion only once 
is still more
expensive than not converting, but I'm not sure how much more 
expensive - it
might cost less than sort such that it actually doesn't matter 
as for as

Big(O) goes though).

- Jonathan M Davis


use of toLower in the sort expression adds around 11.2 secs 
overhead to a 0.3 sec operation which reads and sorts the 34k 
directory entries in this 2GB layout on the ssd drive, so it 
isn't an option for me.


finished! time:312 ms

finished! time:11598 ms

std.algorithm.sort!(toLower(a)  
toLower(b),std.algorithm.SwapStrategy.stable)(dirs);
//std.algorithm.sort!(a  b, 
std.algorithm.SwapStrategy.stable)(dirs);





Re: avoid toLower in std.algorithm.sort compare alias

2012-04-21 Thread Jonathan M Davis
On Sunday, April 22, 2012 03:47:30 Jay Norwood wrote:
 On Saturday, 21 April 2012 at 23:54:26 UTC, Jonathan M Davis
 
 wrote:
  Yeah. toLower would be called on both strings on _every_
  compare. And since
  that involves a loop, that would make the overall call to sort
  an order of
  magnitude worse than if you didn't call toLower at all. I'm not
  sure if it's
  an order of magnitude worse than your solution though, since
  it's been a while
  since I studied Big(O) notation (doing the conversion only once
  is still more
  expensive than not converting, but I'm not sure how much more
  expensive - it
  might cost less than sort such that it actually doesn't matter
  as for as
  Big(O) goes though).
  
  - Jonathan M Davis
 
 use of toLower in the sort expression adds around 11.2 secs
 overhead to a 0.3 sec operation which reads and sorts the 34k
 directory entries in this 2GB layout on the ssd drive, so it
 isn't an option for me.
 
 finished! time:312 ms
 
 finished! time:11598 ms
 
 std.algorithm.sort!(toLower(a) 
 toLower(b),std.algorithm.SwapStrategy.stable)(dirs);
 //std.algorithm.sort!(a  b,
 std.algorithm.SwapStrategy.stable)(dirs);

It wasn't saying that it _was_ an option. I was just trying to point out the 
likely reason why it's so bad with toLower - algorithmically-speaking. This 
definitely appears to be a case where doing some extra computation ahead of 
time will save you a lot later.

- Jonathan M Davis


Re: toLower() and Unicode are incomplete was: Re: avoid toLower in std.algorithm.sort compare alias

2012-04-21 Thread Jonathan M Davis
On Saturday, April 21, 2012 18:43:23 Ali Çehreli wrote:
 On 04/21/2012 04:24 PM, Jay Norwood wrote:
   While playing with sorting the unzip archive entries I tried use of the
   last example in http://dlang.org/phobos/std_algorithm.html#sort
   
   std.algorithm.sort!(toLower(a.name) 
   toLower(b.name),std.algorithm.SwapStrategy.stable)(entries);
 
 Stealing this thread to point out that converting a letter to upper or
 lower case cannot be done without knowing the writing system. Phobos's
 toLower() documentation currently says: Returns a string which is
 identical to s except that all of its characters are lowercase (in
 unicode, not just ASCII).
 
 Unicode cannot define the conversions of at least the following letters
 without knowing the actual alphabet that the text is written in:
 
 - Lowercase of I is ı in some alphabets[*] and i in many others.
 
 - Uppercase of i is İ in some alphabets[*] and I in many others.
 
 Ali
 
 [*] Turkish, Azeri, Chrimean Tatar, Gagauz, Celtic, etc.

toLower and toUpper get pretty screwing with unicode. I don't know enough 
about non-English alphabets to know what affects what, but at minimum, there 
are a number of cases where toLower does not reverse toUpper (and vice versa). 
Rather, it converts the character into yet another letter. So, toLower to 
toUpper with unicode and definitely a bit iffy. I suppose that they do the job 
if you call them enough on the string that it doesn't change anymore, but I 
don't know.

I also don't know how they act with regards to the various alphabets and how 
their implementation was decided upon. IIRC, Walter wrote them, and I'm sure 
that they're based on the unicode standard, but what that amounts to, I don't 
know.

- Jonathan M Davis


Re: avoid toLower in std.algorithm.sort compare alias

2012-04-21 Thread Jonathan M Davis
On Saturday, April 21, 2012 18:26:42 H. S. Teoh wrote:
 Actually, I don't think the nested loops affect Big-O complexity at all.
 The O(l) complexity (where l = string length) is already inherent in the
 string comparison str  otherStr. Adding more loops over the strings
 doesn't change the Big-O complexity (you just make O(l) into 3*O(l)
 which is the same as O(l)).

If the loop is really nested, then it will increase the Big(O) complexity, 
whereas if it's parallel to a similar loop, it will just increase the constant 
factor. Which it really is in this case, I don't know without sitting down and 
sketching out the exact algorithm, but certainly upon a first glance, it was my 
conclusion that the loops were nested. If they're not, then they're not.

Regardless of whether it's the Big(O) complexity or the constant factor that's 
the problem here, clearly there's enough additional overhead that it's causing 
problems for Jay's particular case. It's also the sort of thing that can be 
easy to miss and then end up wondering why your code is so slow (if it 
actually matters in your particular situation).

- Jonathan M Davis