Re: Replacing tango.text.Ascii.isearch

2022-10-28 Thread rikki cattermole via Digitalmars-d-learn

On 29/10/2022 11:05 AM, Siarhei Siamashka wrote:
And as for the D language and Phobos, should "ß" still uppercase to 
"SS"? Or can we change it to uppercase "ẞ" and remove German from the 
list of tricky languages at 
https://dlang.org/library/std/uni/to_upper.html ? Should Turkish be 
listed there?


That particular function, is based upon the simple mappings provided by 
UnicodeData.txt and (should be) in compliance of the Unicode standard.


The only thing we need to do is regenerate the tables backing it 
whenever Unicode updates.


Note the behavior you are asking for is defined in the Unicode database 
file SpecialCasing.txt which have not been implemented.


```
# The German es-zed is special--the normal mapping is to SS.
# Note: the titlecase should never occur in practice. It is equal to 
titlecase(uppercase())


00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
```

That file is how you support languages like Turkish. We currently don't 
have it implemented. It requires operating on a whole string and to pass 
in what language rules to apply (i.e. Turkish, Azeri).


Re: Replacing tango.text.Ascii.isearch

2022-10-28 Thread Siarhei Siamashka via Digitalmars-d-learn

On Wednesday, 26 October 2022 at 06:05:14 UTC, Ali Çehreli wrote:
The problem with Unicode is its main aim of allowing characters 
of multiple writing systems in the same text. When multiple 
writing systems are in play, conflicts and ambiguities will 
appear.


I personally don't think that it's the problem of the Unicode 
itself. Based on what I can see, it looks like the individuals or 
the committees responsible for mapping the Turkish alphabet to 
Unicode just made a blunder.


For example, let's compare the Latin uppercase "B" and the 
Cyrillic uppercase "В". Looks exactly the same, right? Would it 
be a smart idea for them to share the same index in the Unicode 
table? But wait. What happens if we convert these letters to 
lowercase? The Latin "B" becomes "b" and the Cyrillic "В" becomes 
"в". Oops! So by having different indexes for the Latin uppercase 
"B" and the Cyrillic uppercase "В", we dodged a whole bunch of 
nasty problems.


Another example. Patrick Schluter mentioned the Greek sigma 
letter and the [wikipedia 
article](https://en.wikipedia.org/wiki/Sigma) says: "uppercase Σ, 
lowercase σ, lowercase in word-final position ς", which makes 
everything rather problematic. Now let's compare this to the 
Belarusian language and its letter "у". The Belarusian "у" 
transforms into "ў" depending on context, however this 
transformation doesn't happen for the first letter of proper 
nouns or in acronyms (and this theoretically makes the uppercase 
"ў" redundant). Just imagine an alternative Greek-inspired 
reality, where both "у" and "ў" uppercase to "У". And yet the 
uppercase "Ў" exists in Unicode, so luckily in our reality we 
don't have to deal with uppercase/lowercase round trip failures. 
This is computers friendly. And as I already mentioned in an 
earlier comment, the Germans also got the uppercase "ẞ" in 
Unicode since 2008 (better late than never).


I solved my problem by writing an Alphabet hierarchy in the 
past. I don't like that code but it still works:


[...]

It's confusing but it seems to work. :) It doesn't matter. Life 
is imperfect and things will somehow work in the end.


What's your opinion/conclusion? Is it fine the way it is? Do you 
think that some unique property of the Turkish language/alphabet 
made these difficulties unavoidable? Or do you think that it was 
a mistake, but now it has to live with us forever for 
compatibility reasons? Anything else?


And as for the D language and Phobos, should "ß" still uppercase 
to "SS"? Or can we change it to uppercase "ẞ" and remove German 
from the list of tricky languages at 
https://dlang.org/library/std/uni/to_upper.html ? Should Turkish 
be listed there?


Re: Replacing tango.text.Ascii.isearch

2022-10-26 Thread Ali Çehreli via Digitalmars-d-learn

On 10/25/22 22:49, Siarhei Siamashka wrote:

> Unicode is significantly simpler than a set of various
> incompatible 8-bit encodings

Strongly agreed.

> I'm surely
> able to ignore the peculiarities of modern Turkish Unicode

The problem with Unicode is its main aim of allowing characters of 
multiple writing systems in the same text. When multiple writing systems 
are in play, conflicts and ambiguities will appear.


> and wait for
> the other people to come up with a solution for D language if they
> really care.

I solved my problem by writing an Alphabet hierarchy in the past. I 
don't like that code but it still works:



https://bitbucket.org/acehreli/ddili/src/4c0552fe8352dfe905c9734a57d84d36ce4ed476/src/alphabet.d#lines-50

It handles capitalization, ordering, etc. I use it when preparing the 
Index section of the Turkish edition of "Programming in D":


  http://ddili.org/ders/d/ix.html

One of the ambiguities is what came up on this thread: Should a word 
that starts with I (capital i) be listed under I (because it's Turkish) 
or under İ (because it's English)? So far, I am lucky because the only 
word that starts with I happens to be the English "IDE", so it goes 
under i (which appears as İ in the Turkish edition) and would make sense 
to a Turkish reader because a Turkish reader might (really?) accept it 
as the capital of ide.


It's confusing but it seems to work. :) It doesn't matter. Life is 
imperfect and things will somehow work in the end.


Ali



Re: Replacing tango.text.Ascii.isearch

2022-10-26 Thread rikki cattermole via Digitalmars-d-learn

On 26/10/2022 6:49 PM, Siarhei Siamashka wrote:

On Wednesday, 26 October 2022 at 05:17:06 UTC, rikki cattermole wrote:
if you are able to ignore that Unicode is a thing, I'd recommend it. 
It is complicated, as we humans are very complicated ;)


I can't ignore Unicode, because I frequently have to deal with Cyrillic 
alphabet ;) Also Unicode is significantly simpler than a set of various 
incompatible 8-bit encodings (such as 
[CP1251](https://en.wikipedia.org/wiki/Windows-1251) vs. variants of 
[KOI-8](https://en.wikipedia.org/wiki/KOI-8) vs. [ISO/IEC 
8859-5](https://en.wikipedia.org/wiki/ISO/IEC_8859-5)) that were 
simultaneously in use earlier and caused a lot of pain. But I'm surely 
able to ignore the peculiarities of modern Turkish Unicode and wait for 
the other people to come up with a solution for D language if they 
really care.


Cyrillic isn't an issue.

Lithuanian, Turkish and Azeri are the ones with the biggest issues.

There is a bunch of non-simple mappings for Latin, Armenian and Greek, 
but they are not language dependent. There is six conditional ones which 
are all Greek.


So if you are not dealing with these languages (even if you are, a 
simple replace should be easy to do for most), you should be fine with 
the simple mappings supported by std.uni.


Re: Replacing tango.text.Ascii.isearch

2022-10-25 Thread Siarhei Siamashka via Digitalmars-d-learn
On Wednesday, 26 October 2022 at 05:17:06 UTC, rikki cattermole 
wrote:
if you are able to ignore that Unicode is a thing, I'd 
recommend it. It is complicated, as we humans are very 
complicated ;)


I can't ignore Unicode, because I frequently have to deal with 
Cyrillic alphabet ;) Also Unicode is significantly simpler than a 
set of various incompatible 8-bit encodings (such as 
[CP1251](https://en.wikipedia.org/wiki/Windows-1251) vs. variants 
of [KOI-8](https://en.wikipedia.org/wiki/KOI-8) vs. [ISO/IEC 
8859-5](https://en.wikipedia.org/wiki/ISO/IEC_8859-5)) that were 
simultaneously in use earlier and caused a lot of pain. But I'm 
surely able to ignore the peculiarities of modern Turkish Unicode 
and wait for the other people to come up with a solution for D 
language if they really care.


Re: Replacing tango.text.Ascii.isearch

2022-10-25 Thread rikki cattermole via Digitalmars-d-learn

On 26/10/2022 6:06 PM, Siarhei Siamashka wrote:
Should we ignore the `"D should strive to be correct, rather than fast"` 
comment from bauss for now? Or some actions can be taken to improve the 
current situation?


Bauss is correct.

It should be implemented but it does not need to be fast.

But yeah, if you are able to ignore that Unicode is a thing, I'd 
recommend it. It is complicated, as we humans are very complicated ;)


Re: Replacing tango.text.Ascii.isearch

2022-10-25 Thread Siarhei Siamashka via Digitalmars-d-learn
On Tuesday, 25 October 2022 at 06:32:00 UTC, rikki cattermole 
wrote:

On 25/10/2022 5:17 PM, Siarhei Siamashka wrote:
What are the best practices to deal with Turkish text in D 
language?


std.uni doesn't support it.


OK, I'm not specifically interested in this personally and I even 
would be happy to remain blissfully ignorant. Just wondered 
whether a preferred solution already exists, considering that 
this forum has a Turkish section.


Should we ignore the `"D should strive to be correct, rather than 
fast"` comment from bauss for now? Or some actions can be taken 
to improve the current situation?


Re: Replacing tango.text.Ascii.isearch

2022-10-25 Thread rikki cattermole via Digitalmars-d-learn

On 25/10/2022 5:17 PM, Siarhei Siamashka wrote:
Wow, I didn't expect anything like this and just thought that the 
nightmares of handling 8-bit codepages for non-English languages ceased 
to exist nowadays. Too bad. What are the best practices to deal with 
Turkish text in D language?


std.uni doesn't support it.

For casing it only supports the simple mappings which are 1:1 and not 
language dependent.


I haven't got to it yet for my own string handling library, so I can't 
point you to that (even if it was not ready).


I'm sure somebody has got it but you may end up wanting to use ICU 
unfortunately.


Re: Replacing tango.text.Ascii.isearch

2022-10-24 Thread Siarhei Siamashka via Digitalmars-d-learn

On Thursday, 13 October 2022 at 08:27:17 UTC, bauss wrote:

```d
bool isearch(S1, S2)(S1 haystack, S2 needle)
{
import std.uni;
import std.algorithm;
return haystack.asLowerCase.canFind(needle.asLowerCase);
}
```

untested.

-Steve


This doesn't actually work properly in all languages. It will 
probably work in most, but it's not entirely correct.


Ex. Turkish will not work with it properly.

Very interesting article: 
http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html


Wow, I didn't expect anything like this and just thought that the 
nightmares of handling 8-bit codepages for non-English languages 
ceased to exist nowadays. Too bad. What are the best practices to 
deal with Turkish text in D language?


For example, [Ukrainian letters 'і' and 
'І'](https://en.wikipedia.org/wiki/Dotted_I_(Cyrillic)) don't 
share the same codes with Latin 'i' and 'I' and this is working 
fine. Except for a possible [phishing 
opportunity](https://www.theguardian.com/technology/2017/apr/19/phishing-url-trick-hackers). Why haven't the standard committees done the same for Turkish 'I' yet?


As for the [German letter 
'ß'](https://en.wikipedia.org/wiki/%C3%9F), wikipedia says that 
the uppercase variant 'ẞ' exists since 2008 (ISO 10646). Do 
German people use it now?

```D
import std;
void main() {
  "ß".asUpperCase.writeln; // prints "SS"
  "ẞ".asLowerCase.writeln; // prints "ß"
  "ẞ".asLowerCase.asUpperCase.writeln; // prints "SS"
}
```


Re: Replacing tango.text.Ascii.isearch

2022-10-13 Thread Patrick Schluter via Digitalmars-d-learn

On Thursday, 13 October 2022 at 08:27:17 UTC, bauss wrote:
On Wednesday, 5 October 2022 at 17:29:25 UTC, Steven 
Schveighoffer wrote:

On 10/5/22 12:59 PM, torhu wrote:
I need a case-insensitive check to see if a string contains 
another string for a "quick filter" feature. It should 
preferrably be perceived as instant by the user, and needs to 
check a few thousand strings in typical cases. Is a regex the 
best option, or what would you suggest?


https://dlang.org/phobos/std_uni.html#asLowerCase

```d
bool isearch(S1, S2)(S1 haystack, S2 needle)
{
import std.uni;
import std.algorithm;
return haystack.asLowerCase.canFind(needle.asLowerCase);
}
```

untested.

-Steve


This doesn't actually work properly in all languages. It will 
probably work in most, but it's not entirely correct.


Ex. Turkish will not work with it properly.


Greek will also be problematic. 2 different lowercase sigmas but 
only 1 uppercase. Other languages that may make issues, German 
where normally ß uppercases as SS (or not) but not the other way 
round, but here we already arrived to Unicode land and the 
normalization conundrum.





Re: Replacing tango.text.Ascii.isearch

2022-10-13 Thread rikki cattermole via Digitalmars-d-learn

On 13/10/2022 9:55 PM, bauss wrote:

Yeah, text isn't easy :D


Indeed!

It has me a bit concerned actually, I'm wondering if my string stuff 
will even work correctly for UI's due to performance issues.


My string builder for instance allocates like crazy just to do slicing. 
But hey, at least I can feel confident that my general purpose allocator 
& infrastructure is working correctly!


Re: Replacing tango.text.Ascii.isearch

2022-10-13 Thread bauss via Digitalmars-d-learn
On Thursday, 13 October 2022 at 08:48:49 UTC, rikki cattermole 
wrote:

On 13/10/2022 9:42 PM, bauss wrote:
Oh and to add onto this, IFF you have to do it the hacky way, 
then converting to uppercase instead of lowercase should be 
preferred, because not all lowercase characters can perform 
round trip, although a small group of characters, then using 
uppercase fixes it, so that's a relatively easy fix. A round 
trip is basically converting characters from one culture to 
another and then back. It's impossible with some characters 
when converting to lowercase, but should always be possible 
when converting to uppercase.


You will want to repeat this process with normalize to NFKC and 
normalize to NFD before transforming. Otherwise there is a 
possibility that you will miss some transformations as the 
simplified mappings are 1:1 for characters and not everything 
is representable as a single character.


Yeah, text isn't easy :D


Re: Replacing tango.text.Ascii.isearch

2022-10-13 Thread rikki cattermole via Digitalmars-d-learn

On 13/10/2022 9:42 PM, bauss wrote:
Oh and to add onto this, IFF you have to do it the hacky way, then 
converting to uppercase instead of lowercase should be preferred, 
because not all lowercase characters can perform round trip, although a 
small group of characters, then using uppercase fixes it, so that's a 
relatively easy fix. A round trip is basically converting characters 
from one culture to another and then back. It's impossible with some 
characters when converting to lowercase, but should always be possible 
when converting to uppercase.


You will want to repeat this process with normalize to NFKC and 
normalize to NFD before transforming. Otherwise there is a possibility 
that you will miss some transformations as the simplified mappings are 
1:1 for characters and not everything is representable as a single 
character.




Re: Replacing tango.text.Ascii.isearch

2022-10-13 Thread bauss via Digitalmars-d-learn

On Thursday, 13 October 2022 at 08:35:50 UTC, bauss wrote:
On Thursday, 13 October 2022 at 08:30:04 UTC, rikki cattermole 
wrote:

On 13/10/2022 9:27 PM, bauss wrote:
This doesn't actually work properly in all languages. It will 
probably work in most, but it's not entirely correct.


Ex. Turkish will not work with it properly.

Very interesting article: 
http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html


Yes turkic languages, they require a state machine and quite a 
bit of LUTs to work correctly.


You also need to provide a language and it has to operate on 
the whole string, not individual characters.


I didn't think it was relevant since Ascii was in the original 
post ;)


I think it's relevant when it comes to D since D is arguably a 
unicode language, not ascii.


D should strive to be correct, rather than fast.


Oh and to add onto this, IFF you have to do it the hacky way, 
then converting to uppercase instead of lowercase should be 
preferred, because not all lowercase characters can perform round 
trip, although a small group of characters, then using uppercase 
fixes it, so that's a relatively easy fix. A round trip is 
basically converting characters from one culture to another and 
then back. It's impossible with some characters when converting 
to lowercase, but should always be possible when converting to 
uppercase.


Re: Replacing tango.text.Ascii.isearch

2022-10-13 Thread bauss via Digitalmars-d-learn
On Thursday, 13 October 2022 at 08:30:04 UTC, rikki cattermole 
wrote:

On 13/10/2022 9:27 PM, bauss wrote:
This doesn't actually work properly in all languages. It will 
probably work in most, but it's not entirely correct.


Ex. Turkish will not work with it properly.

Very interesting article: 
http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html


Yes turkic languages, they require a state machine and quite a 
bit of LUTs to work correctly.


You also need to provide a language and it has to operate on 
the whole string, not individual characters.


I didn't think it was relevant since Ascii was in the original 
post ;)


I think it's relevant when it comes to D since D is arguably a 
unicode language, not ascii.


D should strive to be correct, rather than fast.


Re: Replacing tango.text.Ascii.isearch

2022-10-13 Thread rikki cattermole via Digitalmars-d-learn

On 13/10/2022 9:27 PM, bauss wrote:
This doesn't actually work properly in all languages. It will probably 
work in most, but it's not entirely correct.


Ex. Turkish will not work with it properly.

Very interesting article: 
http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html


Yes turkic languages, they require a state machine and quite a bit of 
LUTs to work correctly.


You also need to provide a language and it has to operate on the whole 
string, not individual characters.


I didn't think it was relevant since Ascii was in the original post ;)


Re: Replacing tango.text.Ascii.isearch

2022-10-13 Thread bauss via Digitalmars-d-learn
On Wednesday, 5 October 2022 at 17:29:25 UTC, Steven 
Schveighoffer wrote:

On 10/5/22 12:59 PM, torhu wrote:
I need a case-insensitive check to see if a string contains 
another string for a "quick filter" feature. It should 
preferrably be perceived as instant by the user, and needs to 
check a few thousand strings in typical cases. Is a regex the 
best option, or what would you suggest?


https://dlang.org/phobos/std_uni.html#asLowerCase

```d
bool isearch(S1, S2)(S1 haystack, S2 needle)
{
import std.uni;
import std.algorithm;
return haystack.asLowerCase.canFind(needle.asLowerCase);
}
```

untested.

-Steve


This doesn't actually work properly in all languages. It will 
probably work in most, but it's not entirely correct.


Ex. Turkish will not work with it properly.

Very interesting article: 
http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html


Re: Replacing tango.text.Ascii.isearch

2022-10-09 Thread rassoc via Digitalmars-d-learn

On 10/9/22 03:08, Siarhei Siamashka via Digitalmars-d-learn wrote:

Does the difference really have to be two orders of magnitude for you to 
acknowledge that there might be a performance problem in Phobos? [...] Except 
that similar one-liners implemented using other programming languages are 
faster and more versatile (can handle any input data without catastrophic 
performance pitfalls).


Oh, I get all that, there's no reason to argue with me or win me over, I can 
see that the implementation is subpar. Since I'm never hitting this performance 
bottleneck in my code, and being a regular dev and not a core maintainer, I 
simply haven't been motivated enough to contribute an improvement. Change like 
that isn't happening in the forums. Various optimizations have made it into 
Phobos in the past, don't think you would get any pushback if you can show that 
a new implementation improves the situation in almost all cases while 
maintaining compatibility.



Re: Replacing tango.text.Ascii.isearch

2022-10-08 Thread Siarhei Siamashka via Digitalmars-d-learn

On Saturday, 8 October 2022 at 01:07:46 UTC, rassoc wrote:
On 10/8/22 00:50, Siarhei Siamashka via Digitalmars-d-learn 
wrote:

On Friday, 7 October 2022 at 12:19:59 UTC, bachmeier wrote:
python -c "print(('a' * 49 + 'b') * 2)" > test.lst


That's generating a file with a single line:

$> wc -l test.lst
1 test.lst


If you insist on having 100K lines in the input data, then you 
can try a different test (run it as a shell script):


```bash
python -c "print(('a' *  + 'b') * 89 + '\\n' * 9 + 'a' * 
1)" > words.txt

cp words.txt test.lst
wc -l words.txt

NEEDLE=`python -c "print('a' * 1, end='')"`

echo "=== Python ==="
time python search.py "$NEEDLE"

echo "=== Ruby ==="
time ruby search.rb "$NEEDLE"

echo "=== Crystal ==="
time ./search-cr "$NEEDLE"

echo "=== D (LDC) ==="
time ./search-ldc "$NEEDLE"
```

It is testing a 1MB file with 100K lines and a 10K characters 
long needle. Results from my computer (with Python 3.10, because 
earlier versions of Python are slower):


```
10 words.txt
=== Python ===
1

real0m0.083s
user0m0.072s
sys 0m0.010s
=== Ruby ===
1

real0m0.313s
user0m0.313s
sys 0m0.000s
=== Crystal ===
1

real0m1.492s
user0m1.482s
sys 0m0.010s
=== D (LDC) ===
1

real1m10.314s
user1m10.234s
sys 0m0.050s
```

Going with an appropriate 100k mixed line file and your 
mentioned needle, D is still quite a bit slower, but the 
results aren't as drastic


Your "appropriate" file is also entirely artificial and happens 
to be specifically crafted to work best with the current 
".canFind" implementation from Phobos. The primary source of 
slowness are partial matches. Such as the `"int i = "` prefix 
when searching for `"int i = 0;"` substring in a string, which 
contains `"int i = 1;"`. The time is wasted on comparing the 
initial 8 characters before encountering a mismatch and bailing 
out. But when searching in a randomly generated gibberish, the 
chances of encountering long partial matches are very low. At 
least lower than in the real text files.



and nowhere near "two orders of magnitude".


Does the difference really have to be two orders of magnitude for 
you to acknowledge that there might be a performance problem in 
Phobos? Moreover, you are quoting torhu, who compared two D 
implementations compiled by DMD (regex vs. canFind) rather than 
standard libraries of different programming languages.



```D
import std;
void main(string[] args) {
enforce(args.length > 1, "Need a needle argument.");
auto needle = args[1].toLower;
File("words.txt").byLine.count!(ln => 
ln.asLowerCase.canFind(needle)).writeln;

}
```


It's a nicely looking one-liner implementation in D language and 
everything is fine. Except that similar one-liners implemented 
using other programming languages are faster and more versatile 
(can handle any input data without catastrophic performance 
pitfalls).


BTW, changing ".asLowerCase" to ".toLower" introduces an extra 
memory allocation, but still significantly improves handling of 
the worst case. Conversion to lowercase is expensive for unicode 
characters and revisiting the same character to repeat such 
conversion again and again is bad for performance.


Re: Replacing tango.text.Ascii.isearch

2022-10-07 Thread rassoc via Digitalmars-d-learn

On 10/8/22 00:50, Siarhei Siamashka via Digitalmars-d-learn wrote:

On Friday, 7 October 2022 at 12:19:59 UTC, bachmeier wrote:
python -c "print(('a' * 49 + 'b') * 2)" > test.lst


That's generating a file with a single line:

$> wc -l test.lst
1 test.lst

Going with an appropriate 100k mixed line file and your mentioned needle, D is still 
quite a bit slower, but the results aren't as drastic and nowhere near "two orders 
of magnitude".

$> crystal build --release -o search-cr search.cr
abort "Need a needle argument." unless ARGV.size >= 1
needle = ARGV[0].downcase
puts File.open("words.txt").each_line.count(&.downcase.includes? needle)

$> ldc2 -O2 --release -of search-ldc search.d
import std;
void main(string[] args) {
enforce(args.length > 1, "Need a needle argument.");
auto needle = args[1].toLower;
File("words.txt").byLine.count!(ln => 
ln.asLowerCase.canFind(needle)).writeln;
}


Re: Replacing tango.text.Ascii.isearch

2022-10-07 Thread Siarhei Siamashka via Digitalmars-d-learn

On Friday, 7 October 2022 at 12:19:59 UTC, bachmeier wrote:

https://www.cs.utexas.edu/users/moore/best-ideas/string-searching/

"the longer the pattern is, the faster the algorithm goes"


Yes, that's how substring search works in the standard libraries 
of the other programming languages. Now please take the torhu's 
test program (posted a few comments above) and generate input for 
it using


python -c "print(('a' * 49 + 'b') * 2)" > test.lst

Run it to search for 
"aa" (the 
character 'a' replicated 50 times) and then compare its 
performance against the other similar programs doing the same 
thing:


Python:

```Python
import sys
assert len(sys.argv) >= 2, "Need a needle argument."
needle = sys.argv[1].lower()
print(sum([1 if line.lower().find(needle) != -1 else 0 for line 
in open("test.lst")]))

```

Ruby/Crystal:

```Ruby
abort "Need a needle argument." unless ARGV.size >= 1
needle = ARGV[0].downcase
puts File.open("test.lst").each_line.sum {|line| 
line.downcase.index(needle) ? 1 : 0 }

```

A generic substring search is tuned to be fast on any input in 
the other programming languages. But in Phobos a simpleminded 
slow search algorithm is used by default. The Boyer-Moore 
algorithm can be used too if it's explicitly requested, but it 
has some gotchas of its own.


Re: Replacing tango.text.Ascii.isearch

2022-10-07 Thread bachmeier via Digitalmars-d-learn
On Friday, 7 October 2022 at 07:16:19 UTC, Siarhei Siamashka 
wrote:
On Friday, 7 October 2022 at 06:34:50 UTC, Siarhei Siamashka 
wrote:
Also are we allowed to artificially construct needle and 
haystack to blow up this test rather than only benchmarking it 
on typical real data?


Such as generating the input data via running:

python -c "print(('a' * 49 + 'b') * 2)" > test.lst

And then using 
"aa" (the 
character 'a' replicated 50 times) as the needle to search for. 
Much longer needles work even better. In Linux the command line 
size is limited by 128K, so there's a huge room for improvement.


https://www.cs.utexas.edu/users/moore/best-ideas/string-searching/

"the longer the pattern is, the faster the algorithm goes"


Re: Replacing tango.text.Ascii.isearch

2022-10-07 Thread Siarhei Siamashka via Digitalmars-d-learn
On Friday, 7 October 2022 at 06:34:50 UTC, Siarhei Siamashka 
wrote:
Also are we allowed to artificially construct needle and 
haystack to blow up this test rather than only benchmarking it 
on typical real data?


Such as generating the input data via running:

python -c "print(('a' * 49 + 'b') * 2)" > test.lst

And then using 
"aa" (the 
character 'a' replicated 50 times) as the needle to search for. 
Much longer needles work even better. In Linux the command line 
size is limited by 128K, so there's a huge room for improvement.


Re: Replacing tango.text.Ascii.isearch

2022-10-07 Thread Siarhei Siamashka via Digitalmars-d-learn

On Friday, 7 October 2022 at 00:57:38 UTC, rassoc wrote:

On 10/7/22 01:39, torhu via Digitalmars-d-learn wrote:

regex is about ten times faster then.


Interesting! Using your code, I'm seeing a 1.5x max difference 
for ldc, nothing close to 10x. Welp, the woes of superficial 
benchmarking. :)


Benchmark results depend on many things, such as the actual text 
in both needle and haystack and the needle length. Are we dealing 
with unicode text by the way? One example is searching for 
something like "äußere" in 
https://www.gutenberg.org/ebooks/6343.txt.utf-8


If it's the source code, then searching for 
"sqlite3_value_bytes16" in the sqlite3.c file from 
https://www.sqlite.org/2022/sqlite-amalgamation-3390400.zip may 
be a good test too.


I'm getting at least 5x difference in favor of regex with LDC on 
these two examples.


Also are we allowed to artificially construct needle and haystack 
to blow up this test rather than only benchmarking it on typical 
real data?


Re: Replacing tango.text.Ascii.isearch

2022-10-06 Thread rassoc via Digitalmars-d-learn

On 10/7/22 01:39, torhu via Digitalmars-d-learn wrote:

regex is about ten times faster then.


Interesting! Using your code, I'm seeing a 1.5x max difference for ldc, nothing 
close to 10x. Welp, the woes of superficial benchmarking. :)



Re: Replacing tango.text.Ascii.isearch

2022-10-06 Thread torhu via Digitalmars-d-learn

On Thursday, 6 October 2022 at 21:36:48 UTC, rassoc wrote:

And what kind of testing was that? Mind to share? Because I did 
the following real quick and wasn't able to measure a "two 
orders of magnitude" difference. Sure, the regex version came 
on top, but they were both faster than the ruby baseline I 
cooked up.


Originally I just loaded a one megabyte file and searched the 
whole thing. I changed it to split it into (40 000) lines 
instead, regex is about ten times faster then. I compile with 
-release -O -inline. Here is the second version:


```d
import std;
import std.datetime.stopwatch;

enum FILE = "test.lst";
string text;
string needle;

void test(bool delegate(string haystack) dg)
{

auto sw = StopWatch(AutoStart.yes);
int counter = 0;
foreach (line; lineSplitter(text)) {
if (dg(line))
counter++;
}
sw.stop();
writefln("%s", sw.peek());
writefln("counter: %s", counter);
}

void main(char[][] args)
{
enforce(args.length > 1, "Need a needle argument.");

text = cast(string)read(FILE);
needle = args[1].idup;
auto re = regex(to!string(escaper(needle)), "i");
string needleLower = needle.toLower();

test((h) => !!h.matchFirst(re));
test((h) => h.asLowerCase().canFind(needleLower));
}
```



Re: Replacing tango.text.Ascii.isearch

2022-10-06 Thread rassoc via Digitalmars-d-learn

On 10/5/22 23:50, torhu via Digitalmars-d-learn wrote:

I did some basic testing, and regex was two orders of magnitude faster. So now 
I know, I guess.


And what kind of testing was that? Mind to share? Because I did the following real quick 
and wasn't able to measure a "two orders of magnitude" difference. Sure, the 
regex version came on top, but they were both faster than the ruby baseline I cooked up.

First, generate a word file with 100k entries of various lengths:

$> dmd -run words.d foobaz 10
---
import std;

string randomWord(ulong n) {
static chars = letters.array;
return generate!(() => chars.choice).take(n).text;
}

void main(string[] args) {
enforce(args.length == 3, "Usage: dmd -run words.d needle num");

auto f = File("words.txt", "w");
foreach (i; 0..args[2].to!ulong) {
ulong n = uniform(0, 50), m = uniform(0, 50);
if (i % 2 == 0)
f.writeln(randomWord(n), args[1], randomWord(m));
else
f.writeln(randomWord(n + m));
}
}
---

And then for the actual measuring:

$> dmd -O -version={range,regex} -of=search-{range,regex} search.d
$> ldc -O -d-version={range,regex} -of=search-{range,regex}-ldc search.d
$> time ./search-{range,regex}{,-ldc} foobaz
---
import std;

void main(string[] args) {
enforce(args.length == 2, "Usage: search 'needle'");

version (regex)
auto rx = regex(args[1], "i");
else version (range)
auto needle = args[1].asLowerCase.text;
else
static assert(0, "use -version={regex,range}");

ulong matches;
foreach (line; File("words.txt").byLine) {
version (regex)
if (line.matchFirst(rx))
matches++;
version (range)
if (line.asLowerCase.canFind(needle))
matches++;
}
writeln(matches);
}
---


Re: Replacing tango.text.Ascii.isearch

2022-10-06 Thread Sergey via Digitalmars-d-learn
On Thursday, 6 October 2022 at 08:15:10 UTC, Siarhei Siamashka 
wrote:

On Wednesday, 5 October 2022 at 21:50:32 UTC, torhu wrote:


Please don’t tell us that D will be slower than Python again?)




Re: Replacing tango.text.Ascii.isearch

2022-10-06 Thread Siarhei Siamashka via Digitalmars-d-learn

On Wednesday, 5 October 2022 at 21:50:32 UTC, torhu wrote:
I did some basic testing, and regex was two orders of magnitude 
faster. So now I know, I guess.


Substring search functionality is currently in a very bad shape 
in Phobos. I discovered this myself a few weeks ago when I was 
trying to solve the 
https://www.facebook.com/codingcompetitions/hacker-cup/2022/round-1/problems/A2 puzzle using D language. A part of the solution requires a fast substring search (actually a subarray search) and Phobos doesn't offer anything with even remotely acceptable performance.


Phobos does have a Boyer-Moore implementation. This algorithm is 
historically famous in computer science as one of the first 
attempts to optimize substring search, but it also has 
pathologically bad performance on certain input data and probably 
shouldn't be recommended for any practical use nowadays. The 
users of old versions of Python discovered this too: 
https://codeforces.com/blog/entry/106849?#comment-952483


The standard 'find' function from the following simple example 
also becomes awfully slow when arrays 'a' and 'b' are large 
and/or are adversarially constructed:


```D
import std;
void main() {
  auto a = [1, 2, 3, 4];
  auto b = [2, 3];
  writefln("Is %s a subarray of %s? The answer is %s.",
   b, a, a.find(b).empty ? "no" : "yes");
}
```

I think that the best fit for your use case is the 
https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm and there's an old issue in bugzilla about this: https://issues.dlang.org/show_bug.cgi?id=16066


BTW, if anyone is curious, one of the possible solutions for the 
hacker-cup/2022/round-1/problems/A2 puzzle in D language can be 
found here: 
https://codeforces.com/blog/entry/106768?#comment-952808


Re: Replacing tango.text.Ascii.isearch

2022-10-05 Thread torhu via Digitalmars-d-learn
On Wednesday, 5 October 2022 at 17:29:25 UTC, Steven 
Schveighoffer wrote:



```d
bool isearch(S1, S2)(S1 haystack, S2 needle)
{
import std.uni;
import std.algorithm;
return haystack.asLowerCase.canFind(needle.asLowerCase);
}
```

untested.

-Steve


I did some basic testing, and regex was two orders of magnitude 
faster. So now I know, I guess.


Re: Replacing tango.text.Ascii.isearch

2022-10-05 Thread Ali Çehreli via Digitalmars-d-learn

On 10/5/22 13:40, torhu wrote:


 auto sw = StopWatch();


Either this:

auto sw = StopWatch(AutoStart.yes);

or this:

auto sw = StopWatch();
sw.start();

Ali




Re: Replacing tango.text.Ascii.isearch

2022-10-05 Thread torhu via Digitalmars-d-learn

On Wednesday, 5 October 2022 at 20:45:55 UTC, torhu wrote:

On Wednesday, 5 October 2022 at 20:40:46 UTC, torhu wrote:



Am I doing something wrong here?


Right, you can instantiate structs without arguments. It's been 
ten years since I last used D, I was thinking of structs like 
if they were classes.


I think there should be sensible default here, seems like an easy 
trap to remove.


Re: Replacing tango.text.Ascii.isearch

2022-10-05 Thread torhu via Digitalmars-d-learn

On Wednesday, 5 October 2022 at 20:40:46 UTC, torhu wrote:



Am I doing something wrong here?


Right, you can instantiate structs without arguments. It's been 
ten years since I last used D, I was thinking of structs like if 
they were classes.


Re: Replacing tango.text.Ascii.isearch

2022-10-05 Thread torhu via Digitalmars-d-learn
On Wednesday, 5 October 2022 at 17:29:25 UTC, Steven 
Schveighoffer wrote:

[...]


I wanted to do some quick benchmarking to figure out what works.

When I run this:

```d
import std.stdio;
import std.datetime.stopwatch;

void main()
{
auto sw = StopWatch();  
sw.stop();
writeln(sw.peek().toString());
}
```

It prints this:
2 weeks, 6 days, 9 hours, 34 minutes, 43 secs, 214 ms, 946 ╬╝s, 
and 7 hnsecs


Am I doing something wrong here?


Re: Replacing tango.text.Ascii.isearch

2022-10-05 Thread Steven Schveighoffer via Digitalmars-d-learn

On 10/5/22 12:59 PM, torhu wrote:
I need a case-insensitive check to see if a string contains another 
string for a "quick filter" feature. It should preferrably be perceived 
as instant by the user, and needs to check a few thousand strings in 
typical cases. Is a regex the best option, or what would you suggest?


https://dlang.org/phobos/std_uni.html#asLowerCase

```d
bool isearch(S1, S2)(S1 haystack, S2 needle)
{
import std.uni;
import std.algorithm;
return haystack.asLowerCase.canFind(needle.asLowerCase);
}
```

untested.

-Steve