On 18.11.2011 21:07, Andrea Fontana wrote:
> It seems related to toLower too...
>
> Here the line with exception:
>
> s = replace(s, regex(`[^"a-zA-Z0-9àòèéìù\.]`, "g"), " ").toLower();
>
> Where s is a string with that sequence...
>
> Using dmd 2.056

You mean one of the prepackaged zips, debs, etc. from the website? Those ship the old regex, which, I have to admit, is not that good with Unicode. In that case ... well, you are somewhat out of luck until the next release.

That's where the brand new regex engine is coming, provided I figure out a mysterious FreeBSD/OSX issue (sigh). Unfortunately, I have been very busy recently, though maybe this weekend I'll finally work something out.

I just tested it with my version on win32 ... well, it hits one of the asserts (it should have been an exception, ouch!), but the fix was easy. It's all about the '.', which is a plain literal character inside []; escaping it inside a character class is simply wrong (some engines do allow this, though it's confusing as hell).
With the fix in place it outputs this:
std.regex.RegexException@std\regex.d(1939): invalid escape sequence
Pattern with error: `[^"a-zA-Z0-9àòèéìù\.` <--HERE-- `]`

After changing \. to . in the pattern, it works for me with s = "Sò  ", no exceptions.
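
For illustration, a minimal sketch of the corrected call. It assumes a recent Phobos where std.regex provides replaceAll (with the older API, the replace(s, regex(..., "g"), " ") form from the original line applies instead). The name normalize is a hypothetical helper, not something from the original code.

```d
import std.regex : regex, replaceAll;
import std.uni : toLower;

// Hypothetical helper: replaces every character outside the allowed set
// with a space, then lowercases the result.
string normalize(string s)
{
    // Inside [...] an unescaped '.' is already a literal dot,
    // so no backslash is needed (or wanted) in front of it.
    auto notAllowed = regex(`[^"a-zA-Z0-9àòèéìù.]`);
    return replaceAll(s, notAllowed, " ").toLower();
}

void main()
{
    import std.stdio : writeln;

    // The string from the report: 'S' and the accented letter are kept,
    // the trailing space is replaced by a space, then everything is lowercased.
    writeln(normalize("Sò "));
}
```

Writing the dot unescaped keeps the pattern portable: a bare '.' inside a character class is a literal in regex engines generally, whereas support for \. there varies.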

Bottom line:
First, thanks: this uncovered a serious issue, namely a misjudged assert on invalid escapes in character classes.
Second, if you are on win32/linux you might want to try the fresh GitHub version.
And stay tuned for the next release, which should fix most regex issues once and for all.


On Fri, 18/11/2011 at 20:33 +0400, Dmitry Olshansky wrote:
On 18.11.2011 17:58, Andrea Fontana wrote:
>  I built a data access layer in C++. This layer works with MongoDB, where
>  strings are always encoded as UTF-8. I've ported this layer to D using
>  SWIG. The string is written correctly to the console, but when I use
>  std.regex it sometimes throws an exception:
>
>  core.exception.UnicodeException@src/rt/util/utf.d(290): invalid
>  UTF-8 sequence
>
>  The byte sequence (for better understanding) is:
>  [83, 195, 179, 32]
>
>  And the string was "Sò  " (with an accented o and a space)
>
>  I'm not a UTF expert, so is it a wrong UTF-8 encoding or is it a bug in
>  utf.d?
>
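
The reported bytes can be checked independently of std.regex. Below is a minimal sketch using std.utf.validate; isValidUtf8 is a hypothetical helper name chosen for this example. The pair 195, 179 (0xC3 0xB3) decodes to the single code point U+00F3, so the sequence is well-formed UTF-8.

```d
import std.utf : validate, UTFException;

// Hypothetical helper: returns true if the bytes are well-formed UTF-8.
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        // validate throws UTFException on the first malformed sequence.
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    import std.stdio : writeln;

    // The byte sequence from the report: 'S', a two-byte sequence, a space.
    immutable(ubyte)[] reported = [83, 195, 179, 32];
    writeln(isValidUtf8(reported));

    // For contrast: a lead byte with its continuation byte missing.
    immutable(ubyte)[] truncated = [83, 195];
    writeln(isValidUtf8(truncated));
}
```

If the check passes for the exact bytes handed over from the C++ side, the input itself is fine and the problem lies in how the library processes it.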

Which version of std.regex are you using: the one from git master or
the one in the latest release?
If it's the former, then I'm willing to look into this on the weekend,
if you can get hold of a string + pattern pair that fails like this.




--
Dmitry Olshansky