Re: Autodecode in the wild and An Awful Hack to std.regex

2016-08-01 Thread Joakim via Digitalmars-d-learn

On Thursday, 28 July 2016 at 21:02:59 UTC, John Carter wrote:

On Thursday, 28 July 2016 at 15:48:58 UTC, Seb wrote:

[...]


Eh. I hoped that somewhere in that explosion of discussion on 
the topic the problem had been solved and I had just missed it 
and merely had to use that.


Also this idea is a bit immature for a DIP... I haven't looked at 
the regex code beyond the stack trace it died on.


i.e.

* Would this even be a Good Idea for a DIP, or is it better 
solved by another existing means?
* I only inspected and changed one occurrence of decode (the 
one that broke) is there any other route in the regex engine 
that could throw a UTFException?
* Would adding an additional template parameter with default 
break existing code? Or would I have to provide a shim?


I suggest you talk to Dmitry, who wrote std.regex, as he will be 
motivated to look into this.


Re: Autodecode in the wild and An Awful Hack to std.regex

2016-07-28 Thread John Carter via Digitalmars-d-learn

On Thursday, 28 July 2016 at 15:48:58 UTC, Seb wrote:
We call them DIPs (D Improvement Proposals) and I think they're a 
much more productive way to discuss improvements than the 
forum.


Eh. I hoped that somewhere in that explosion of discussion on the 
topic the problem had been solved and I had just missed it and 
merely had to use that.


Also this idea is a bit immature for a DIP... I haven't looked at 
the regex code beyond the stack trace it died on.


i.e.

* Would this even be a Good Idea for a DIP, or is it better solved 
by another existing means?
* I only inspected and changed one occurrence of decode (the one 
that broke) is there any other route in the regex engine that 
could throw a UTFException?
* Would adding an additional template parameter with default 
break existing code? Or would I have to provide a shim?




Re: Autodecode in the wild and An Awful Hack to std.regex

2016-07-28 Thread Seb via Digitalmars-d-learn

On Thursday, 28 July 2016 at 09:10:33 UTC, Kagamin wrote:
Create an RFE? Given that regex returns results as slices of 
the input string, using the replacement character doesn't 
introduce data corruption.


We call them DIPs (D Improvement Proposals) and I think they're a 
much more productive way to discuss improvements than the forum.


For more info see: https://github.com/dlang/DIPs


Re: Autodecode in the wild and An Awful Hack to std.regex

2016-07-28 Thread Kagamin via Digitalmars-d-learn
A template parameter is usually needed when it affects the output 
data, but in the case of regex it won't do much, because the output 
data are slices of the input string, so decoding doesn't affect 
them, only whether exceptions are thrown.
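
To illustrate the point about slices, here is a minimal sketch using only valid input. The sample string and pattern are made up for the example; it checks that the captures returned by matchFirst point into the original string and are not decoded copies.

```d
import std.regex : matchFirst, regex;

void main()
{
    string line = "key=value";
    auto c = matchFirst(line, regex(`=(\w+)`));

    // The first group is the text after '='.
    assert(c[1] == "value");

    // Both the whole match and the group are slices of `line`:
    // their pointers land inside the original buffer.
    assert(c.hit.ptr == line.ptr + 3);
    assert(c[1].ptr == line.ptr + 4);
}
```

So whatever dchar the engine sees while scanning, the bytes handed back to the caller are the caller's own bytes.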


Re: Autodecode in the wild and An Awful Hack to std.regex

2016-07-28 Thread Lodovico Giaretta via Digitalmars-d-learn

On Thursday, 28 July 2016 at 09:10:33 UTC, Kagamin wrote:
Create an RFE? Given that regex returns results as slices of 
the input string, using the replacement character doesn't 
introduce data corruption.


(RFE = Request For Enhancement, right?)
Yes, all algorithms that use decode internally should expose a 
template parameter for useReplacementDchar. There's no reason not 
to expose this option, given that it avoids brutally aborting 
a computation in such situations, and also has the bonus point 
of making decode @nogc.
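
For reference, a small sketch of what the flag changes in std.utf.decode itself. The three-byte input is invented for the example; 0x80 is a continuation byte with no lead byte before it, so it is invalid on its own.

```d
import std.exception : assertThrown;
import std.utf : decode, replacementDchar, UseReplacementDchar, UTFException;

void main()
{
    // 'a', a stray continuation byte, 'b'.
    const(char)[] input = ['a', cast(char) 0x80, 'b'];

    // Default behaviour: decoding at the bad byte throws.
    size_t i = 1;
    assertThrown!UTFException(decode(input, i));

    // With the flag set, the same call returns U+FFFD instead
    // and advances the index past the bad input.
    size_t j = 1;
    dchar d = decode!(UseReplacementDchar.yes)(input, j);
    assert(d == replacementDchar);
    assert(j > 1);
}
```

The sketch does not annotate anything as @nogc; whether that attribute is inferred depends on the Phobos version in use.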


Re: Autodecode in the wild and An Awful Hack to std.regex

2016-07-28 Thread Kagamin via Digitalmars-d-learn
Create an RFE? Given that regex returns results as slices of the 
input string, using the replacement character doesn't introduce 
data corruption.


Autodecode in the wild and An Awful Hack to std.regex

2016-07-27 Thread John Carter via Digitalmars-d-learn
Don't you just hate it when you google a problem and find a post 
from yourself asking the same question?


In 2013 I ran into the UTF-8 invalid char autodecode UTFException, 
and the answer then was "use std.encoding.sanitize". My opinion, 
looking at the implementation, was then, as it is now... Eww!


Since then, I'm glad to see Walter Bright agrees that autodecode 
is problematic.


http://forum.dlang.org/thread/nh2o9i$hr0$1...@digitalmars.com

After wading through 46 pages of that, and Jack Stouffer's handy 
blog entry and the longish discussion thread on it...

https://forum.dlang.org/post/eozguhavggchzzruz...@forum.dlang.org

Maybe I missed something.

What am I supposed to do here in 2016?

I was happily churning through 25 gigabytes of data with

foreach( line; File( file).byLine()) {
   auto c = line.matchFirst( myRegex);
  .
  .

When I hit an invalid codepoint...Again.

What is an efficient (elegant) solution?

An inelegant solution was to hack into the point that throws the 
exception and "Do It Right" (for various values of Right)



diff -u /usr/include/dmd/phobos/std/regex/internal/ir.d{~,}
--- /usr/include/dmd/phobos/std/regex/internal/ir.d~	2015-12-03 14:41:31.0 +1300
+++ /usr/include/dmd/phobos/std/regex/internal/ir.d	2016-07-28 11:04:55.525480585 +1200
@@ -591,7 +591,7 @@
 pos = _index;
 if(_index == _origin.length)
 return false;
-res = std.utf.decode(_origin, _index);
+res = std.utf.decode!(UseReplacementDchar.yes)(_origin, _index);
 return true;
 }
 @property bool atEnd(){

That "Works For Me".

But it vaguely feels to me that that template parameter needs to 
be trickled all the way up the regex engine.
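
For anyone who would rather not patch Phobos, one possible stopgap (not something proposed in this thread, just a sketch) is to check each line with std.utf.validate before handing it to the regex, and skip or log the lines that fail. The helper name isValidUtf8, the pattern, and the sample lines are all hypothetical; in real use the lines would come from File(file).byLine().

```d
import std.regex : matchFirst, regex;
import std.utf : UTFException, validate;

/// Hypothetical helper: true when `line` is well-formed UTF-8.
bool isValidUtf8(const(char)[] line)
{
    try
    {
        validate(line);
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    auto re = regex(`(\w+)=(\w+)`);

    // Stand-ins for lines read from a file; `bad` ends in a stray 0x80 byte.
    const(char)[] bad = ['k', '=', cast(char) 0x80];
    const(char)[][] lines;
    lines ~= "key=value";
    lines ~= bad;
    lines ~= "a=b";

    size_t matched, skipped;
    foreach (line; lines)
    {
        if (!isValidUtf8(line))
        {
            ++skipped; // could also log the line number here
            continue;
        }
        auto c = matchFirst(line, re);
        if (!c.empty)
            ++matched;
    }

    assert(matched == 2);
    assert(skipped == 1);
}
```

The obvious cost is that every valid line gets decoded twice, once by validate and once by the regex engine, which may matter over 25 gigabytes. It also drops bad lines entirely, where the patch above still matches against them.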