On 8/2/2012 3:38 PM, Jonathan M Davis wrote:
> On Thursday, August 02, 2012 15:14:17 Walter Bright wrote:
>> Remember, it's the consumer doing the decoding, not the input range.
> But that's the problem. The consumer has to treat character ranges specially
> to make this work. It's not generic. If it were generic, then it would simply
> be using front, popFront, etc. It's going to have to special-case strings to
> do the buffering that you're suggesting. And if you have to special-case
> strings, then how is that any different from what we have now?
No, the consumer can do its own buffering. It only needs a 4-character buffer,
worst case.
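The four-unit buffer Walter describes suffices because no UTF-8 sequence is longer than four code units. Here is a sketch of that consumer-side decoding, in Python rather than D for illustration (`decode_front` is a hypothetical name, not Phobos API):

```python
def decode_front(it):
    """Decode one code point from an iterator of UTF-8 code units,
    buffering at most 4 bytes (the longest legal UTF-8 sequence)."""
    buf = bytearray()
    lead = next(it)
    buf.append(lead)
    # The lead byte's high bits say how many continuation bytes follow.
    if lead < 0x80:
        n = 0
    elif lead >> 5 == 0b110:
        n = 1
    elif lead >> 4 == 0b1110:
        n = 2
    elif lead >> 3 == 0b11110:
        n = 3
    else:
        raise UnicodeDecodeError("utf-8", bytes(buf), 0, 1, "invalid lead byte")
    for _ in range(n):
        buf.append(next(it))
    return bytes(buf).decode("utf-8")

stream = iter("aé€𝄞".encode("utf-8"))
print(decode_front(stream))  # 'a'  (1 code unit)
print(decode_front(stream))  # 'é'  (2 code units)
print(decode_front(stream))  # '€'  (3 code units)
print(decode_front(stream))  # '𝄞'  (4 code units)
```

Each call pulls only as many code units from the stream as the lead byte demands, so the consumer never holds more than four bytes at a time.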
> If you're arguing that strings should be treated as ranges of code units, then
> pretty much _every_ range-based function will have to special-case strings to
> even work correctly - otherwise it'll be operating on individual code units
> rather than code points (e.g. filtering code units rather than code points,
> which would generate an invalid string). This makes the default behavior
> incorrect, forcing _everyone_ to special-case strings _everywhere_ if they
> want correct behavior with ranges which are strings. And efficiency means
> nothing if the result is wrong.
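The invalid-string hazard Jonathan describes is easy to demonstrate. In Python for illustration: dropping a single code unit out of a multibyte sequence leaves an orphan continuation byte, while filtering at the code-point level always yields a valid string:

```python
s = "héllo"
units = s.encode("utf-8")  # code units: b'h\xc3\xa9llo'

# Filtering at the code-unit level can split a multibyte sequence:
# removing the byte 0xC3 strands the continuation byte 0xA9.
bad = bytes(u for u in units if u != 0xC3)
try:
    bad.decode("utf-8")
except UnicodeDecodeError:
    print("filtered code units -> invalid UTF-8")

# Filtering at the code-point level keeps the string well-formed.
good = "".join(c for c in s if c != "é")
print(good)  # "hllo"
```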
No, I'm arguing that the LEXER should accept a UTF8 input range for its input. I
am not making a general argument about ranges, characters, or Phobos.
> As it is now, the default behavior of strings with range-based functions is
> correct but inefficient, so at least we get correct code. And if someone wants
> their string processing to be efficient, then they special-case strings and do
> things like the buffering that you're suggesting. So, we have correct by
> default with efficiency as an option. The alternative that you seem to be
> suggesting (treating strings as ranges of code units) means that it would be
> fast by default but correct as an option, which is completely backwards IMHO.
> Efficiency is important, but it's pointless if the result is wrong, and
> expecting that your average programmer is going to write Unicode-aware code
> which functions correctly is completely unrealistic.
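The correct-by-default versus fast-by-default distinction shows up in something as simple as counting elements. In Python terms (code points versus UTF-8 code units), as an illustration of the two views of the same string:

```python
s = "naïve"
print(len(s))                  # 5 code points - the answer a user usually wants
print(len(s.encode("utf-8")))  # 6 code units - 'ï' occupies two bytes in UTF-8
```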
Efficiency for the *lexer* is of *paramount* importance. I don't anticipate that
std.d.lexer will be implemented by some random newbie; I expect it to be
carefully implemented and to do Unicode correctly, regardless of how difficult
or easy that may be.
I seem to utterly fail at making this point.
The same point applies to std.regex - efficiency is terribly, terribly important
for it. Everyone judges regexes by their speed, and nobody cares how hard they
are to implement to get that speed.
To reiterate another point, since we are in the compiler business, people will
expect std.d.lexer to be of top quality, not some bag on the side. It needs to
be usable as a base for writing a professional quality compiler. It's the reason
why I'm pushing much harder on this than I do for other modules.