On 01/18/2011 06:14 PM, Michel Fortin wrote:

On 2011-01-18 11:38:45 -0500, Andrei Alexandrescu <[email protected]> said:
I was thinking along the lines of:

struct Grapheme
{
private string support_;
...
}

struct ByGrapheme
{
private string iteratee_;
bool empty();
Grapheme front();
void popFront();
// Additional funs
dchar frontCodePoint();
void popFrontCodePoint();
char frontCodeUnit();
void popFrontCodeUnit();
...
}

// helper function
ByGrapheme byGrapheme(string s);

// usage
string s = ...;
size_t i;
foreach (g; byGrapheme(s))
{
writeln("Grapheme #", i++, " is ", g); // increment so each grapheme gets its own index
}

We need this range in Phobos.

Yes, we need a grapheme range.

But that's not what my thing was about. It was about shortcutting code
point decoding when it isn't necessary while still keeping the ability
to decode to code points when iterating on the same range. For instance,
here's a simple made up example:

string s = "<hello>";
if (!s.empty && s.frontUnit == '<')
s.popFrontUnit(); // skip
while (!s.empty && s.frontUnit != '>')
s.popFront(); // do something with each code point
if (!s.empty && s.frontUnit == '>')
s.popFrontUnit(); // skip
assert(s.empty);

Here, since I know I'm testing and skipping for '<', an ASCII character,
decoding the code point is wasted time, so I skip that decoding. The
problem is that this optimization can't happen with a range that
abstracts things at the code point level. I can do it with strings
because strings still allow you to access code units through the
indexing operators, but this can't really apply to ranges of code points
in general.

And parsing with a range of code units would also be a pain, because even
if I'm testing for '<' as the first character, sometimes I really need
to advance by code point and test for code points.
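A self-contained sketch of the mixed-level parsing described above, for illustration only. The helpers `frontUnit` and `popFrontUnit` are hypothetical free functions written here for plain `string`; they are assumed not to exist in Phobos, and `countInside` is a made-up name. ASCII delimiters are tested at the code unit level, while the content in between is advanced by decoded code point.

```d
import std.range.primitives : empty, front, popFront;

// Hypothetical helpers (assumed, not part of Phobos): code unit access on a string.
char frontUnit(string s) { return s[0]; }
void popFrontUnit(ref string s) { s = s[1 .. $]; }

// Returns the number of code points between '<' and '>', or -1 if malformed.
ptrdiff_t countInside(string s)
{
    // '<' is ASCII, so comparing the first code unit is enough: no decoding.
    if (s.empty || s.frontUnit != '<') return -1;
    s.popFrontUnit();

    ptrdiff_t n;
    // In UTF-8 an ASCII byte never appears inside a multi-byte sequence,
    // so testing the leading code unit against '>' is safe here.
    while (!s.empty && s.frontUnit != '>')
    {
        dchar c = s.front; // decodes one full code point
        s.popFront();      // advances past that code point
        ++n;
    }

    if (s.empty || s.frontUnit != '>') return -1;
    s.popFrontUnit();
    return s.empty ? n : -1;
}

void main()
{
    import std.stdio : writeln;
    writeln(countInside("<h\u00E9llo>")); // prints 5
}
```

The point of the sketch is that both access levels operate on the same `string`, so skipping a code unit and then decoding a code point stay in sync.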

This means a single string type that exposes various _synchronized_ range levels (code unit, code point, grapheme), doesn't it? As opposed to Andrei's approach, in which ranges are structures external to string types, IIUC, and thus advance independently?

One thing that might be interesting is benchmarking my XML parser by
replacing every instance of frontUnit and popFrontUnit with front and
popFront. That won't change the results, but it'd give us an idea of
the overhead of decoding code points unnecessarily.

Yes, would you have time to do it? I would be interested in such perf measurements. (--> your idea about a Text variant, for which I would like to know whether it's still worth decoding systematically.)
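As a rough idea of what such a measurement could look like, here is a micro-benchmark sketch. It is not the XML parser itself: the function names `sumByCodePoint` and `sumByCodeUnit`, the sample text, and the iteration counts are all made up for illustration, and a real measurement would have to be done on the parser.

```d
import std.array : replicate;
import std.datetime.stopwatch : benchmark;
import std.range.primitives : empty, front, popFront;
import std.stdio : writeln;

// Walks the string by decoded code point (front/popFront on string).
ulong sumByCodePoint(string s)
{
    ulong sum;
    while (!s.empty)
    {
        sum += s.front; // decodes a dchar
        s.popFront();
    }
    return sum;
}

// Walks the string by raw code unit, with no decoding.
ulong sumByCodeUnit(string s)
{
    ulong sum;
    while (s.length)
    {
        sum += s[0];
        s = s[1 .. $];
    }
    return sum;
}

void main()
{
    // Mostly ASCII input with a few multi-byte characters (assumed workload).
    string text = "<tag>h\u00E9llo w\u00F6rld</tag>".replicate(1_000);
    ulong sink; // accumulate results so the work cannot be optimized away

    auto times = benchmark!(
        () { sink += sumByCodePoint(text); },
        () { sink += sumByCodeUnit(text); })(100);

    writeln("by code point: ", times[0]);
    writeln("by code unit:  ", times[1]);
    writeln("checksum: ", sink);
}
```

The timings depend heavily on the compiler, the optimization flags, and the share of non-ASCII text, so this only hints at the order of magnitude of the decoding overhead.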

Denis
_________________
vita es estrany
spir.wikidot.com

