On 10/05/2012 12:49 AM, Henri Sivonen wrote:
On Tue, Apr 10, 2012 at 4:42 PM, Niko Matsakis <[email protected]> wrote:
On 4/10/12 5:53 AM, Henri Sivonen wrote:
The use case I have is targeting Rust with the translator that
currently targets C++ and generates the HTML parser in Gecko. (It uses
goto hidden behind macros to emulate break and continue by label in
C++.)
There is currently no way to do that kind of control flow beyond using flags
with `if` checks or restructuring the code in some other way (tail calls, if
they worked, seem like they would be useful). I believe our `break` can
only target loops in any case.
How hard would it be do you think to prototype a version that avoids these
control-flow features? Also, how important is fall-through for alt vs break
to labeled blocks?
Sorry about the ridiculously long delay. Unfortunate things got in my way.
The HTML parser has the following important break/continue patterns:
The basic pattern of the tokenizer is
stateloop: for (;;) {
switch (state) {
...
case SOME_STATE:
somestateloop: for (;;) {
if (++pos == endPos) {
break stateloop;
}
c = buf[pos];
switch (c) {
case '\r':
silentCarriageReturn();
break stateloop;
case '\n':
silentLineFeed();
// fall thru
case ' ':
case '\t':
case '\u000C':
continue;
case '/':
...
state = WHATEVER_STATE;
continue stateloop;
case '\u0000':
c = '\uFFFD';
// fall thru
case '<':
case '=':
errBadCharOrNull(c);
// fall thru to treat like anything else
default:
...
state = OTHER_STATE;
continue stateloop;
}
}
case OTHER_STATE:
...
}
}
Key characteristics:
1) There's a huge conditionless loop with a huge switch right inside
with a case for each tokenizer state.
2) State cases contain an inner conditionless loop.
3) If the position advances to the end of the buffer or there is a
carriage return, the inner loop breaks out of both loops.
4) State cases never need a break statement to end a state, because
states are exited by assigning to the state variable and then
immediately continuing back to the top of the state loop.
5) Inside the inner loop right after advancing the position and
reading another character from the buffer, there is the switch on the
newly-read character.
6) The switch on the character never needs a plain break statement for
breaking out of a switch case, either. Instead, each case either falls
through, continues the inner loop or continues the state loop.
7) The linefeed case always calls a method and then falls through to
the case for space, tab on a form feed.
8) There are also fall through this around the handling of nul and
errors. In this example, the nul case changes the character to the
replacement character and then falls through to the same cases a
couple of erroneous characters and then that case, after reporting
error, falls through to the "anything else" state.
It would be possible to remove the fall-throughs in the inner switch
by automatically duplicating code from the subsequent cases. It would
be possible to change continuing the stateloop from within the inner
loop to breaking from the inner loop. It would be possible to change
breaking the outer loop from within the inner loop by introducing a
boolean exit condition for the outer loop and setting it in the inner
loop and then breaking the inner loop.
In practice, the tokenizer state loop isn't quite like presented
above. In the example, there's a transition to OTHER_STATE which
happens to be the next state in the outer switch. To avoid checking
the state variable at the top of the loop in that case, I have used
the following micro optimization instead:
...
state = OTHER_STATE;
break somestateloop;
}
}
case OTHER_STATE:
...
That is, the code breaks the inner loop and then ends up falling
through between cases in the outer switch.
This characteristic could be removed by automatically rewriting the
code to the basic case presented first.
The tree builder is more problematic. It looks like this:
starttagloop: for (;;) {
switch (mode) {
case FIRST_MODE:
// This inner loop never loops. It's only to be broken as a "goto" to the
// next mode.
firstmodeloop: for (;;) {
switch (element) {
case foo:
...
break starttagloop;
case bar:
...
mode = OTHER_MODE;
continue starttagloop;
case baz:
if (...) {
...
// move onto next mode
break firstmodeloop:
}
...
break starttagloop;
case qux:
...
continue starttagloop;
default:
// move onto next mode
break firstmodeloop:
}
}
case SECOND_MODE:
...
}
}
So in the tree builder, there is an outer loop and a switch on the
tree builder mode inside it. The cases in this switch are *huge* and
they fall through to the next case as a chain. Eliminating
fall-through by making the translator duplicate code automatically
would result in a massive amounts of duplicated code. Making the
translator generate a method for each mode would be error-prone and
messy and would introduce needless recursion where the above example
breaks or continues the outer loop and a method-per-mode version would
call a method for another mode. It would probably be less messy to
break the outer switch into one "if" statement per mode and making the
conditions of the form (mode <= SECOND_MODE) to make fall-throughs
also enter the next "if"s.
It seems to me that with the current facilities provided by Rust, it
would be fairly straightforward to make the translator change the
structure of the tokenizer to get rid of the language features of Java
that Rust does not have, but it would involve making the translator
duplicate some code in the inner switch cases. Dealing with the tree
builder by deconstructing the outer switch into a series of "if"s and
introducing helper variables for breaking or continuing the outer loop
from within the inner loop would be possible but more special-casey.
That is, at present, it appears that automatically translating the
HTML parser to Rust would be possible without language changes to
Rust, but being able to apply "break" or "again" to an outer loop and
having a statement for continuing from the end of an "alt" branch to
the next branch in the same "alt" would make the translation nicer.
I think adding labeled blocks/loops and the ability to break/continue with a
label is plausible, but a fair bit of work. Fall-through in alt seems less
likely.
Even break and continue by label would be useful without fall-through in alt.
Thanks, Henri. I've posted this information to our bugtracker:
https://github.com/mozilla/rust/issues/2216
_______________________________________________
Rust-dev mailing list
[email protected]
https://mail.mozilla.org/listinfo/rust-dev