Hi again Brian,

----- Original Message -----
From: "shire"
Sent: Monday, April 06, 2009


Hey Matt,

Matt Wilmas wrote:
Yep, 5.3's snapshot self-compiled from a couple days ago on Windows (not
that that should matter). (I'm not regenerating it with re2c, which also
shouldn't matter; using the existing .c file. I haven't touched the
scanner stuff in a long time (yet) to regen.) Scanner of course hasn't
changed since then.

Here's what I'm currently doing (more or less with some changed paths):
[...]
  [1]=>
  array(3) {
    [0]=>
    int(366)
    [1]=>
"   string(57) "// this comment and trailing blank contain windows CR+LF
    [2]=>
    int(2)
  }

As a side note, I just noticed that the full Windows newline (\r\n, CR+LF) isn't consumed with the comment the way *nix's \n is: only the \r goes with the comment, and the \n ends up in the WHITESPACE token after it. See the " before string(57)? I guess that's the CR resetting the cursor to the start of the line without moving to the next one. It's this rule:

<ST_ONE_LINE_COMMENT>[^\n\r?%>]*{ANY_CHAR} {

that's only matching the \r before returning T_COMMENT. Simple enough to fix as well, but I hadn't spotted that one until I was trying to see why that quote was out of place. :-) (This isn't new in 5.3 though...)
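
To make the CR+LF behavior concrete, here is a rough sketch that uses Python's re module as a stand-in for the re2c rule above. It assumes ANY_CHAR matches any single character (newlines included), it ignores the scanner action that special-cases ?, % and >, and it treats the leading // as part of the match for simplicity. ADJUSTED_RULE is only one hypothetical way to take the CR+LF as a unit, not a committed fix.

```python
import re

# Python stand-in for the re2c rule [^\n\r?%>]*{ANY_CHAR}.
# Assumption: ANY_CHAR matches any single character, newlines included.
CURRENT_RULE = re.compile(r"[^\n\r?%>]*[\s\S]")

# Hypothetical adjustment (not a committed fix): take CR+LF as one unit.
ADJUSTED_RULE = re.compile(r"[^\n\r?%>]*(?:\r\n|[\s\S])")


def consumed(rule, source):
    """Return the text the rule would consume from the start of source."""
    match = rule.match(source)
    return match.group(0) if match else ""


comment = "// this comment and trailing blank contain windows CR+LF"
unix_source = comment + "\n\n"
windows_source = comment + "\r\n\r\n"

# *nix input: the LF is taken together with the comment.
print(repr(consumed(CURRENT_RULE, unix_source)))
# Windows input: only the CR is taken, the LF is left for the next token.
print(repr(consumed(CURRENT_RULE, windows_source)))
print(len(consumed(CURRENT_RULE, windows_source)))  # 57
# Adjusted rule: the whole CR+LF goes with the comment.
print(repr(consumed(ADJUSTED_RULE, windows_source)))
```

The length of 57 lines up with the string(57) in the dump (56 visible characters plus the CR), and the three characters left over (\n\r\n) line up with the string(3) whitespace token that follows.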

  [2]=>
  array(3) {
    [0]=>
    int(371)
    [1]=>
    string(3) "

"
    [2]=>
    int(2)
  }
}


The newlines look like this in the second file:

<?php$
// this comment and trailing blank contain windows CR+LF^M$
^M$

Unfortunately I can't test on a windows build, perhaps you could re-test
or share your reproduction that fails as this seems to work for me unless
I'm of course missing some difference.


Test case is the one in the bug report. :-) Last token is not the
comment, but whitespace.

There are two reproductions in the bug report ;-)

Oops, forgot about the second one -- I meant the first in the initial report. The part I'm talking about is: "It only seems to occur if there isn't a newline behind the comment." So the easiest way to see is simply:

var_dump(token_get_all('<?php // test'));

array(1) {
 [0]=>
 array(3) {
   [0]=>
   int(368)
   [1]=>
   string(6) "<?php "
   [2]=>
   int(1)
 }
}

Also, the unterminated comment Warning is still missing with "<?php /*
blah " like it's been since the re2c change (except maybe for the time
your fix was applied). My changes would clean this up of course, unless
you do something first.

I think fixing this would be great as well as the other highlighter test
that was changed.  I would just prefer that the scanner handle these
rather than us implementing what is essentially a hand-written scanner
within the lexer file.

Yeah, I remember you said that last time. :-) But, like the inline HTML scanner part you mentioned then, if it's pretty simple to implement manually, it seemed logical to me. (I don't know if that stuff was possible with how flex worked; it was only after seeing the HTML scanning that I thought, "Ah.") The regex would've generated more code, and probably wouldn't have made much difference for readability...? (I still wonder if it wasn't used because it wouldn't work with the re2c issues otherwise.) With the string, etc. scanning, my regular expressions are pretty complicated for matching stuff that isn't very complicated. They generate a LOT of code, and they probably aren't that readable or easy to understand, even with the comments. Well anyway, if I do something I'll send it along for analysis!
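
For illustration only, this is roughly what "simple to implement manually" could look like for a one-line comment, sketched in Python rather than in the scanner's C. The function name scan_one_line_comment is hypothetical and this is not the actual Zend scanner code or a proposed patch. It assumes the comment ends after the first newline (LF, CR+LF or a lone CR), before a closing ?> tag, or at end of input, and it leaves out the %> case.

```python
def scan_one_line_comment(source, start):
    """Return the index just past a one-line comment that begins at start.

    Hypothetical hand-written scan, not the actual Zend scanner code.
    The comment ends after the first newline (LF, CR+LF or a lone CR),
    before a closing "?>" tag, or at end of input, whichever comes first.
    """
    i = start
    n = len(source)
    while i < n:
        ch = source[i]
        if ch == "\n":
            return i + 1
        if ch == "\r":
            # Take the LF too when the newline is CR+LF.
            return i + 2 if source.startswith("\n", i + 1) else i + 1
        if source.startswith("?>", i):
            return i  # leave the closing tag for the next token
        i += 1
    return n  # end of input also terminates the comment


for src in ("<?php // test", "<?php // test\r\n\r\n", "<?php // a ?>b"):
    start = src.index("//")
    end = scan_one_line_comment(src, start)
    print(repr(src[start:end]))
```

The first input is the "no newline behind the comment" case from the bug report. In this sketch the comment simply runs to the end of the input, so it would still come back as its own token.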


-shire


- Matt

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
