Edit report at https://bugs.php.net/bug.php?id=54089&edit=1
ID: 54089
User updated by: nicolas dot grekas+php at gmail dot com
Reported by: nicolas dot grekas+php at gmail dot com
Summary: token_get_all with regards to __halt_compiler is not
binary safe
-Status: Closed
+Status: Assigned
Type: Bug
Package: Unknown/Other Function
Operating System: Any
PHP Version: 5.3.5
Assigned To: iliaa
Block user comment: N
Private report: N
New Comment:
Excerpt from internals:
Nikita Popov Fri, Sep 9, 2011 at 09:15
[...]
The problem with the patch is, that there are some tokens after
T_HALT_COMPILER that are of interest, namely the '(' ')' ';'. After
the patch it is impossible to get those tokens, without either
relexing the code after T_HALT_COMPILER (that way you get the binary
data problem back, just with much more complex code) or writing a
regular expression to match it (which is really hard, as there may be
any token dropped by the PHP parser in there, i.e. whitespace,
comments, PHP tags).
[...]
I would like this patch to be reverted on the 5.4 and trunk branches.
I assume it's too late to revert it on 5.3, as it has been there for
some time already. It is just counterproductive. (Alternatively one
could fix token_get_all to return the (); tokens after
__halt_compiler, too, but that would be hard, probably.)
--
Ferenc Kovacs Fri, Sep 9, 2011 at 10:01
I think that it wouldn't be too hard.
>From a quick glance on the code, we should introduce a new local
variable, set that to true where we break now (
http://svn.php.net/viewvc/php/php-src/trunk/ext/tokenizer/tokenizer.c?view=markup#l155
) and don't break there but for the next ';'. another maybe less
confusing solution would be to explicitly add '(', ')' and ';' to the
result in the T_HALT_COMPILER condition before breking out of the
loop.
I will create a patch for this afternoon.
or could there be other important tokens after the __halt_compiler()
which should be present in the token_get_all() result?
--
Nicolas Grekas Fri, Sep 9, 2011 at 10:46
> don't break there but for the next ';'.
You can also just count the number of semantic token after T_HALT_COMPILER (ie
excluding whitespace and comments) and once you hit 3, halt.
> less confusing solution would be to explicitly add '(', ')' and ';' to the
> result in the T_HALT_COMPILER condition before breking out of the loop.
If you mean verifying that '(', ')' and (';' or T_CLOSE_TAG) are effectively
following T_HALT_COMPILER, I think that's part of the syntax analyser's job,
not tokenizer's.
If you're ok with this argument, then just couting 3 tokens is really the most
basic "syntax analysis" we have to do to fix the pb, don't you think?
> could there be other important tokens after the __halt_compiler()
> which should be present in the token_get_all() result?
Maybe the binary data itself, as a big T_INLINE_HTML for example ?
Also, if token_get_all you be made binary safe, that would be very cool ! (no
more eating of \x00-\x1F inside regular code) :)
--
Nikita Popov Fri, Sep 9, 2011 at 19:39
> You can also just count the number of semantic token after T_HALT_COMPILER
> (ie excluding whitespace and comments) and once you hit 3, halt.
> [...]
> Maybe the binary data itself, as a big T_INLINE_HTML for example ?
In favor of both proposals!
Returning the next 3 tokens should be quite easy [1].
Returning the rest as an T_INLINE_HTML makes sense too, as extracting
the data is probably what you want. Though I have no idea how to
implement that ^^
[1]:
https://github.com/nikic/php-src/commit/2d4cfa05947f04de447635ca5748b3b58defbfaf
(Not tested, only guessing)
--
Nikita Popov Tue, Sep 13, 2011 at 09:16
I just set up an PHP environment and wrote a proper patch (including
test changes) to make it collect the next three tokens. It's a git
patch and I'm not sure whether it's compatible with SVN patches. I
would love it if this would go into 5.4 before beta. I didn't know how
one could fetch the rest into T_INLINE_HTML, so I'm hoping on help
here from someone who actually knowns C there :)
Previous Comments:
------------------------------------------------------------------------
[2011-03-17 23:51:25] nicolas dot grekas+php at gmail dot com
Latest 5.3.6 has been released with the patch...
So now token_get_all() stops on T_HALT_COMPILER for ever :)
------------------------------------------------------------------------
[2011-03-10 08:26:30] nicolas dot grekas+php at gmail dot com
Really, the actual patch is a step backward, I can't do things that were easy
before (getting the halt_compiler_offset with token_get_all)...
Please consider reverting it!
------------------------------------------------------------------------
[2011-03-03 15:44:33] nicolas dot grekas+php at gmail dot com
Sorry to reopen. As 5.3.6 is in RC, I just want to be sure my previous comment
has been read. What about reverting the patch ?
------------------------------------------------------------------------
[2011-03-01 10:15:47] nicolas dot grekas+php at gmail dot com
Thanks for the patch. After reading it, I'm not sure it really helps,
considering that the stop on T_HALT_COMPILER was already easily feasible in
plain PHP. In fact, it may be worse, because now if I want to access data after
T_HALT_COMPILER in PHP using tokenizer, I have to write more code, as the data
is missing from the token array.
As a corner case also, __halt_compiler is always followed by 3 valid tokens:
"(", ")" then ";" or T_CLOSE_TAG, with any number of
T_WHITESPACE/T_COMMENT/T_DOC_COMMENT between.
My view is that this "bug" can be fixed by introducing a new
T_UNEXPECTED_CHARACTER token type, matching those "Unexpected character in
input" warnings: this would fix token_get_all binary unsafeness. Is it a good
idea? I don't know if it's difficult to implement, nor if it would introduce
any BC break, so maybe a "Won't fix" on this bug is enough?
Could the patch be reverted? I'm afraid it's the best for tokenizer users...
Here is what I was using before the patch to work around this binary
incompatibility:
<?php
// New token matching an "Unexpected character in input"
define('T_UNEXPECTED_CHARACTER', -1);
$src_tokens = @token_get_all($code);
$bin_tokens = array();
$offset = 0;
$i = -1;
while (isset($src_tokens[++$i]))
{
$t = isset($src_tokens[$i][1]) ? $src_tokens[$i][1] : $src_tokens[$i];
while ($t[0] !== $code[$offset])
$bin_tokens[] = array(T_UNEXPECTED_CHARACTER, $code[$offset++]);
$offset += strlen($t);
$bin_tokens[] = $src_tokens[$i];
unset($src_tokens[$i]);
}
// Here, $bin_tokens contains binary safe tokens
?>
------------------------------------------------------------------------
[2011-02-28 16:18:35] [email protected]
This bug has been fixed in SVN.
Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
Thank you for the report, and for helping us make PHP better.
------------------------------------------------------------------------
The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
https://bugs.php.net/bug.php?id=54089
--
Edit this bug report at https://bugs.php.net/bug.php?id=54089&edit=1