Re: [PHP-DEV] [PATCH] Scanner diet with fixes, etc.

2009-05-04 Thread shire



Hey Matt,

Thanks for posting, sorry for not having a chance to reply to this sooner.  
Maybe couple things from the patch,


+/* To save initial string length after scanning to first variable, 
CG(doc_comment_len) can be reused */
+#define double_quotes_scanned_len CG(doc_comment_len)
+


(minor) Maybe we should rename this var if we're going to use it for other 
purposes, this doesn't really save any typing.  Also if we do want the define 
maybe we should upper case it so it's more obvious?


+   while (YYCURSOR  YYLIMIT) {
+   switch (*YYCURSOR++) {


In the example above, which we have a couple examples of here, we don't obey 
the YYFILL macro to detect if we have exceeded our EOF.  This *might* be a 
problem, but only really depends on if we intend to use the YYFILL as a 
solution for exceeding our mmap bounds.

Regarding the ZEND_MMAP_AHEAD issue and the temp. fix that Dmitry put in we 
need to find a solution to that, perhaps I can play with that this week too as 
I think I'm seeing some related issues in my testing of 5.3.  Essentially we 
abuse ZEND_MMAP_AHEAD by adding it to the file size and passing it to the mmap 
call which isn't at all valid and only really works up to PAGESIZE.  We could 
possibly use YYFILL to re-allocate more space as necessary past the end of file 
to fix this.

I don't see anything glaring in the patch that's a major issue, I can probably 
test more on a larger code base in the next 2-3 days.  As I've said before this 
seems to be crossing the line of us writing a scanner by hand rather than 
letting re2c do the heavy lifting, but without a modification to re2c to handle 
EOF I don't have an alternative solution currently.  (If we had some way to 
detect which regex we where matching against in the YYFILL that would likely be 
able to handle these bugs, but I didn't see a way to do that easily).

-shire

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] [PATCH] Scanner diet with fixes, etc.

2009-05-04 Thread Dmitry Stogov

Hi Matt,

I wasn't able to look into all details of the patch, but in general I 
like it, as it fixes bugs and makes scanner smaller. I think you can 
commit it.


Although this patch doesn't fix the EOF handling related to mmap().

Thanks. Dmitry.

Matt Wilmas wrote:

Hi guys,

- Original Message -
From: Nuno Lopes
Sent: Thursday, April 30, 2009

The patch looks generally ok. However I'll need a few more days to 
review it carefully and throughly. (you can merge it in the 
meantime  if you want).
I'm just slighty concern with the amount of parsing we are now 
doing  by hand, and with the possible (local) security bugs we might 
be introducing..



Am I understanding this properly, that this addresses the re2c EOF  
bug? So we have an RC planned for next week (freeze Monday evening).  
Can you get this fixed and released by then as Marcus is unable to 
do  this himself?


So this addresses some of the re2c EOF problems, but I don't know if 
it addresses all of them or not. I haven't had the time yet for a full 
review.

Anyway, Matt can surelly comment on this.


Yes, it addresses the re2c EOF issues for strings and comments, as they 
were the problem ones that allowed NULL bytes, and scanned past the EOF 
NULL.  As I said to Dmitry, I'm not sure if it's now possible to remove 
the temporary mmap() fixes that he wanted removed before the next RC 
(??), or if there would still be problems with re2c scanning other 
tokens, even though they can't contain NULLs.  I didn't attempt to make 
any changes there, since I'm not familiar with what's been done.


I just wanted to finally send the patch for others to review, and decide 
what to do, so I won't commit any changes yet in the meantime. :-)



Nuno



- Matt


--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] [PATCH] Scanner diet with fixes, etc.

2009-05-04 Thread Arnaud Le Blanc
Hi,
On Mon, May 4, 2009 at 9:36 AM, shire sh...@php.net wrote:
 Regarding the ZEND_MMAP_AHEAD issue and the temp. fix that Dmitry put in we
 need to find a solution to that, perhaps I can play with that this week too
 as I think I'm seeing some related issues in my testing of 5.3.  Essentially
 we abuse ZEND_MMAP_AHEAD by adding it to the file size and passing it to the
 mmap call which isn't at all valid and only really works up to PAGESIZE.  We
 could possibly use YYFILL to re-allocate more space as necessary past the
 end of file to fix this.

I was thinking of doing something like that with YYFILL too. However
there is a bunch of pointers to take in to account and to update (e.g.
yy_marker, yy_text, etc).

There is mmap(MAP_FIXED) as an other temporary solution. The idea is
to map the entire file rounded up to a whole page size, and to map the
last (already mapped) page to anonymous memory using MAP_FIXED. In the
end we get a readable contiguous memory region with the file data and
ZEND_MMAP_AHEAD.
This should work on many systems as long as the requested address is
already part of the process's address space (which is always the case
here). When this fails we can fallback to malloc/read.

Regards,

Arnaud

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] [PATCH] Scanner diet with fixes, etc.

2009-05-04 Thread shire

Arnaud Le Blanc wrote:

Hi,
On Mon, May 4, 2009 at 9:36 AM, shiresh...@php.net  wrote:

Regarding the ZEND_MMAP_AHEAD issue and the temp. fix that Dmitry put in we
need to find a solution to that, perhaps I can play with that this week too
as I think I'm seeing some related issues in my testing of 5.3.  Essentially
we abuse ZEND_MMAP_AHEAD by adding it to the file size and passing it to the
mmap call which isn't at all valid and only really works up to PAGESIZE.  We
could possibly use YYFILL to re-allocate more space as necessary past the
end of file to fix this.


I was thinking of doing something like that with YYFILL too. However
there is a bunch of pointers to take in to account and to update (e.g.
yy_marker, yy_text, etc).



Yeah, I'm pretty sure that's how most of the example re2c code is setup:

#define YYFILL(n) {cursor = fill(s, cursor);}

uchar *fill(Scanner *s, uchar *cursor){
if(!s-eof){
  unint cnt = s-lim - s-tok;
  uchar *buf = malloc((cnt + 1)*sizeof(uchar));
  memcpy(buf, s-tok, cnt);
  cursor = buf[cursor - s-tok];
  s-pos = buf[s-pos - s-tok];
  s-ptr = buf[s-ptr - s-tok];
  s-lim = buf[cnt];
  s-eof = s-lim; *(s-eof)++ = '\n';
  s-tok = buf;
}
return cursor;
}



-shire

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] [PATCH] Scanner diet with fixes, etc.

2009-05-04 Thread Arnaud Le Blanc
On Mon, May 4, 2009 at 5:51 PM, shire sh...@php.net wrote:
 Arnaud Le Blanc wrote:

 Hi,
 On Mon, May 4, 2009 at 9:36 AM, shiresh...@php.net  wrote:

 Regarding the ZEND_MMAP_AHEAD issue and the temp. fix that Dmitry put in
 we
 need to find a solution to that, perhaps I can play with that this week
 too
 as I think I'm seeing some related issues in my testing of 5.3.
  Essentially
 we abuse ZEND_MMAP_AHEAD by adding it to the file size and passing it to
 the
 mmap call which isn't at all valid and only really works up to PAGESIZE.
  We
 could possibly use YYFILL to re-allocate more space as necessary past the
 end of file to fix this.

 I was thinking of doing something like that with YYFILL too. However
 there is a bunch of pointers to take in to account and to update (e.g.
 yy_marker, yy_text, etc).


 Yeah, I'm pretty sure that's how most of the example re2c code is setup:

 #define YYFILL(n) {cursor = fill(s, cursor);}

 uchar *fill(Scanner *s, uchar *cursor){
    if(!s-eof){
  unint cnt = s-lim - s-tok;
  uchar *buf = malloc((cnt + 1)*sizeof(uchar));
  memcpy(buf, s-tok, cnt);
  cursor = buf[cursor - s-tok];
  s-pos = buf[s-pos - s-tok];
  s-ptr = buf[s-ptr - s-tok];
  s-lim = buf[cnt];
  s-eof = s-lim; *(s-eof)++ = '\n';
  s-tok = buf;
    }
    return cursor;
 }



 -shire


This is what I seen too, but this is not always applicable. The
scanner have code that refers to yy_text, yy_start, yy_cursor,
yy_marker, etc. All those pointers point to the original buffer and
must be updated by fill(). At each point in time the scanner may
rollback to yy_marker or a rule may want to fetch yy_text or yy_start
at any time. So the buffer must be large enough to contain all data
from min(all_of_them) to max(all_of_them). That makes things a little
complicated and potentially less efficient than a big buffer for the
whole file.

Regards,

Arnaud

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] [PATCH] Scanner diet with fixes, etc.

2009-05-04 Thread Matt Wilmas

Hi Dmitry,

- Original Message -
From: Dmitry Stogov
Sent: Monday, May 04, 2009


Hi Matt,

I wasn't able to look into all details of the patch, but in general I like 
it, as it fixes bugs and makes scanner smaller. I think you can commit it.


OK, you mean before the freeze for RC2...?  I'll go ahead and do that later 
unless someone says not to.



Although this patch doesn't fix the EOF handling related to mmap().


:-/


Thanks. Dmitry.


Thanks,
Matt 



--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] [PATCH] Scanner diet with fixes, etc.

2009-05-04 Thread shire


Hey Matt,

Matt Wilmas wrote:


+/* To save initial string length after scanning to first variable,
CG(doc_comment_len) can be reused */
+#define double_quotes_scanned_len CG(doc_comment_len)
+


(minor) Maybe we should rename this var if we're going to use it for
other
purposes, this doesn't really save any typing. Also if we do want the
define maybe we should upper case it so it's more obvious?


Yeah, I tried to think of other ways to do it, but just left it trying
to look like another variable (not to save typing). Well, it can easily
be changed later if a cleaner way is decided...



Yeah I would just prefer if it was more obvious that it is *not* a variable ;-)


+ while (YYCURSOR  YYLIMIT) {
+ switch (*YYCURSOR++) {


In the example above, which we have a couple examples of here, we don't
obey the YYFILL macro to detect if we have exceeded our EOF. This
*might* be a problem, but only really depends on if we intend to use the
YYFILL as a solution for exceeding our mmap bounds.


I don't understand what the problem might be? The YYCURSOR  YYLIMIT
check is what the YYFILL has been doing. If you mean after changes
later, as long as the the whole thing is mmap()'d (which I'm assuming
would be the case?), it just looks like a standard string, with
terminating '\0', right? And there's no reading past YYLIMIT.


Sorry yeah this wouldn't be a problem currently, but only if we try to fix the 
mmap issue by using YYFILL to realloc more space into the buffer.  Then that 
macro would change to something more complicated. (per my previous replies with 
Arnaud)

Have you considered using the lexer STATES and regex's instead of the manual C 
code for scanning the rest.  It seems like if we have a one-char regex match 
for what the C code is doing we could handle this in the lexer without a lot of 
manual intervention (need to look at it more, just a thought I had earlier, the 
expressions are clearer now with your patch applied) ;-)

-shire

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] [PATCH] Scanner diet with fixes, etc.

2009-05-04 Thread Matt Wilmas

Hi Brian,

- Original Message -
From: shire
Sent: Monday, May 04, 2009



Hey Matt,

Matt Wilmas wrote:


+/* To save initial string length after scanning to first variable,
CG(doc_comment_len) can be reused */
+#define double_quotes_scanned_len CG(doc_comment_len)
+


(minor) Maybe we should rename this var if we're going to use it for
other
purposes, this doesn't really save any typing. Also if we do want the
define maybe we should upper case it so it's more obvious?


Yeah, I tried to think of other ways to do it, but just left it trying
to look like another variable (not to save typing). Well, it can easily
be changed later if a cleaner way is decided...



Yeah I would just prefer if it was more obvious that it is *not* a 
variable ;-)


How about this?

#define SET_DOUBLE_QUOTES_SCANNED_LENGTH(len) CG(doc_comment_len) = (len)
#define GET_DOUBLE_QUOTES_SCANNED_LENGTH()CG(doc_comment_len)


+ while (YYCURSOR  YYLIMIT) {
+ switch (*YYCURSOR++) {


In the example above, which we have a couple examples of here, we don't
obey the YYFILL macro to detect if we have exceeded our EOF. This
*might* be a problem, but only really depends on if we intend to use the
YYFILL as a solution for exceeding our mmap bounds.


I don't understand what the problem might be? The YYCURSOR  YYLIMIT
check is what the YYFILL has been doing. If you mean after changes
later, as long as the the whole thing is mmap()'d (which I'm assuming
would be the case?), it just looks like a standard string, with
terminating '\0', right? And there's no reading past YYLIMIT.


Sorry yeah this wouldn't be a problem currently, but only if we try to
fix the mmap issue by using YYFILL to realloc more space into the buffer.
Then that macro would change to something more complicated. (per my
previous replies with Arnaud)


Gotcha.  If something changes, YYFILL -- or something to handle what needs 
to be done -- could just be added to the manual parts as necessary, right?



Have you considered using the lexer STATES and regex's instead of the
manual C code for scanning the rest.  It seems like if we have a one-char
regex match for what the C code is doing we could handle this in the
lexer without a lot of manual intervention (need to look at it more, just
a thought I had earlier, the expressions are clearer now with your patch
applied) ;-)


It seems that matching one-char-at-a-time with re2c would be more 
complicated than the manual way, not to mention slower than the old 
(current) way.


Do you have any objection (well, you've kinda mentioned some :-)) if I'd 
commit the changes in a little while like Dmitry thought could be done?



-shire


- Matt 



--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] [PATCH] Scanner diet with fixes, etc.

2009-05-04 Thread shire

Matt Wilmas wrote:

Hi Brian,

- Original Message -
From: shire
Sent: Monday, May 04, 2009



Hey Matt,

Matt Wilmas wrote:


+/* To save initial string length after scanning to first variable,
CG(doc_comment_len) can be reused */
+#define double_quotes_scanned_len CG(doc_comment_len)
+


(minor) Maybe we should rename this var if we're going to use it for
other
purposes, this doesn't really save any typing. Also if we do want the
define maybe we should upper case it so it's more obvious?


Yeah, I tried to think of other ways to do it, but just left it trying
to look like another variable (not to save typing). Well, it can easily
be changed later if a cleaner way is decided...



Yeah I would just prefer if it was more obvious that it is *not* a
variable ;-)


How about this?

#define SET_DOUBLE_QUOTES_SCANNED_LENGTH(len) CG(doc_comment_len) = (len)
#define GET_DOUBLE_QUOTES_SCANNED_LENGTH() CG(doc_comment_len)



Sure, works for me ;-)



+ while (YYCURSOR  YYLIMIT) {
+ switch (*YYCURSOR++) {


In the example above, which we have a couple examples of here, we don't
obey the YYFILL macro to detect if we have exceeded our EOF. This
*might* be a problem, but only really depends on if we intend to use
the
YYFILL as a solution for exceeding our mmap bounds.


I don't understand what the problem might be? The YYCURSOR  YYLIMIT
check is what the YYFILL has been doing. If you mean after changes
later, as long as the the whole thing is mmap()'d (which I'm assuming
would be the case?), it just looks like a standard string, with
terminating '\0', right? And there's no reading past YYLIMIT.


Sorry yeah this wouldn't be a problem currently, but only if we try to
fix the mmap issue by using YYFILL to realloc more space into the buffer.
Then that macro would change to something more complicated. (per my
previous replies with Arnaud)


Gotcha. If something changes, YYFILL -- or something to handle what
needs to be done -- could just be added to the manual parts as
necessary, right?


Have you considered using the lexer STATES and regex's instead of the
manual C code for scanning the rest. It seems like if we have a one-char
regex match for what the C code is doing we could handle this in the
lexer without a lot of manual intervention (need to look at it more, just
a thought I had earlier, the expressions are clearer now with your patch
applied) ;-)


It seems that matching one-char-at-a-time with re2c would be more
complicated than the manual way, not to mention slower than the old
(current) way.

Do you have any objection (well, you've kinda mentioned some :-)) if I'd
commit the changes in a little while like Dmitry thought could be done?


Well I'm wondering if something more along these lines (just did this on-top of 
your patch as you cleaned up a lot) might be more appealing.  (I'm not sure how 
much slower this would be than the current implementation, obviously it'll be 
somewhat slower, I'm basically just doing what you did in C but in the scanner 
instead of course).

ST_IN_SCRIPTING#|// {
BEGIN(ST_EOL_COMMENT);
yymore();
}

ST_EOL_COMMENT({NEWLINE}|%|?) {
char tmp = *(YYCURSOR-1);
if ((tmp == '%'  CG(asp_tags)) | tmp == '?') {
  YYCURSOR -= 2;
}
CG(zend_lineno)++;
BEGIN(ST_IN_SCRIPTING);
return T_COMMENT;
}

ST_EOL_COMMENT{ANY_CHAR} {
if (YYCURSOR = YYLIMIT) {
  BEGIN(ST_IN_SCRIPTING);
  return T_COMMENT;
}
yymore();
}



Let me know what the thoughts are on the above, if we don't want that then I 
say yeah, commit away!

-shire





--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] [PATCH] Scanner diet with fixes, etc.

2009-05-04 Thread shire

Matt Wilmas wrote:


Gotcha. If something changes, YYFILL -- or something to handle what
needs to be done -- could just be added to the manual parts as
necessary, right?



Sorry forget to reply on this one, but yeah we'd have to do a manual call to 
YYFILL or a check or whatever we come up with wherever we're scanning ahead 
manually.  (there's probably some other parts that might need this as well).

-shire

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] [PATCH] Scanner diet with fixes, etc.

2009-05-04 Thread Matt Wilmas

Hi Brian,

- Original Message -
From: shire
Sent: Monday, May 04, 2009


Matt Wilmas wrote:

[...]
How about this?

#define SET_DOUBLE_QUOTES_SCANNED_LENGTH(len) CG(doc_comment_len) = (len)
#define GET_DOUBLE_QUOTES_SCANNED_LENGTH() CG(doc_comment_len)



Sure, works for me ;-)


Cool. :-)


[...]

Have you considered using the lexer STATES and regex's instead of the
manual C code for scanning the rest. It seems like if we have a one-char
regex match for what the C code is doing we could handle this in the
lexer without a lot of manual intervention (need to look at it more, 
just

a thought I had earlier, the expressions are clearer now with your patch
applied) ;-)


It seems that matching one-char-at-a-time with re2c would be more
complicated than the manual way, not to mention slower than the old
(current) way.

Do you have any objection (well, you've kinda mentioned some :-)) if I'd
commit the changes in a little while like Dmitry thought could be done?


Well I'm wondering if something more along these lines (just did this
on-top of your patch as you cleaned up a lot) might be more appealing.
(I'm not sure how much slower this would be than the current
implementation, obviously it'll be somewhat slower, I'm basically just
doing what you did in C but in the scanner instead of course).

ST_IN_SCRIPTING#|// {
BEGIN(ST_EOL_COMMENT);
yymore();
}

ST_EOL_COMMENT({NEWLINE}|%|?) {
char tmp = *(YYCURSOR-1);
if ((tmp == '%'  CG(asp_tags)) | tmp == '?') {
  YYCURSOR -= 2;
}
CG(zend_lineno)++;
BEGIN(ST_IN_SCRIPTING);
return T_COMMENT;
}

ST_EOL_COMMENT{ANY_CHAR} {
if (YYCURSOR = YYLIMIT) {
  BEGIN(ST_IN_SCRIPTING);
  return T_COMMENT;
}
yymore();
}



Let me know what the thoughts are on the above, if we don't want that
then I say yeah, commit away!


Wouldn't it be a little more complicated for strings/heredocs than comments? 
Or not, haven't thought about it much! :-)  And you still need the manual 
use of YYCURSOR, etc.  In other words, to me, the scanner rules are doing 
what the manual switch ()'s case statements do, but in a slower, 
roundabout way.


Well, I'm gonna be away for a bit now, but I guess I can commit away when I 
get back.



-shire


- Matt 



--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] [PATCH] Scanner diet with fixes, etc.

2009-04-30 Thread Nuno Lopes
The patch looks generally ok. However I'll need a few more days to review it 
carefully and throughly. (you can merge it in the meantime if you want).
I'm just slighty concern with the amount of parsing we are now doing by 
hand, and with the possible (local) security bugs we might be introducing..


Nuno

- Original Message -

Hi Dmitry, Brian, all,

Here's a scanner patch that I mentioned awhile ago, with a possible way to 
work around the re2c EOF handling issues.


The primary change is to do a manual scan like I talked about in areas 
that match large amounts and can contain NULL bytes (strings/comments, 
which are now scanned faster too), as is done for inline HTML.  I called 
it a diet :-) because it removes my complicated string regex patterns 
from a couple years ago, which doesn't make the .l file much smaller after 
adding the manual scan code (easier to understand...?), but it does result 
in a ~34k reduction of 5.3's generated .c file...


This fixes Bug #46817, as well as a better, more proper fix for the older 
Bug #42767, both related to ending comments.


Now inline HTML chunks aren't broken up when a tag starting with s is 
encountered (script for JS, span, etc.), since it's unlikely to be a 
long PHP script tag.


If an opening PHP SCRIPT tag was used with a capital S, it was missed 
if it wasn't the first thing scanned:


var_dump(token_get_all(HTML... SCRIPT language=php));

Single-line comments with a Windows newline didn't include the full \r\n:

var_dump(token_get_all(?php // Comment\r\n?));

Finally, part of the optimized scanning is that, for double quoted 
strings, when the first variable is encountered (making it non-constant), 
the amount that's been scanned up to that point is remembered, which can 
then be skipped over (up to the variable) after returning the quote token. 
Previously that initial part of the string was rescanned -- the cost 
dependent on how far into the string the first var is.



I think that's about all --  I'll send another message if I forgot to 
mention anything...  Just wanted to send this along quick for to you guys 
to look at or whatever.  It was basically done last week, I just had to do 
a couple finishing touches and verify that everything was OK.


http://realplain.com/php/scanner_diet.diff (Merged changes, but didn't 
test yet.)

http://realplain.com/php/scanner_diet_5_3.diff


Thanks,
Matt 



--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] [PATCH] Scanner diet with fixes, etc.

2009-04-30 Thread Lukas Kahwe Smith


On 30.04.2009, at 19:18, Nuno Lopes wrote:

The patch looks generally ok. However I'll need a few more days to  
review it carefully and throughly. (you can merge it in the meantime  
if you want).
I'm just slighty concern with the amount of parsing we are now doing  
by hand, and with the possible (local) security bugs we might be  
introducing..



Am I understanding this properly, that this addresses the re2c EOF  
bug? So we have an RC planned for next week (freeze Monday evening).  
Can you get this fixed and released by then as Marcus is unable to do  
this himself?


regards,
Lukas Kahwe Smith
m...@pooteeweet.org




--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] [PATCH] Scanner diet with fixes, etc.

2009-04-30 Thread Nuno Lopes
The patch looks generally ok. However I'll need a few more days to 
review it carefully and throughly. (you can merge it in the meantime  if 
you want).
I'm just slighty concern with the amount of parsing we are now doing  by 
hand, and with the possible (local) security bugs we might be 
introducing..



Am I understanding this properly, that this addresses the re2c EOF  bug? 
So we have an RC planned for next week (freeze Monday evening).  Can you 
get this fixed and released by then as Marcus is unable to do  this 
himself?


So this addresses some of the re2c EOF problems, but I don't know if it 
addresses all of them or not. I haven't had the time yet for a full review.

Anyway, Matt can surelly comment on this.

Nuno 



--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] [PATCH] Scanner diet with fixes, etc.

2009-04-30 Thread Matt Wilmas

Hi guys,

- Original Message -
From: Nuno Lopes
Sent: Thursday, April 30, 2009

The patch looks generally ok. However I'll need a few more days to 
review it carefully and throughly. (you can merge it in the meantime  if 
you want).
I'm just slighty concern with the amount of parsing we are now doing  by 
hand, and with the possible (local) security bugs we might be 
introducing..



Am I understanding this properly, that this addresses the re2c EOF  bug? 
So we have an RC planned for next week (freeze Monday evening).  Can you 
get this fixed and released by then as Marcus is unable to do  this 
himself?


So this addresses some of the re2c EOF problems, but I don't know if it 
addresses all of them or not. I haven't had the time yet for a full 
review.

Anyway, Matt can surelly comment on this.


Yes, it addresses the re2c EOF issues for strings and comments, as they were 
the problem ones that allowed NULL bytes, and scanned past the EOF NULL.  As 
I said to Dmitry, I'm not sure if it's now possible to remove the temporary 
mmap() fixes that he wanted removed before the next RC (??), or if there 
would still be problems with re2c scanning other tokens, even though they 
can't contain NULLs.  I didn't attempt to make any changes there, since I'm 
not familiar with what's been done.


I just wanted to finally send the patch for others to review, and decide 
what to do, so I won't commit any changes yet in the meantime. :-)



Nuno



- Matt 



--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



[PHP-DEV] [PATCH] Scanner diet with fixes, etc.

2009-04-29 Thread Matt Wilmas

Hi Dmitry, Brian, all,

Here's a scanner patch that I mentioned awhile ago, with a possible way to 
work around the re2c EOF handling issues.


The primary change is to do a manual scan like I talked about in areas 
that match large amounts and can contain NULL bytes (strings/comments, which 
are now scanned faster too), as is done for inline HTML.  I called it a 
diet :-) because it removes my complicated string regex patterns from a 
couple years ago, which doesn't make the .l file much smaller after adding 
the manual scan code (easier to understand...?), but it does result in a 
~34k reduction of 5.3's generated .c file...


This fixes Bug #46817, as well as a better, more proper fix for the older 
Bug #42767, both related to ending comments.


Now inline HTML chunks aren't broken up when a tag starting with s is 
encountered (script for JS, span, etc.), since it's unlikely to be a 
long PHP script tag.


If an opening PHP SCRIPT tag was used with a capital S, it was missed if 
it wasn't the first thing scanned:


var_dump(token_get_all(HTML... SCRIPT language=php));

Single-line comments with a Windows newline didn't include the full \r\n:

var_dump(token_get_all(?php // Comment\r\n?));

Finally, part of the optimized scanning is that, for double quoted strings, 
when the first variable is encountered (making it non-constant), the amount 
that's been scanned up to that point is remembered, which can then be 
skipped over (up to the variable) after returning the quote token. 
Previously that initial part of the string was rescanned -- the cost 
dependent on how far into the string the first var is.



I think that's about all --  I'll send another message if I forgot to 
mention anything...  Just wanted to send this along quick for to you guys to 
look at or whatever.  It was basically done last week, I just had to do a 
couple finishing touches and verify that everything was OK.


http://realplain.com/php/scanner_diet.diff (Merged changes, but didn't test 
yet.)

http://realplain.com/php/scanner_diet_5_3.diff


Thanks,
Matt 



--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php