Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-23 Thread Marcus Boerger
Hello Stanislav,

  cool, care to change the code snippet into a test as I've done for Rui's
  snippet?

marcus

Sunday, March 23, 2008, 5:06:53 AM, you wrote:

 is broken code and not a single test. If this is not going to change as in
 we are not getting any .phpt files for this feature then there are two

 As I understand the theory of the thing should be pretty simple, you set 
 input encoding (by config or declare) and internal encoding, and then 
 when script is being read, you convert it from input to internal.
 However, it appears that since flex couldn't stomach certain encodings, 
 there's also a hack there - script is translated from input to some 
 safe encoding for flex, and then strings are translated back to 
 internal encoding after flex processes them. If re2c can deal with 
 encodings like SJIS without trouble then some of the hacks might be 
 unnecessary. I think encodings that need to be checked are those in 
 zend_multibyte.c that have compatible flag off.

 Here's a short script example I found that shows what's the problem there:

 <?php echo 'ソ'; ?>

 Character echoed there is U+30BD Katakana letter SO. Now if you run it 
 in UTF-8, works good. However, if you recode it to Shift-JIS, it won't 
 run, since this script looks to the parser this way:

 <?php echo '83\'; ?>
 (that's dump of VI output, so replace 83 with actual 0x83 if you 
 compose it). That's parse error for the parser, if parsed naively. So 
 somehow the parser needs to know 0x83+\ is actually U+30BD and at the 
 same time the user still might want it as 0x83+\ in a zval (or maybe as 
 utf-8 - it depends on him).
 -- 
 Stanislav Malyshev, Zend Software Architect
 [EMAIL PROTECTED]   http://www.zend.com/
 (408)253-8829   MSN: [EMAIL PROTECTED]




Best regards,
 Marcus


-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-22 Thread Marcus Boerger
Hello Alan, Andi, Rui,

  my impression still is that not a single person uses this crap. I only
hear of people claiming they have heard that people use it. But what I see
is broken code and not a single test. If this is not going to change as in
we are not getting any .phpt files for this feature then there are two
ways. First I implement something that I personally would expect and I
wouldn't care about anything that is there right now or second we simply
get rid of it completely.

So far I have extended re2c to make it easier to deal with other encodings
and even allow multiple char widths at the same time. So I did my homework.
Now I expect that somebody writes tests! Then we could provide a scanner
that works on UCS-2 or on UTF-32 and then try to identify the script
encoding. Then work on the extended charset and do a reverse encoding if
necessary for output. Though even thinking about this approach (still like
what we seem to have right now) really hurts me very badly because it is
the wrong approach. What we want is a working HEAD.

marcus

Monday, March 3, 2008, 4:19:24 PM, you wrote:

 a few replaces with this file should be a good testcase
 - probably worth testing
 * comments with these characters in them, both /* and //
 * strings with these characters in them.
  lynx -source 
 'http://smontagu.damowmow.com/genEncodingTest.cgi?family=windows&codepage=950'
 | grep test | grep -v testcase

 I have definitely seen code with Chinese characters in comments and 
 strings and a few times function names and variable names with Chinese 
 characters...
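
 The kind of test input suggested above can be sketched roughly as follows.
 This is an illustration in Python, not part of the PHP test suite: it looks
 for Big5 (code page 950) characters whose second byte is 0x5C, the ASCII
 backslash, and places one in both comment styles and in a string. The
 function names are hypothetical, and Python's cp950 codec is assumed to be
 close enough to the encoding the generator page serves.

```python
# Sketch only: find cp950 characters whose trail byte is 0x5C ("\") and
# wrap one of them in PHP comments and a string as raw scanner input.


def backslash_trail_chars(encoding="cp950"):
    """Return characters that encode to a lead byte followed by 0x5C."""
    found = []
    for lead in range(0x81, 0xFF):
        try:
            text = bytes([lead, 0x5C]).decode(encoding)
        except UnicodeDecodeError:
            continue  # no character is mapped to this byte pair
        if len(text) == 1:
            found.append(text)
    return found


def make_test_script(char, encoding="cp950"):
    """Build a tiny PHP script using the character in /* */, // and a string."""
    text = (
        "<?php\n"
        f"/* {char} */\n"
        f"// {char}\n"
        f"echo '{char}';\n"
        "?>\n"
    )
    return text.encode(encoding)


chars = backslash_trail_chars()
print(len(chars), "characters end in byte 0x5C")
script = make_test_script(chars[0])
```

 Each generated script could then be saved as raw bytes and fed to a build
 with multibyte support enabled to see how the scanner treats it.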

 Regards
 Alan


 Marcus Boerger wrote:
 Hello Alan,

   be my hero then :-) Could you generate a few tests for the multibyte
 support so that we know how it is used right now and what we need to take
 care of?

 marcus

 Monday, March 3, 2008, 12:48:44 AM, you wrote:

   
 Can you clarify the Multibyte issues:
 - I presume this means that it can handle ASCII/UTF8/16 etc. but will 
 not handle things like BIG5/GB encoding in source code - this may be a 
 bit of an issue around here..
 

   
 Regards
 Alan
 


   
 Marcus Boerger wrote:
 
 RFC: REPLACE THE FLEX-BASED SCANNER WITH AN RE2C [1] BASED LEXER

 Situation:
 The current flex-based lexer depends on an outdated and unsupported flex
 version. Alternatives include either updating to a newer version of flex or
 using re2c, which we already use for a variety of things (serializing, pdo
 sql scanning, date/time parsing). While moving towards a newer flex version
 would be much easier, switching to re2c promises a much faster lexer.
 Actually, without any specific re2c optimizations we already get around a
 20% scanner performance increase. Running the tests gets an overall speedup
 of 2%. It is arguable whether this is enough, but re2c has more advantages.
 First of all, re2c allows one to scan any type of input (ASCII, UTF-8,
 UTF-16, UTF-32). Secondly, it allows for better integration with Lemon [2],
 which would be the next step. And thirdly we can switch to a reentrant
 scanner.

 Current state:
 Flex has been fully replaced by re2c in Zend. We have also switched to an
 mmap-based lexer approach for now. However, we had to drop multibyte
 support as well as the encoding declare. The current state can be checked
 out from Scott's subversion repository [3] and you can follow the
 development on his Trac setup [4]. When you want to build php with re2c,
 then you need to grab re2c from its sourceforge subversion repository [5].
 You can also check out the changes in a patch created Sunday 2nd March
 against a PHP checkout from 14th February [6].

 Further steps:
 Commit this to PHP 5.3. Synch to HEAD. Add pecl/intl to 5.3.
 Discuss/recreate multibyte support with libintl.

 Future steps:
 Replace bison with lemon in PHP 5.4 or HEAD.

 Time Frame:
 Commit to 5.3 between the 5th and the 15th of March. Synch to HEAD a couple
 of days later. Moving pecl/libintl to ext (depends on the 5.3 RMs
 decision). After that is done, decide about multibyte support. Along with
 the commit to the 5.3 branch there will be a new re2c version available.


 Marcus Boerger
 Nuno Lopes
 Scott MacVicar


 [1] http://re2c.org/
 [2] http://www.hwaci.com/sw/lemon/
 [3] svn://whisky.macvicar.net/php-re2c
 [4] http://trac.macvicar.net/php-re2c/
 [5] https://re2c.svn.sourceforge.net/svnroot/re2c/trunk/re2c
 [6] http://php.net/~helly/php-re2c-20080302.diff.txt



   
   





 Best regards,
  Marcus

   



Best regards,
 Marcus





Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-22 Thread Stanislav Malyshev

is broken code and not a single test. If this is not going to change as in
we are not getting any .phpt files for this feature then there are two


As I understand the theory of the thing should be pretty simple, you set 
input encoding (by config or declare) and internal encoding, and then 
when script is being read, you convert it from input to internal.
However, it appears that since flex couldn't stomach certain encodings, 
there's also a hack there - script is translated from input to some 
safe encoding for flex, and then strings are translated back to 
internal encoding after flex processes them. If re2c can deal with 
encodings like SJIS without trouble then some of the hacks might be 
unnecessary. I think encodings that need to be checked are those in 
zend_multibyte.c that have compatible flag off.
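
The two-step conversion described above can be sketched as follows. This is
Python with its built-in codecs standing in for zend_multibyte, purely to make
the data flow concrete. UTF-8 is assumed here as the "safe" scanner encoding
and Shift-JIS as both input and internal encoding; the helper name
to_encoding is hypothetical.

```python
# Sketch only: "convert to a scanner-safe encoding, scan, then convert the
# string literals back", using Python codecs instead of zend_multibyte.


def to_encoding(data, source, target):
    """Re-encode bytes from one named encoding to another."""
    return data.decode(source).encode(target)


script_sjis = "<?php echo 'ソ'; ?>".encode("shift_jis")

# Step 1: input encoding -> an encoding a byte scanner can handle. In UTF-8
# no byte of a multibyte sequence falls in the ASCII range.
scanner_input = to_encoding(script_sjis, "shift_jis", "utf-8")

# Step 2: the scanner finds the literal between the single quotes.
start = scanner_input.index(b"'") + 1
end = scanner_input.index(b"'", start)
literal_utf8 = scanner_input[start:end]

# Step 3: convert the literal to the internal encoding.
literal_internal = to_encoding(literal_utf8, "utf-8", "shift_jis")
print(literal_internal.hex())  # 835c
```

If the scanner itself could consume Shift-JIS correctly, steps 1 and 3 would
only be needed when input and internal encoding differ.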


Here's a short script example I found that shows what's the problem there:

<?php echo 'ソ'; ?>

Character echoed there is U+30BD Katakana letter SO. Now if you run it 
in UTF-8, works good. However, if you recode it to Shift-JIS, it won't 
run, since this script looks to the parser this way:


<?php echo '83\'; ?>
(that's dump of VI output, so replace 83 with actual 0x83 if you 
compose it). That's parse error for the parser, if parsed naively. So 
somehow the parser needs to know 0x83+\ is actually U+30BD and at the 
same time the user still might want it as 0x83+\ in a zval (or maybe as 
utf-8 - it depends on him).
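
A minimal sketch of the failure described above, in Python so it can be run
without a PHP build. The function naive_string_end is a hypothetical stand-in
for a byte-oriented scanner rule. It is not the actual flex or re2c rule, only
the simplest logic that treats every 0x5C byte as a backslash escape.

```python
# Illustration only: why a byte-oriented scanner trips over Shift-JIS input.

SOURCE = "<?php echo 'ソ'; ?>"  # U+30BD KATAKANA LETTER SO

utf8_bytes = SOURCE.encode("utf-8")
sjis_bytes = SOURCE.encode("shift_jis")

print("ソ".encode("utf-8").hex())      # e382bd
print("ソ".encode("shift_jis").hex())  # 835c, the second byte is "\"


def naive_string_end(data, start):
    """Find the closing quote of the single-quoted string opening at start.

    Every 0x5C byte is treated as an escape that consumes the next byte.
    Returns the index of the closing quote, or -1 if the string never ends.
    """
    i = start + 1
    while i < len(data):
        if data[i] == 0x5C:  # backslash: skip the escaped byte
            i += 2
            continue
        if data[i] == 0x27:  # unescaped single quote
            return i
        i += 1
    return -1


print(naive_string_end(utf8_bytes, utf8_bytes.index(b"'")))  # 15
print(naive_string_end(sjis_bytes, sjis_bytes.index(b"'")))  # -1
```

In the Shift-JIS case the trail byte 0x5C swallows the closing quote, so the
string runs to the end of the file, which is the parse error mentioned above.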

--
Stanislav Malyshev, Zend Software Architect
[EMAIL PROTECTED]   http://www.zend.com/
(408)253-8829   MSN: [EMAIL PROTECTED]




Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-05 Thread Antony Dovgal
On 04.03.2008 21:28, Stanislav Malyshev wrote:
 Hi!
 
 Right.
 Please take more time if needed, no need to rush and release something 
 half-working.
 If it takes several months to prepare 5.3 release, let it be so.
 
 With this approach we would never release 5.3 - each couple of months 
 somebody would have a cool idea which would only require initial commit 
 and 2-3 months work on it on CVS, which delays the release - and then it 
 goes to the next idea. We should cut it off somewhere - not because 
 these ideas are bad - they aren't, but because we have to have releases. 

Even though I do agree that delaying the release every 2-3 months is bad, 
I believe this particular case deserves some special treatment.
And btw this is a major release, not just a bugfix one, so everyone (Zend 
included) should spend even more time to make sure there are no regressions
whatsoever.

Releasing a half-working version just because we have to have releases is 
total nonsense.
So please instead of arguing with me, help Marcus and the others if 
you don't want the release postponed.

   The best idea is worth nothing for the users unless it's part of the 
 release.
 5.3 is not the last version of PHP

Making new 5.x releases each year makes no sense to me, so 5.3 seems to be 
a perfect candidate for the next several years if we want to implement
something major.

 After all, we're not a commercial company that has to roll out a release 
 every 
 couple of months under pressure of share holders and overall competition.
 
 If you think that because PHP project is not a commercial company it 
 doesn't have to adhere to the laws of markets, popularity and users 
 expectations - you are mistaken. 

These are the last things I think of.
The most important is to make it as stable as we can.

 We still have to take into account 
 millions of PHP users, even though they don't pay us money directly.

Right, and they want PHP to do its job and to do it good.

 And it's open source, which was "release often" last time I checked ;)

Wow, that's the most serious argument ever!

-- 
Wbr, 
Antony Dovgal




Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-05 Thread Jani Taskinen
On Tue, 2008-03-04 at 20:17 +0100, Hannes Magnusson wrote:
 I'll hunt you all down and make you eat 1kg of vegetables each day
 after the 5.3 release until proper documentation and upgrade guides
 have been written.

I already eat that much vegetables a day..what's my punishment? :-p
(and Pierre promised to handle the php.ini docs.. :D)

--Jani






Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-05 Thread Stanislav Malyshev

Hi!

Even though I do agree that delaying the release every 2-3 months is bad, 
I believe this particular case deserves some special treatment.


Why? We have a perfectly working parser now and no immediate need to 
replace it. I agree that the new parser is faster and better, but we are 
perfectly capable of living without it for half a year until it's 
polished, if that proves to be the situation.



Releasing a half-working version just because we have to have releases is 
total nonsense.


Fully agreed here. That's why I'm against committing new parser without 
multibyte support.


So please instead of arguing with me, help Marcus and the others if 
you don't want the release postponed.


Unfortunately, I do not know Marcus' code and may not have resources to 
help him right now. Please keep in mind that while I am happy to help 
whenever I can, I am not under obligation to help on call to any project 
as soon as anybody wants me to, just because he wants it.
That said, if somebody can and does fix new parser to support MB in 
reasonable time - I'm all for it.


Making new 5.x releases each year makes no sense to me, so 5.3 seems to be 
perfect candidate for the next several years if we want to implement something major.


What's wrong with making new 5.x releases each year if needed?


Right, and they want PHP to do its job and to do it good.


Having no multibyte support, which is used by a lot of people, does not
qualify as "do its job and to do it good". What qualifies is either 5.3 with
the old parser or 5.3 with the new parser, fully compatible. As I believe I
already explained about delaying the release etc., I won't repeat myself here.

--
Stanislav Malyshev, Zend Software Architect
[EMAIL PROTECTED]   http://www.zend.com/
(408)253-8829   MSN: [EMAIL PROTECTED]




Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-04 Thread Antony Dovgal
On 04.03.2008 12:38, Marcus Boerger wrote:
 This sounds like we are going to do the same mistake over and over and over
 again. Who is forcing a hard time line on us? Why are we late in the
 development? I don't get it at all. 

Right.
Please take more time if needed, no need to rush and release something 
half-working.
If it takes several months to prepare 5.3 release, let it be so.

After all, we're not a commercial company that has to roll out a release every 
couple of months under pressure of share holders and overall competition.

-- 
Wbr, 
Antony Dovgal




Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-04 Thread Marcus Boerger
Hello Andi,

Tuesday, March 4, 2008, 7:51:07 AM, you wrote:

 Hi Marcus, Johannes, and all,

 First of all let me say that I have no conceptual problem with replacing
 the scanner with re2c. If it's cleaner, performs better and a better
 maintained piece of software (let's hope Marcus doesn't get run over)
 then we can move to re2c.

 There are a few important things to consider though:
 - There is a huge PHP/MySQL community in the far east especially in
 Japan. You may not hear as much from them because they mostly don't post
 on our public lists but it's large. They very much depend on multibyte
 support and it works well for them (I have talked to several people in
 those communities). Shift-JIS is a matter of fact for those communities.
 We can't just dump them in PHP 5.3.
 - We need to make sure that we have a streams story that works and
 existing functionality is supported by it (sounds like this is almost
 complete so probably not high risk).
 - We should make sure we can achieve compatibility including supporting
 functionality like declare(...) which is used by some including
 multibyte guys. I haven't heard of a reason why this couldn't be
 possible with RE2C.

 I think all the above is doable but we shouldn't ship without
 accomplishing that 100% compatibility especially telling the non-Latin
 world that we will stop supporting them.

 So at the end of the day it all boils down to timing. I have been
 expecting Johannes to cut a beta any day now (I realize Sun acquisition
 somewhat postponed his schedule). PHP 5.3 is on a pretty good track to a
 good  stable release cycle. I think re-engineering a core piece of the
 engine at this point adds considerable risk and would definitely prolong
 the release cycle.

 So while I'm supportive of embracing RE2C if we get commitment to reach
 that 100% compatibility including multibyte support, I don't quite
 understand the sense of urgency and why we'd want to introduce this risk
 so late in the development of PHP 5.3. This is a risk the release
 manager shouldn't really be willing to take. Rewriting this multibyte
 support will require time and interaction with the communities that are
 currently using it to make sure that it meets their needs. It will not
 be a trivial project.

 We can definitely work towards RE2C in parallel and as Stas said the
 engine hasn't really been changing very much recently to make this hard
 (we finished our todos for 5.3). We could even branch off PHP 5.4 right
 after RC1 for PHP 5.3 and therefore reduce the time where this patch
 would need to be maintained separately (although I think it can already
 be maintained in a branch).

 Let's consider all the angles in addition to wanting to get the code in
 the tree asap.
 Andi


This sounds like we are going to make the same mistake over and over and over
again. Who is forcing a hard time line on us? Why are we late in the
development? I don't get it at all. We haven't done all the steps that were on
our radar for 5.3. Now that we finally found time to address this we should
do it. Otherwise the consequence is just that we have to do a 5.4 version
immediately. What is the reason for that, who is happier with a 5.3 now?
Are we a company that makes money by selling upgrades?

Best regards,
 Marcus





Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-04 Thread Stanislav Malyshev

Hi!


Right.
Please take more time if needed, no need to rush and release something 
half-working.
If it takes several months to prepare 5.3 release, let it be so.


With this approach we would never release 5.3 - each couple of months 
somebody would have a cool idea which would only require initial commit 
and 2-3 months work on it on CVS, which delays the release - and then it 
goes to the next idea. We should cut it off somewhere - not because 
these ideas are bad - they aren't, but because we have to have releases. 
 The best idea is worth nothing for the users unless it's part of the 
release.
5.3 is not the last version of PHP, and we have quite a bunch of stuff 
there already - so I think it makes sense to have release of what we 
have or will have soon, all while continuing to develop the ideas for 
next versions.


After all, we're not a commercial company that has to roll out a release every 
couple of months under pressure of share holders and overall competition.


If you think that because PHP project is not a commercial company it 
doesn't have to adhere to the laws of markets, popularity and users 
expectations - you are mistaken. We still have to take into account 
millions of PHP users, even though they don't pay us money directly.

And it's open source, which was "release often" last time I checked ;)
--
Stanislav Malyshev, Zend Software Architect
[EMAIL PROTECTED]   http://www.zend.com/
(408)253-8829   MSN: [EMAIL PROTECTED]




Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-04 Thread Marcus Boerger
Hello Andi,

Tuesday, March 4, 2008, 7:51:07 AM, you wrote:

 Hi Marcus, Johannes, and all,

 First of all let me say that I have no conceptual problem with replacing
 the scanner with re2c. If it's cleaner, performs better and a better
 maintained piece of software (let's hope Marcus doesn't get run over)
 then we can move to re2c.

 There are a few important things to consider though:
 - There is a huge PHP/MySQL community in the far east especially in
 Japan. You may not hear as much from them because they mostly don't post
 on our public lists but it's large. They very much depend on multibyte
 support and it works well for them (I have talked to several people in
 those communities). Shift-JIS is a matter of fact for those communities.
 We can't just dump them in PHP 5.3.
 - We need to make sure that we have a streams story that works and
 existing functionality is supported by it (sounds like this is almost
 complete so probably not high risk).
 - We should make sure we can achieve compatibility including supporting
 functionality like declare(...) which is used by some including
 multibyte guys. I haven't heard of a reason why this couldn't be
 possible with RE2C.

 I think all the above is doable but we shouldn't ship without
 accomplishing that 100% compatibility especially telling the non-Latin
 world that we will stop supporting them.

 So at the end of the day it all boils down to timing. I have been
 expecting Johannes to cut a beta any day now (I realize Sun acquisition
 somewhat postponed his schedule). PHP 5.3 is on a pretty good track to a
 good  stable release cycle. I think re-engineering a core piece of the
 engine at this point adds considerable risk and would definitely prolong
 the release cycle.

 So while I'm supportive of embracing RE2C if we get commitment to reach
 that 100% compatibility including multibyte support, I don't quite
 understand the sense of urgency and why we'd want to introduce this risk
 so late in the development of PHP 5.3. This is a risk the release
 manager shouldn't really be willing to take. Rewriting this multibyte
 support will require time and interaction with the communities that are
 currently using it to make sure that it meets their needs. It will not
 be a trivial project.

 We can definitely work towards RE2C in parallel and as Stas said the
 engine hasn't really been changing very much recently to make this hard
 (we finished our todos for 5.3). We could even branch off PHP 5.4 right
 after RC1 for PHP 5.3 and therefore reduce the time where this patch
 would need to be maintained separately (although I think it can already
 be maintained in a branch).

 Let's consider all the angles in addition to wanting to get the code in
 the tree asap.
 Andi



Give me any reason why we need 5.4 at this point?
Any single one?
Are you having a bet or a deal about 5.3 release date?
And what is the deal, you do whatever you think goes in and that's a law?

Best regards,
 Marcus





Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-04 Thread Stanislav Malyshev

Hi!


Improving on that statement: The coolest feature ever is worth
absolutely nothing unless it is documented.


I agree with the intent - documentation is *very* important. Even 
so, people use undocumented features too (probably cursing the lazy 
developers on the way ;)


BTW, as far as I remember, we have at least 4 undocumented features 
right now sitting in 5.3 CVS, so if anybody wants to do something cool, 
that's a good place:

- Nowdocs aren't documented
- .htaccess-like .ini files undocumented
- [HOST=] and [PATH=] .ini sections undocumented
- new version constants undocumented
--
Stanislav Malyshev, Zend Software Architect
[EMAIL PROTECTED]   http://www.zend.com/
(408)253-8829   MSN: [EMAIL PROTECTED]




RE: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-04 Thread Andi Gutmans
 -Original Message-
 From: Hannes Magnusson [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, March 04, 2008 11:18 AM
 To: Stas Malyshev
 Cc: Antony Dovgal; Marcus Boerger; Andi Gutmans;
 internals@lists.php.net
 Subject: Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an
 re2c [1] based lexer
 
 Improving on that statement: The coolest feature ever is worth
 absolutely nothing unless it is documented.
 
 Don't care if it's a new language construct, new class, function or
 method, optional parameter, new syntax in php.ini, errorlevel, dropped
 warnings or an awesome --enable-zend-multibyte configure switch. If it
 isn't documented it's totally useless for anyone not reading php-cvs,
 zend-engine-cvs and this list daily.
 
 I'll hunt you all down and make you eat 1kg of vegetables each day
 after the 5.3 release until proper documentation and upgrade guides
 have been written.
 Mark my words my friends, mark my words! ;)
 

Why do you say it's not documented?
http://www.aconus.com/~oyaji/www/apache_linux_php.htm
http://tinyurl.com/2o8pq2

OK just kidding and I agree it would be nice to have it better
documented in the mainstream docs. As it applies mostly to the Asian
users though (Chinese/Japanese) who usually seek localized docs it's
probably not as good as it should be in php.net.

Andi




Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-04 Thread Hannes Magnusson
On Tue, Mar 4, 2008 at 8:38 PM, Andi Gutmans [EMAIL PROTECTED] wrote:
  OK just kidding and I agree it would be nice to have it better
  documented in the mainstream docs. As it applies mostly to the Asian
  users though (Chinese/Japanese) who usually seek localized docs it's
  probably not as good as it should be in php.net.

The Japanese docs are 100% up-to-date with the English docs so they
shouldn't have any problem reading our docs.
In fact, if you make changes in the en/ tree Takagi Masahiro will have
them translated within 24 hours - even if that change spanned 50 files.
Not kidding.

-Hannes




Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-04 Thread Hannes Magnusson
On Tue, Mar 4, 2008 at 8:38 PM, Andi Gutmans [EMAIL PROTECTED] wrote:
  Why do you say it's not documented?
  http://www.aconus.com/~oyaji/www/apache_linux_php.htm
  http://tinyurl.com/2o8pq2

According to the latter link, our windows binaries don't enable
zend-multibyte, is this true?

-Hannes




Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-03 Thread Derick Rethans
On Sun, 2 Mar 2008, Marcus Boerger wrote:

 However, we had to drop multibyte support as well as the encoding 
 declare.

Just wondering, why did you have to drop the declare(encoding=...) ? 
It's just ignored in PHP 5.x - and it is useful to have for migrating 
php 5.3 apps to 6. So can you at least make the new parser just ignore 
this statement?

regards,
Derick

-- 
Derick Rethans
http://derickrethans.nl | http://ezcomponents.org | http://xdebug.org




Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-03 Thread Johannes Schlüter
Hi Derick,

On Mon, 2008-03-03 at 09:28 +0100, Derick Rethans wrote:
 On Sun, 2 Mar 2008, Marcus Boerger wrote:
 
  However, we had to drop multibyte support as well as the encoding 
  declare.
 
 Just wondering, why did you have to drop the declare(encoding=...) ? 
 It's just ignored in PHP 5.x - and it is useful to have for migrating 
 php 5.3 apps to 6. So can you at least make the new parser just ignore 
 this statement?

It is not ignored in PHP 5 as Marcus and I found out while reading the
code :-)
If you compile with --enable-zend-multibyte you can change the encoding
using declare, even multiple times per file it seems.

johannes






Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-03 Thread Marcus Boerger
Hello Derick,

  actually you get a message (E_COMPILE_WARNING) that this is not
supported. Maybe we could turn this into an E_NOTICE though.

marcus

Monday, March 3, 2008, 9:28:01 AM, you wrote:

 On Sun, 2 Mar 2008, Marcus Boerger wrote:

 However, we had to drop multibyte support as well as the encoding 
 declare.

 Just wondering, why did you have to drop the declare(encoding=...) ? 
 It's just ignored in PHP 5.x - and it is useful to have for migrating 
 php 5.3 apps to 6. So can you at least make the new parser just ignore 
 this statement?

 regards,
 Derick

 -- 
 Derick Rethans
 http://derickrethans.nl | http://ezcomponents.org | http://xdebug.org




Best regards,
 Marcus





Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-03 Thread Derick Rethans
On Mon, 3 Mar 2008, Marcus Boerger wrote:

   actually you get a message (E_COMPILE_WARNING) that this is not
 supported. Maybe we could turn this into an E_NOTICE though.

No, I don't get any warning/notice/ whatever with PHP 5.3:

[EMAIL PROTECTED]:~$ php-5.3dev -derror_reporting=65535

<?php
declare(encoding="utf-8");
echo "foo\n";
?>

foo


Please don't break this.

regards,
Derick

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-03 Thread Marcus Boerger
Hello Stanislav,

Monday, March 3, 2008, 5:39:35 AM, you wrote:

 Hi!

 Were the stream support issues solved?
 
 We completely dropped multibyte support. The reason is that the way we were

 I wasn't asking about multibyte (that we discuss below), but about other 
 streams - I think I mentioned it on IRC last time re2c parser was 
 discussed. I remember re2c used mmap, and not all files PHP can run can 
 be mmap'ed. Was it fixed?

Ah, you didn't write that so I got confused. Anyway, what we are doing is
the following order:
1) If mmap is supported, then use it
2) If mmap is not supported or does not work then read the whole stream
3) If that is not possible read char by char

The flex based scanner reads in smaller chunks or char by char, so it is
more or less always like case 3.
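
The fallback order above can be sketched like this. It is written in Python
for brevity and is not the actual Zend C code. The name load_script and the
returned strategy label are hypothetical, and copying the mapping out with
m[:] is only done to keep the example short; a real scanner would work on
the mapping directly.

```python
import io
import mmap


def load_script(stream):
    """Return (strategy, data) using the first approach that works."""
    # 1) If the stream is backed by a mappable file, use mmap.
    try:
        with mmap.mmap(stream.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return "mmap", m[:]
    except (AttributeError, OSError, ValueError):
        pass  # no file descriptor, not mappable, or empty file

    # 2) Otherwise read the whole stream in one go.
    try:
        return "read-all", stream.read()
    except (OSError, ValueError):
        pass

    # 3) As a last resort read byte by byte.
    chunks = []
    while True:
        byte = stream.read(1)
        if not byte:
            break
        chunks.append(byte)
    return "byte-by-byte", b"".join(chunks)


strategy, data = load_script(io.BytesIO(b"<?php echo 1; ?>"))
print(strategy)  # read-all
```

An in-memory stream has no file descriptor, so the example falls through to
the second case; a regular, non-empty file would take the first.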

 Once we have finished the move to re2c, we can support all of those
 correctly. The multibyte support also duplicated the encoding tables
 otherwise available in ext/mbstring or ext/iconv or pecl/intl.

 pecl/intl per se doesn't have any encoding tables. ICU does, but that 
 would mean you have to have ICU to run PHP. That might not be a big 
 problem since ICU is supported by IBM (read: good chance more exotic 
 systems would have support) it is still dependency on non-bundled 3rd 
 party library in PHP 5 core. Of course, PHP 6 has this dependency, but 
 we might want to not have such things in 5.x so that you won't have to 
 change your system too much while staying on 5.x.

Are you saying we cannot depend on ICU in PHP 6 and have to redo it
completely or what?

 Rely on a not supported undocumented feature? I am rather able to build php
 and rewrite that support.

 Being undocumented is nothing to be proud of, however as poorly 
 documented as it is, it is used. I'm all for implementing it in a better 
 way - and having new parser is a good time to do it. That's exactly the 
 reason we shouldn't rush with it but do it right this time. There's no 
 burning need to have a new parser right now, so we can have some moment 
 to think - ok, how we want multibyte support there to work? And if we 
 might need some modifications, we'd have time and flexibility to do it, 
 not having the code in 5.3 which was supposed to go in RC in Q1 (ending 
 1 month from now).

 You are free to contribute and make MB support working upfront.

 I know I'm free :) However, as much as I understand the eagerness of 
 having it in the source tree, I repeat that I do not think dropping 
 multibyte support in 5.3 is acceptable. Thus, if it is committed right 
 now, 5.3 would have to be deferred until this is resolved. If this is 
 resolved timely for 5.3 - great. If not, we better get it in 5.4 right 
 than in 5.3 wrong.

I don't see a problem with redoing multibyte support in a usable way.
Actually, we had better redo it anyway, because the current solution is a
very bad one: it duplicates the input and uses a flattening filter so that
the scanner always sees an eight-bit input stream. Then, when something
needs to get pushed to the output, we recalculate the position in the
original input and use that part. With re2c there is a much simpler
solution. When requested, or when detected via a BOM, we switch to a second
version of the scanner that works on unsigned int and supports the full
Unicode character set (the only thing to do for re2c is to switch the input
type and, guess what, this is already in production on a lot of systems).

Other approaches are to natively support UTF-8 and UTF-16 besides 8 bit
and UTF-32. Furthermore, we can apply any kind of filtering correctly on
top of the UTF-* scanner.
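
To illustrate the idea of switching on a BOM and then scanning full-width
code units, here is a small sketch in Python. It is not re2c or Zend code;
to_code_points() and its `declared` parameter are hypothetical names, and
falling back to latin-1 as the plain 8-bit case is an assumption of the
sketch.

```python
import codecs

# The 4-byte UTF-32 marks are tested before the UTF-16 marks, because the
# UTF-32-LE BOM begins with the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def to_code_points(raw, declared=None):
    """Decode raw script bytes into a list of integer code points."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            # BOM found: strip it and decode with the matching encoding.
            return [ord(ch) for ch in raw[len(bom):].decode(name)]
    # No BOM: use the declared encoding, or treat the input as plain 8-bit.
    return [ord(ch) for ch in raw.decode(declared or "latin-1")]
```

A scanner working on such a list sees one integer per character, so a byte
that happens to look like a backslash inside a multibyte sequence never
reaches it as a separate unit.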

I know there is some work left, but if we do not do that work now, then
we basically have two engines. In that case I'll just rewrite the engine
completely and replace it.

Best regards,
 Marcus


-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-03 Thread Marcus Boerger
Hello Alan,

  be my hero then :-) Could you generate a few tests for the multibyte
support so that we know how it is used right now and what we need to take
care of?

marcus

Monday, March 3, 2008, 12:48:44 AM, you wrote:

 Can you clarify the Multibyte issues:
 - I presume this means that it can handle ASCII/UTF8/16 etc. but will 
 not handle things like BIG5/GB encoding in source code - this may be a 
 bit of an issue around here..

 Regards
 Alan


 Marcus Boerger wrote:
 RFC: REPLACE THE FLEX-BASED SCANNER WITH AN RE2C [1] BASED LEXER

 Situation:
 The current flex-based lexer depends on an outdated and unsupported flex
 version. Alternatives include either updating to a newer version of flex or
 using re2c, which we already use for a variety of things (serializing, pdo sql
 scanning, date/time parsing). While moving towards a newer flex version would
 be much easier, switching to re2c promises a much faster lexer. Actually,
 without any specific re2c optimizations we already get around a 20% scanner
 performance increase. Running the tests gets an overall speedup of 2%. It is
 arguable whether this is enough, but re2c has more advantages. First of all,
 re2c allows one to scan any type of input (ASCII, UTF-8, UTF-16, UTF-32).
 Secondly, it allows for better integration with Lemon [2], which would be the
 next step. And thirdly we can switch to a reentrant scanner.

 Current state:
 Flex has been fully replaced by re2c in Zend. We have also switched to an
 mmap-based lexer approach for now. However, we had to drop multibyte support
 as well as the encoding declare. The current state can be checked out from
 Scott's subversion repository [3] and you can follow the development on his
 Trac setup [4]. When you want to build php with re2c, then you need to grab
 re2c from its sourceforge subversion repository [5]. You can also check out
 the changes in a patch created Sunday 2nd March against a PHP checkout from 
 14th February [6].

 Further steps:
 Commit this to PHP 5.3. Synch to HEAD. Add pecl/intl to 5.3. Discuss/recreate
 multibyte support with libintl.

 Future steps:
 Replace bison with lemon in PHP 5.4 or HEAD.

 Time Frame:
 Commit to 5.3 between the 5th and the 15th of March. Synch to HEAD a couple
 of days later. Moving pecl/libintl to ext (depends on the 5.3 RMs decision).
 After that is done, decide about multibyte support. Along with the commit to
 the 5.3 branch there will be a new re2c version available.


 Marcus Boerger
 Nuno Lopes
 Scott MacVicar


 [1] http://re2c.org/
 [2] http://www.hwaci.com/sw/lemon/
 [3] svn://whisky.macvicar.net/php-re2c
 [4] http://trac.macvicar.net/php-re2c/
 [5] https://re2c.svn.sourceforge.net/svnroot/re2c/trunk/re2c
 [6] http://php.net/~helly/php-re2c-20080302.diff.txt



   





Best regards,
 Marcus





Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-03 Thread Marcus Boerger
Hello Derick,

  ok, for now I changed to not issue any error at all.

marcus

Monday, March 3, 2008, 11:28:31 AM, you wrote:

 On Mon, 3 Mar 2008, Marcus Boerger wrote:

   actually you get a message (E_COMPILE_WARNING) that this is not
 supported. Maybe we could turn this into an E_NOTICE though.

 No, I don't get any warning/notice/ whatever with PHP 5.3:

 [EMAIL PROTECTED]:~$ php-5.3dev -derror_reporting=65535

 <?php
 declare(encoding="utf-8");
 echo "foo\n";
 ?>

 foo


 Please don't break this.

 regards,
 Derick




Best regards,
 Marcus





Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-03 Thread Lukas Kahwe Smith


On 03.03.2008, at 00:48, Alan Knowles wrote:


Can you clarify the Multibyte issues:
- I presume this means that it can handle ASCII/UTF8/16 etc. but  
will not handle things like BIG5/GB encoding in source code - this  
may be a bit of an issue around here..




At first I also thought that this had something to do with ext/mbstring,
but since then I have learned that this is not the case. However, this
confusion is likely what causes many people to enable zend mb support. So
the question to Stas (Alan and the rest of the world) is whether they
really have a script in the wild that actually requires this switch and
would break if it were disabled. And if there is such a script, what
exactly are the needs, and how can these be met in 5.3 using re2c?


regards,
Lukas




Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-03 Thread Johannes Schlüter
Hi,

On Sun, 2008-03-02 at 14:47 -0800, Stanislav Malyshev wrote:
 Hi!
 
  be much easier, switching to re2c promises a much faster lexer. Actually,
  without any specific re2c optimizations we already get around a 20% scanner
 
 I think 20% faster is very cool.
 However, as I understand re2c is not a standard tool found everywhere. 
 So what happens if you wanted to use it on some exotic system where re2c 
 is not readily available as maintainer-supported software? Also, flex 
 is available on Windows for example as part of cygwin, while I don't see 
 re2c there.
 I understand this can be of low importance since we keep generated files 
 in our repositories, but I think we still have to keep it in mind.
 I understand also current patch requires non-release version of re2c - 
 maybe we should have some release version at least until we make PHP 
 depend on it?

We need a change there anyways, flex 2.5.4 is bundled with fewer systems,
even my Solaris 20 box has 2.5.33 instead of 2.5.4 by default. And I
think changing to something which is maintained by one of our main
contributors might be beneficial for us.

 Note - pecl/intl does nothing towards multibyte support etc., at least 
 for now. If there are volunteers to change that, it can be discussed, but 
 so far it is for doing entirely other things (locale-dependent 
 functionality mostly).
 So, I think before re2c parser can be merged the issue with multibyte 
 compatibility must be solved - otherwise it will make the users that 
 rely on it unable to use newer PHP. As cool as 20% faster is, I think we 
 can't drop support for such feature, especially not in 5.3.
 
  Commit to 5.3 between the 5th and the 15th of March. Synch to HEAD a couple
  of days later. Moving pecl/libintl to ext (depends on the 5.3 RMs decision).
  After that is done, decide about multibyte support. Along with the commit to
  the 5.3 branch there will be a new re2c version available.
 
 I think we first need to figure out what happens to multibyte support, 
 and not commit anything before we have it figured out. Multibyte support 
 is important piece of functionality for some PHP users, and it works 
 now. Breaking it without providing any alternative - especially that we 
 have now 5.3 mostly ready for the release cycle, and solving multibyte 
 problems with re2c may take undefined amount of time, as far as I 
 understand. I do not think it would be acceptable to release 5.3 without 
 multibyte support, so the option here either merge it now and have 5.3 
 waiting until MB is figured out, or try to figure it out before commit 
 and if we can't in a reasonable term, go forward with 5.3 and defer the 
 parser change for 5.4.

Since there's no documentation about the zend-multibyte stuff, I spent some
time searching for other resources about it, but except for bug reports I
found nothing where it was required. I'm sure there are some, but comments
like "TODO: support widechars" in the code give me the impression that
it doesn't really work... and I guess many people just enable it since it
sounds important, not because they really need it. Of course
I might be wrong, so I'd be interested in use cases for the
--enable-zend-multibyte stuff. Maybe we can fulfill the needs without
the switch.

If there are good use cases for that switch, I wouldn't like to replace some
small engine thingy with a huge external library like ICU.

And I doubt that more than just a few people know what it really does -
Marcus and I just found out while working on that stuff over the
weekend.

 Again, while I think the speedup is great and congratulate Marcus, Nuno 
 and Scott on great work, I think we should keep in mind we have working 
 parser right now and changing it in an incompatible way is very 
 high-risk and should not be taken hastily.

Right, it's great work they did there but a broken scanner would be one
of the worst things we might ship. So I'd invite everybody to checkout
that version from SVN (see Marcus's mail) and test it using the worst
stuff you can think of :-)

johannes





Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-03 Thread Alan Knowles

A few replaces with this file should make a good testcase.
Probably worth testing:
* comments with these characters in them, both /* and //
* strings with these characters in them.
lynx -source 
'http://smontagu.damowmow.com/genEncodingTest.cgi?family=windows&codepage=950'  
| grep test | grep -v testcase


I have definitely seen code with Chinese characters in comments and 
strings, and a few times function names and variable names with Chinese 
characters...


Regards
Alan






Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-03 Thread Stanislav Malyshev

Hi!


Since there's no documentation about zend-multibyte stuff I spent some
time searching for other resources about it, but except bug reports I
found nothing where it was required. I'm sure there are some but comments
like "TODO: support widechars" in the code give me the impression that
it doesn't really work... and I guess many people just enable it since it


It does work and there are people using it, even though I imagine it can 
have some bugs. I guess it would be best to talk to mbstring maintainer 
on code details, etc.



If there are good use cases for that switch I won't like to replace some
small engine thingy with a huge external library like ICU.


The use cases are scripts written in encodings like Shift-JIS, etc.
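
To make the Shift-JIS case concrete: the Katakana letter SO (U+30BD) encodes
in Shift-JIS as the two bytes 0x83 0x5C, and 0x5C is the ASCII backslash, so
a byte-oriented scanner sees an escape character in the middle of a string.
The sketch below (Python, with the hypothetical helper name
backslash_trail_chars()) lists the characters of a text that have this
property in a given encoding.

```python
def backslash_trail_chars(text, encoding):
    """Return the characters of `text` whose multibyte encoding contains
    byte 0x5C (ASCII backslash) after the lead byte."""
    hits = []
    for ch in text:
        encoded = ch.encode(encoding)
        # Only multibyte sequences matter; a literal backslash is one byte.
        if len(encoded) > 1 and 0x5C in encoded[1:]:
            hits.append(ch)
    return hits
```

Running it over the string literals and comments of a script would show
which characters a naive 8-bit scan could misread; in UTF-8 the result is
always empty, since continuation bytes there are never below 0x80.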


And I doubt that more than just a few people know what it really does -
Marcus and I just found out while working on that stuff over the
weekend.


So I guess documentation is important :) Let it be a lesson to us all.
--
Stanislav Malyshev, Zend Software Architect
[EMAIL PROTECTED]   http://www.zend.com/
(408)253-8829   MSN: [EMAIL PROTECTED]




Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-03 Thread Pierre Joye
Hi,

On Mon, Mar 3, 2008 at 7:59 PM, Stanislav Malyshev [EMAIL PROTECTED] wrote:

  Just curious who you were answering to... Anyway, to be clear:
  1. PHP 6 is major version with its major feature being Unicode support.
  2. PHP 5.x is same-major branch, where you are not expected to have to
  change your system in order to upgrade.
  3. We do not expect people to take PHP 6 and have absolutely everything
  work instantly from PHP 5. We try to minimize upgrade path, but major
  version upgrades can take some adjustments.
  4. We expect people to upgrade from 5.2.x to 5.3.x without changing
  their systems.

  Is it clearer why I think PHP 5.x and 6 are different and why I think
  ICU dependency in the 5.3 core might be a problem?

It is clearer, but it is not a problem. New features may introduce new
dependencies. Having a dependency on libicu is fine while we introduce intl
and other features related to Unicode or i18n. I would agree if we
were talking about 5.2.x.

-- 
Pierre
http://blog.thepimp.net | http://www.libgd.org




Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-03 Thread Stanislav Malyshev

Hi!


It is clearer but it is not a problem. New features may introduce new
dependencies. Having a dependency on libicu while we introduce intl
and other features related to unicode or i18n. I would agree if we
were talking about 5.2.x.


pecl/intl is an extension; it's no surprise that you need an external 
library when you enable an extension. However, adding a dependency in core 
that you cannot get rid of has a lot of consequences (think distributions, 
builds on non-Linux systems, etc., etc.).

--
Stanislav Malyshev, Zend Software Architect
[EMAIL PROTECTED]   http://www.zend.com/
(408)253-8829   MSN: [EMAIL PROTECTED]




Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-03 Thread Pierre Joye
On Mon, Mar 3, 2008 at 8:48 PM, Stanislav Malyshev [EMAIL PROTECTED] wrote:
 Hi!


   It is clearer but it is not a problem. New features may introduce new
   dependencies. Having a dependency on libicu while we introduce intl
   and other features related to unicode or i18n. I would agree if we
   were talking about 5.2.x.

  pecl/intl is an extension, there's no surprise that you need external
  library when you enable extension. However, adding dependency in core
  that you can not rid of has a lot of consequences (think distributions,
  builds on non-Linux systems, etc., etc.).

intl (and related changes) is almost the only reason why one will upgrade to
5.3.x. There is no core (as in zend engine) for 95% of our users.
There is a PHP release with default features which can be relied on.
That's my feeling and experience on this topic.

That being said, ICU is so common these days, I really don't see a
problem with having it as a dependency. If we were asking for some esoteric
library, I would worry more, obviously :)

-- 
Pierre
http://blog.thepimp.net | http://www.libgd.org




Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-03 Thread Derick Rethans
On Mon, 3 Mar 2008, Stanislav Malyshev wrote:

 4. We expect people to upgrade from 5.2.x to 5.3.x without changing their
 systems.
 
 Is it clearer why I think PHP 5.x and 6 are different and why I think ICU
 dependency in the 5.3 core might be a problem?

FWIW... I also think that bringing in ICU in 5.3 so late in the cycle 
- or actually at all in 5.3 - is not such a bright idea.

regards,
Derick

-- 
Derick Rethans
http://derickrethans.nl | http://ezcomponents.org | http://xdebug.org




Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-03 Thread Steph Fox

Is it clearer why I think PHP 5.x and 6 are different and why I think ICU
dependency in the 5.3 core might be a problem?


FWIW... I also think that bringing in ICU in 5.3 so late in the cycle
- or actually at all in 5.3 - is not such a bright idea.


'so late in the cycle'? We haven't had a beta rc yet. I agree intl should've 
been moved into core several weeks ago if that helps any...


- Steph





Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-03 Thread Steph Fox

No one was considering any such move. Having pecl/intl shipped per default
as symlinked into ext would be as much optional as --enable-zend-multibyte
or --enable-mbstring are right now. This will be more like bringing in zip
to 5.2. However it is completely off-topic as it is just one possible
course of action while the other is to stick with mbstring.


Intl and mbstring don't share anything like the same functionality...

- Steph 






Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-03 Thread Marcus Boerger
Hello Pierre,

Monday, March 3, 2008, 9:31:37 PM, you wrote:

 Hi Marcus,

 On Mon, Mar 3, 2008 at 9:16 PM, Marcus Boerger [EMAIL PROTECTED] wrote:
 Hello Stanislav,


  Monday, March 3, 2008, 8:48:38 PM, you wrote:

   Hi!

   It is clearer but it is not a problem. New features may introduce new
   dependencies. Having a dependency on libicu while we introduce intl
   and other features related to unicode or i18n. I would agree if we
   were talking about 5.2.x.

 Bad example, it is not symlinked :)

 And heh, it would be time to give a break with your zip rant, hmmk? =)

Sorry, this wasn't meant at all as a rant. It is just a recent example
where a new extension brought in a new dependency. Though you come with a
bundled one so it actually should have looked for a better one.

Best regards,
 Marcus





Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-03 Thread Stanislav Malyshev

Hi!


intl (and related changes) is almost the only why one will upgrade to
5.3.x. There is no core (as in zend engine) for 95% of our users.


From NEWS:
- Added and improved PHP syntax and semantics:
  . Added NOWDOC. (Gwynne Raskind, Stas, Dmitry)
  . Added ?: operator. (Marcus)
  . Added support for namespaces. (Dmitry, Stas, Gregory)
  . Added support for Late Static Binding. (Dmitry, Etienne Kneuss)
  . Added support for __callstatic() magic method. (Sara)
  . Added support for dynamic access of static members using $foo::myFunc().
    (Etienne Kneuss)
  . Improved checks for callbacks. (Marcus)
And that's not counting extension stuff. I of course value a lot the 
importance given to intl, but 5.3 IMHO is juicier than just intl :)

--
Stanislav Malyshev, Zend Software Architect
[EMAIL PROTECTED]   http://www.zend.com/
(408)253-8829   MSN: [EMAIL PROTECTED]




Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-02 Thread Stanislav Malyshev

Hi!


be much easier, switching to re2c promises a much faster lexer. Actually,
without any specific re2c optimizations we already get around a 20% scanner


I think 20% faster is very cool.
However, as I understand re2c is not a standard tool found everywhere. 
So what happens if you wanted to use it on some exotic system where re2c 
is not readily available as maintainer-supported software? Also, flex 
is available on Windows for example as part of cygwin, while I don't see 
re2c there.
I understand this can be of low importance since we keep generated files 
in our repositories, but I think we still have to keep it in mind.
I understand also current patch requires non-release version of re2c - 
maybe we should have some release version at least until we make PHP 
depend on it?



Current state:
Flex has been fully replaced by re2c in Zend. We have also switched to an
mmap-based lexer approach for now. However, we had to drop multibyte support


Were the stream support issues solved?


as well as the encoding declare. The current state can be checked out from
Scott's subversion repository [3] and you can follow the development on his
Trac setup [4]. When you want to build php with re2c, then you need to grab
re2c from its sourceforge subversion repository [5]. You can also check out
the changes in a patch created Sunday 2nd March against a PHP checkout from 
14th February [6].


Further steps:
Commit this to PHP 5.3. Synch to HEAD. Add pecl/intl to 5.3. Discuss/recreate
multibyte support with libintl.


Note - pecl/intl does nothing towards multibyte support etc., at least 
for now. If there are volunteers to change that, it can be discussed, but 
so far it is for doing entirely other things (locale-dependent 
functionality mostly).
So, I think before re2c parser can be merged the issue with multibyte 
compatibility must be solved - otherwise it will make the users that 
rely on it unable to use newer PHP. As cool as 20% faster is, I think we 
can't drop support for such feature, especially not in 5.3.



Commit to 5.3 between the 5th and the 15th of March. Synch to HEAD a couple
of days later. Moving pecl/libintl to ext (depends on the 5.3 RMs decision).
After that is done, decide about multibyte support. Along with the commit to
the 5.3 branch there will be a new re2c version available.


I think we first need to figure out what happens to multibyte support, 
and not commit anything before we have it figured out. Multibyte support 
is an important piece of functionality for some PHP users, and it works 
now. Breaking it without providing any alternative is a problem, especially 
since we now have 5.3 mostly ready for the release cycle, and solving the 
multibyte problems with re2c may take an undefined amount of time, as far 
as I understand. I do not think it would be acceptable to release 5.3 
without multibyte support, so the options here are either to merge it now 
and have 5.3 wait until MB is figured out, or to try to figure it out 
before the commit and, if we can't within a reasonable time, go forward 
with 5.3 and defer the parser change to 5.4.


Again, while I think the speedup is great and congratulate Marcus, Nuno 
and Scott on great work, I think we should keep in mind we have a working 
parser right now, and changing it in an incompatible way is very 
high-risk and should not be done hastily.

--
Stanislav Malyshev, Zend Software Architect
[EMAIL PROTECTED]   http://www.zend.com/
(408)253-8829   MSN: [EMAIL PROTECTED]




Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-02 Thread Rasmus Lerdorf

Stanislav Malyshev wrote:

Hi!

be much easier, switching to re2c promises a much faster lexer. Actually,
without any specific re2c optimizations we already get around a 20% scanner


I think 20% faster is very cool.
However, as I understand re2c is not a standard tool found everywhere. 
So what happens if you wanted to use it on some exotic system where 
re2c is not readily available as maintainer-supported software? Also, 
flex is available on Windows for example as part of cygwin, while I 
don't see re2c there.

I don't think this part is a concern since we have required re2c for 
quite a while now to build many critical parts of PHP.  People who 
actually need to regenerate the parser files are the same people for 
whom it is trivial to figure out how to install re2c.  And yes, it would 
of course be good to use a released version of re2c, but I think by the 
time 5.3 is ready to go the version of re2c we need will be out there.  
Since it is Marcus' baby, he can just push it out, I don't think this is 
a stumbling block either.  Some of the new stuff in re2c was 
specifically added to make it easier to write a PHP parser, so I don't 
think backporting to an older version is really an option.


-Rasmus




Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-02 Thread Marcus Boerger
Hello Stanislav,

Sunday, March 2, 2008, 11:47:57 PM, you wrote:

 Hi!

 be much easier, switching to re2c promises a much faster lexer. Actually,
 without any specific re2c optimizations we already get around a 20% scanner

 I think 20% faster is very cool.
 However, as I understand re2c is not a standard tool found everywhere. 
 So what happens if you wanted to use it on some exotic system where re2c 
 is not readily available as maintainer-supported software? Also, flex 
 is available on Windows for example as part of cygwin, while I don't see 
 re2c there.
 I understand this can be of low importance since we keep generated files 
 in our repositories, but I think we still have to keep it in mind.
 I understand also current patch requires non-release version of re2c - 
 maybe we should have some release version at least until we make PHP 
 depend on it?

Well, re2c works on a very large number of systems, can easily be built,
and comes with a ready-to-download Windows executable. Furthermore, all major
distributions have re2c packages. Along with storing the generated files in
CVS, I see no issue at all in this regard.

 Current state:
 Flex has been fully replaced by re2c in Zend. We have also switched to an
 mmap-based lexer approach for now. However, we had to drop multibyte support

 Were the stream support issues solved?

We completely dropped multibyte support. The reason is that the old
approach constantly switched between the full original and a recoded
duplicate that simply ignores multibyte (or any encoding at all).
Once we have finished the move to re2c, we can support all of those
correctly. The multibyte support also duplicated the encoding tables
otherwise available in ext/mbstring or ext/iconv or pecl/intl.

 as well as the encoding declare. The current state can be checked out from
 Scott's subversion repository [3] and you can follow the development on his
 Trac setup [4]. When you want to build php with re2c, then you need to grab
 re2c from its sourceforge subversion repository [5]. You can also check out
 the changes in a patch created Sunday 2nd March against a PHP checkout from 
 14th February [6].
 
 Further steps:
 Commit this to PHP 5.3. Synch to HEAD. Add pecl/intl to 5.3. Discuss/recreate
 multibyte support with libintl.

 Note - pecl/intl does nothing towards multibyte support etc., at least 
 for now. If there are volunteers to change that, it can be discussed, but 
 so far it is for doing entirely other things (locale-dependent 
 functionality mostly).

Yes I know. However pecl/intl brings in a php/icu bridge which we can build
on.

 So, I think before re2c parser can be merged the issue with multibyte 
 compatibility must be solved - otherwise it will make the users that 
 rely on it unable to use newer PHP. As cool as 20% faster is, I think we 
 can't drop support for such feature, especially not in 5.3.

Rely on an unsupported, undocumented feature? I am rather able to build php
and rewrite that support.

 Commit to 5.3 between the 5th and the 15th of March. Synch to HEAD a couple
 of days later. Moving pecl/libintl to ext (depends on the 5.3 RMs decision).
 After that is done, decide about multibyte support. Along with the commit to
 the 5.3 branch there will be a new re2c version available.

 I think we first need to figure out what happens to multibyte support, 
 and not commit anything before we have it figured out. Multibyte support 
 is an important piece of functionality for some PHP users, and it works 
 now. Breaking it without providing any alternative is a problem, 
 especially now that 5.3 is mostly ready for the release cycle and solving 
 the multibyte problems with re2c may take an undefined amount of time, as 
 far as I understand. I do not think it would be acceptable to release 5.3 
 without multibyte support, so the options here are either to merge it now 
 and have 5.3 wait until MB is figured out, or to try to figure it out 
 before commit and, if we can't in a reasonable time, go forward with 5.3 
 and defer the parser change to 5.4.

 Again, while I think the speedup is great and congratulate Marcus, Nuno 
 and Scott on great work, I think we should keep in mind we have a working 
 parser right now and changing it in an incompatible way is very 
 high-risk and should not be done hastily.

You are free to contribute and get MB support working up front.

Best regards,
 Marcus





Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-02 Thread Marcus Boerger
Hello Rasmus,

Monday, March 3, 2008, 12:25:52 AM, you wrote:

 Stanislav Malyshev wrote:
 Hi!

 be much easier, switching to re2c promises a much faster lexer. Actually,
 without any specific re2c optimizations we already get around a 20% scanner

 I think 20% faster is very cool.
 However, as I understand re2c is not a standard tool found everywhere. 
 So what happens if you wanted to use it on some exotic system where 
 re2c is not readily available as maintainer-supported software? Also, 
 flex is available on Windows for example as part of cygwin, while I 
 don't see re2c there.

 I don't think this part is a concern since we have required re2c for 
 quite a while now to build many critical parts of PHP.  People who 
 actually need to regenerate the parser files are the same people for 
 whom it is trivial to figure out how to install re2c.  And yes, it would 
 of course be good to use a released version of re2c, but I think by the 
 time 5.3 is ready to go the version of re2c we need will be out there.  
 Since it is Marcus' baby, he can just push it out, I don't think this is 
 a stumbling block either.  Some of the new stuff in re2c was 
 specifically added to make it easier to write a PHP parser, so I don't 
 think backporting to an older version is really an option.

Right. The current re2c development cycle is dedicated solely to making it
possible to rewrite the PHP scanners. I will update re2c whenever necessary
during the remaining development cycle and publish a new stable release
before we release PHP 5.3.

Best regards,
 Marcus





Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-02 Thread Pierre Joye
Hi Stan,

On Sun, Mar 2, 2008 at 11:47 PM, Stanislav Malyshev wrote:
 Hi!


   be much easier, switching to re2c promises a much faster lexer. Actually,
   without any specific re2c optimizations we already get around a 20% scanner

  I think 20% faster is very cool.
  However, as I understand re2c is not a standard tool found everywhere.
  So what happens if you wanted to use it on some exotic system where re2c
  is not readily available as maintainer-supported software? Also, flex
  is available on Windows for example as part of cygwin, while I don't see
  re2c there.

A quick note about this non-problem: re2c works pretty well on Windows,
and they provide a .exe as far as I remember (much easier than flex,
which requires cygwin or gnuwin32, even if both work :). Besides the
portability of re2c, we already use it in some extensions (if I
remember correctly) and nobody has complained.

Cheers,
-- 
Pierre
http://blog.thepimp.net | http://www.libgd.org




Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-02 Thread Alan Knowles

Can you clarify the multibyte issues?
- I presume this means that it can handle ASCII/UTF-8/UTF-16 etc. but will 
not handle things like BIG5/GB encoding in source code - this may be a 
bit of an issue around here.
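
To make the concern concrete: in some legacy double-byte encodings the
second byte of a character can be 0x5C, the ASCII backslash, so a scanner
that reads the source byte by byte may see an escape character inside a
string literal. The Python sketch below is illustrative only and is not
part of the re2c patch; the helper name backslash_trail_chars is made up
for this example. It reports which characters of a string contain 0x5C
after their first byte in a given encoding.

```python
def backslash_trail_chars(text, encoding):
    """Return the characters of `text` whose multibyte encoding contains
    the byte 0x5C (ASCII backslash) after the first byte."""
    hits = []
    for ch in text:
        encoded = ch.encode(encoding)
        # Only multibyte characters matter; a literal backslash is expected.
        if len(encoded) > 1 and 0x5C in encoded[1:]:
            hits.append(ch)
    return hits


if __name__ == "__main__":
    # U+30BD (Katakana SO) is 0x83 0x5C in Shift-JIS.
    print(backslash_trail_chars("ソ", "shift_jis"))
    # UTF-8 continuation bytes are 0x80-0xBF, so 0x5C never appears there.
    print(backslash_trail_chars("ソ", "utf-8"))
    # Sample Big5 text chosen for illustration; what is reported depends
    # on the codec's mapping table.
    print(backslash_trail_chars("許功蓋", "big5"))
```

Any character reported by such a check would need encoding-aware handling
in the scanner, which is the part that was dropped along with multibyte
support.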


Regards
Alan


Marcus Boerger wrote:

RFC: REPLACE THE FLEX-BASED SCANNER WITH AN RE2C [1] BASED LEXER

Situation:
The current flex-based lexer depends on an outdated and unsupported flex
version. Alternatives include either updating to a newer version of flex or
using re2c, which we already use for a variety of things (serializing, pdo sql
scanning, date/time parsing). While moving towards a newer flex version would
be much easier, switching to re2c promises a much faster lexer. Actually,
without any specific re2c optimizations we already get around a 20% scanner
performance increase. Running the tests gets an overall speedup of 2%. It is
arguable whether this is enough, but re2c has more advantages. First of all,
re2c allows one to scan any type of input (ASCII, UTF-8, UTF-16, UTF-32).
Secondly, it allows for better integration with Lemon [2], which would be the
next step. And thirdly we can switch to a reentrant scanner.

Current state:
Flex has been fully replaced by re2c in Zend. We have also switched to an
mmap-based lexer approach for now. However, we had to drop multibyte support
as well as the encoding declare. The current state can be checked out from
Scott's subversion repository [3] and you can follow the development on his
Trac setup [4]. When you want to build php with re2c, then you need to grab
re2c from its sourceforge subversion repository [5]. You can also check out
the changes in a patch created Sunday 2nd March against a PHP checkout from 
14th February [6].


Further steps:
Commit this to PHP 5.3. Synch to HEAD. Add pecl/intl to 5.3. Discuss/recreate
multibyte support with libintl.

Future steps:
Replace bison with lemon in PHP 5.4 or HEAD.

Time Frame:
Commit to 5.3 between the 5th and the 15th of March. Synch to HEAD a couple
of days later. Moving pecl/libintl to ext (depends on the 5.3 RMs decision).
After that is done, decide about multibyte support. Along with the commit to
the 5.3 branch there will be a new re2c version available.


Marcus Boerger
Nuno Lopes
Scott MacVicar


[1] http://re2c.org/
[2] http://www.hwaci.com/sw/lemon/
[3] svn://whisky.macvicar.net/php-re2c
[4] http://trac.macvicar.net/php-re2c/
[5] https://re2c.svn.sourceforge.net/svnroot/re2c/trunk/re2c
[6] http://php.net/~helly/php-re2c-20080302.diff.txt



  






Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-02 Thread Stanislav Malyshev

Hi!


Were the stream support issues solved?


We completely dropped multibyte support. The reason is that the way we were


I wasn't asking about multibyte (which we discuss below), but about other 
streams - I think I mentioned it on IRC the last time the re2c parser was 
discussed. I remember the re2c scanner used mmap, and not all files PHP 
can run can be mmap'ed. Was it fixed?
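
The usual way around this limitation is to try mapping first and fall back
to an ordinary read when the source cannot be mapped (pipes, in-memory
streams, empty files). The Python sketch below only illustrates that
pattern; load_script is a hypothetical helper, and this thread does not
say how the re2c branch actually handles the case.

```python
import io
import mmap


def load_script(stream):
    """Return (data, used_mmap) for a binary file-like object.

    Tries to memory-map the stream first and falls back to a plain
    read() when mapping is not possible.
    """
    try:
        with mmap.mmap(stream.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[:], True
    except (OSError, ValueError):
        # io.UnsupportedOperation (no file descriptor) is a subclass of
        # both OSError and ValueError, so it is covered here as well.
        return stream.read(), False


if __name__ == "__main__":
    # An in-memory stream has no file descriptor, so the fallback is used.
    data, used_mmap = load_script(io.BytesIO(b"<?php echo 1; ?>"))
    print(len(data), used_mmap)
```

The scanner then works on one contiguous buffer either way; only the way
the buffer is obtained differs.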



Once we have finished the move to re2c, we can support all of those
correctly. The multibyte support also duplicated the encoding tables
otherwise available in ext/mbstring or ext/iconv or pecl/intl.


pecl/intl per se doesn't have any encoding tables. ICU does, but that 
would mean you have to have ICU to run PHP. That might not be a big 
problem, since ICU is supported by IBM (read: good chance more exotic 
systems would have support), but it is still a dependency on a non-bundled 
third-party library in PHP 5 core. Of course, PHP 6 has this dependency, 
but we might want to avoid such things in 5.x so that you won't have to 
change your system too much while staying on 5.x.



Rely on an unsupported, undocumented feature? I am rather able to build php
and rewrite that support.


Being undocumented is nothing to be proud of; however, as poorly 
documented as it is, it is used. I'm all for implementing it in a better 
way - and having a new parser is a good time to do it. That's exactly the 
reason we shouldn't rush it but do it right this time. There's no 
burning need to have a new parser right now, so we can take a moment 
to think - OK, how do we want multibyte support to work there? And if we 
need some modifications, we'd have the time and flexibility to make them, 
rather than having the code in 5.3, which was supposed to go to RC in Q1 
(ending one month from now).



You are free to contribute and get MB support working up front.


I know I'm free :) However, as much as I understand the eagerness to 
have it in the source tree, I repeat that I do not think dropping 
multibyte support in 5.3 is acceptable. Thus, if it is committed right 
now, 5.3 would have to be deferred until this is resolved. If this is 
resolved in time for 5.3 - great. If not, we had better get it right in 
5.4 than wrong in 5.3.

--
Stanislav Malyshev, Zend Software Architect
http://www.zend.com/




Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

2008-03-02 Thread Stanislav Malyshev
I don't think this part is a concern since we have required re2c for 
quite a while now to build many critical parts of PHP.  People who 


OK, great then - the only remaining issue is the multibyte support.

--
Stanislav Malyshev, Zend Software Architect
http://www.zend.com/
