Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
Hello Stanislav,

cool, care to change the code snippet into a test as I've done for Rui's snippet?

marcus

Sunday, March 23, 2008, 5:06:53 AM, you wrote:

is broken code and not a single test. If this is not going to change as in we are not getting any .phpt files for this feature then there are two

As I understand it, the theory of the thing should be pretty simple: you set the input encoding (by config or declare) and the internal encoding, and then when the script is being read, you convert it from input to internal. However, it appears that since flex couldn't stomach certain encodings, there's also a hack there - the script is translated from the input encoding to some safe encoding for flex, and then strings are translated back to the internal encoding after flex processes them. If re2c can deal with encodings like SJIS without trouble then some of the hacks might be unnecessary. I think the encodings that need to be checked are those in zend_multibyte.c that have the compatible flag off.

Here's a short script example I found that shows what the problem is:

    <?php echo 'ソ'; ?>

The character echoed there is U+30BD Katakana letter SO. If you run it in UTF-8, it works fine. However, if you recode it to Shift-JIS, it won't run, since the script looks to the parser this way:

    <?php echo '83\'; ?>

(that's a dump of VI output, so replace 83 with an actual 0x83 if you compose it). That's a parse error for the parser, if parsed naively. So somehow the parser needs to know 0x83+\ is actually U+30BD, and at the same time the user still might want it as 0x83+\ in a zval (or maybe as utf-8 - it depends on him).

-- 
Stanislav Malyshev, Zend Software Architect [EMAIL PROTECTED] http://www.zend.com/ (408)253-8829 MSN: [EMAIL PROTECTED]

Best regards, Marcus

-- 
PHP Internals - PHP Runtime Development Mailing List To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
Hello Alan, Andi, Rui,

my impression still is that not a single person uses this crap. I only hear of people claiming they have heard that people use it. But what I see is broken code and not a single test. If this is not going to change, as in we are not getting any .phpt files for this feature, then there are two ways. First, I implement something that I personally would expect and I wouldn't care about anything that is there right now, or second, we simply get rid of it completely.

So far I have extended re2c to make it easier to deal with other encodings and even allow multiple char widths at the same time. So I did my homework. Now I expect that somebody writes tests! Then we could provide a scanner that works on UCS-2 or on UTF-32 and then try to identify the script encoding. Then work on the extended charset and do a reverse encoding if necessary for output. Though even thinking about this approach (still like what we seem to have right now) really hurts me very badly, because it is the wrong approach. What we want is a working HEAD.

marcus

Monday, March 3, 2008, 4:19:24 PM, you wrote:

a few replaces with this file should be a good testcase - probably worth testing
* comments with these characters in them, both /* and //
* strings with these characters in them.

lynx -source 'http://smontagu.damowmow.com/genEncodingTest.cgi?family=windowscodepage=950' | grep test | grep -v testcase

I have definitely seen code with Chinese characters in comments and strings, and a few times function names and variable names with Chinese characters...

Regards Alan

Marcus Boerger wrote:

Hello Alan, be my hero then :-) Could you generate a few tests for the multibyte support so that we know how it is used right now and what we need to take care of?

marcus

Monday, March 3, 2008, 12:48:44 AM, you wrote:

Can you clarify the Multibyte issues: - I presume this means that it can handle ASCII/UTF8/16 etc.
but will not handle things like BIG5/GB encoding in source code - this may be a bit of an issue around here.. Regards Alan Marcus Boerger wrote: RFC: REPLACE THE FLEX-BASED SCANNER WITH AN RE2C [1] BASED LEXER Situation: The current flex-based lexer depends on an outdated and unsupported flex version. Alternatives include either updating to a newer version of flex or using re2c, which we already use for a variety of things (serializing, pdo sql scanning, date/time parsing). While moving towards a newer flex version would be much easier, switching to re2c promises a much faster lexer. Actually, without any specific re2c optimizations we already get around a 20% scanner performance increase. Running the tests gets an overall speedup of 2%. It is arguable whether this is enough, but re2c has more advantages. First of all, re2c allows one to scan any type of input (ASCII, UTF-8, UTF-16, UTF-32). Secondly, it allows for better integration with Lemon [2], which would be the next step. And thirdly we can switch to a reentrant scanner. Current state: Flex has been fully replaced by re2c in Zend. We have also switched to an mmap-based lexer approach for now. However, we had to drop multibyte support as well as the encoding declare. The current state can be checked out from Scott's subversion repository [3] and you can follow the development on his Trac setup [4]. When you want to build php with re2c, then you need to grab re2c from its sourceforge subversion repository [5]. You can also check out the changes in a patch created Sunday 2nd March against a PHP checkout from 14th February [6]. Further steps: Commit this to PHP 5.3. Synch to HEAD. Add pecl/intl to 5.3. Discuss/recreate multibyte support with libintl. Future steps: Replace bison with lemon in PHP 5.4 or HEAD. Time Frame: Commit to 5.3 between the 5th and the 15th of March. Synch to HEAD a couple of days later. Moving pecl/libintl to ext (depends on the 5.3 RMs decision). 
After that is done, decide about multibyte support. Along with the commit to the 5.3 branch there will be a new re2c version available. Marcus Boerger Nuno Lopes Scott MacVicar [1] http://re2c.org/ [2] http://www.hwaci.com/sw/lemon/ [3] svn://whisky.macvicar.net/php-re2c [4] http://trac.macvicar.net/php-re2c/ [5] https://re2c.svn.sourceforge.net/svnroot/re2c/trunk/re2c [6] http://php.net/~helly/php-re2c-20080302.diff.txt Best regards, Marcus Best regards, Marcus -- PHP Internals - PHP Runtime Development Mailing List To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
is broken code and not a single test. If this is not going to change as in we are not getting any .phpt files for this feature then there are two

As I understand it, the theory of the thing should be pretty simple: you set the input encoding (by config or declare) and the internal encoding, and then when the script is being read, you convert it from input to internal. However, it appears that since flex couldn't stomach certain encodings, there's also a hack there - the script is translated from the input encoding to some safe encoding for flex, and then strings are translated back to the internal encoding after flex processes them. If re2c can deal with encodings like SJIS without trouble then some of the hacks might be unnecessary. I think the encodings that need to be checked are those in zend_multibyte.c that have the compatible flag off.

Here's a short script example I found that shows what the problem is:

    <?php echo 'ソ'; ?>

The character echoed there is U+30BD Katakana letter SO. If you run it in UTF-8, it works fine. However, if you recode it to Shift-JIS, it won't run, since the script looks to the parser this way:

    <?php echo '83\'; ?>

(that's a dump of VI output, so replace 83 with an actual 0x83 if you compose it). That's a parse error for the parser, if parsed naively. So somehow the parser needs to know 0x83+\ is actually U+30BD, and at the same time the user still might want it as 0x83+\ in a zval (or maybe as utf-8 - it depends on him).

-- 
Stanislav Malyshev, Zend Software Architect [EMAIL PROTECTED] http://www.zend.com/ (408)253-8829 MSN: [EMAIL PROTECTED]
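The Shift-JIS problem described above can be reproduced outside of PHP. The following is only a sketch in Python, not the Zend scanner: `find_closing_quote` is a hypothetical helper that mimics a naive scanner which treats 0x5C as an escape character. It shows that the second byte of U+30BD in Shift-JIS is 0x5C, so the byte-wise scan never finds the closing quote, while the same scan over decoded code points (roughly the UTF-32 style scanner discussed in this thread) does.

```python
# Sketch only: a toy scanner for single-quoted literals, not PHP's lexer.
# find_closing_quote is a hypothetical name introduced for illustration.

BACKSLASH = 0x5C
QUOTE = 0x27


def find_closing_quote(units, start):
    """Return the index of the quote closing the literal opened at `start`.

    `units` is any sequence of integers: raw bytes, or Unicode code points.
    A backslash escapes the following unit. Returns -1 if unterminated.
    """
    i = start + 1
    while i < len(units):
        if units[i] == BACKSLASH:
            i += 2  # skip the backslash and the unit it escapes
            continue
        if units[i] == QUOTE:
            return i
        i += 1
    return -1


so = "\u30bd"  # Katakana letter SO
sjis_bytes = so.encode("shift_jis")
utf8_bytes = so.encode("utf-8")

script_sjis = b"<?php echo '" + sjis_bytes + b"'; ?>"
script_utf8 = b"<?php echo '" + utf8_bytes + b"'; ?>"
open_quote = script_sjis.index(b"'")

# Byte-wise scan: the trail byte 0x5C swallows the real closing quote.
naive_sjis = find_closing_quote(script_sjis, open_quote)
naive_utf8 = find_closing_quote(script_utf8, open_quote)

# Decode first, then scan code points: the literal terminates as expected.
code_points = [ord(ch) for ch in script_sjis.decode("shift_jis")]
decoded_sjis = find_closing_quote(code_points, open_quote)

print(sjis_bytes.hex(), naive_sjis, naive_utf8, decoded_sjis)
```

Decoding before scanning fixes the tokenizing step, but it does not by itself answer the second half of the question above, namely which byte sequence should end up in the zval.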
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
On 04.03.2008 21:28, Stanislav Malyshev wrote:

Hi!

Right. Please take more time if needed, no need to rush and release something half-working. If it takes several months to prepare 5.3 release, let it be so.

With this approach we would never release 5.3 - each couple of months somebody would have a cool idea which would only require initial commit and 2-3 months work on it on CVS, which delays the release - and then it goes to the next idea. We should cut it off somewhere - not because these ideas are bad - they aren't, but because we have to have releases.

Even though I do agree that delaying the release every 2-3 months is bad, I believe this particular case deserves some special treatment. And btw this is a major release, not just a bugfix one, so everyone (Zend included) should spend even more time to make sure there are no regressions whatsoever. Releasing a half-working version just because we have to have releases is total nonsense. So please instead of arguing with me, help Marcus and the others if you don't want the release postponed.

The best idea is worth nothing for the users unless it's part of the release. 5.3 is not the last version of PHP

Making new 5.x releases each year makes no sense to me, so 5.3 seems to be perfect candidate for the next several years if we want to implement something major.

After all, we're not a commercial company that has to roll out a release every couple of months under pressure of share holders and overall competition.

If you think that because PHP project is not a commercial company it doesn't have to adhere to the laws of markets, popularity and users expectations - you are mistaken.

These are the last things I think of. The most important is to make it as stable as we can.

We still have to take into account millions of PHP users, even though they don't pay us money directly.

Right, and they want PHP to do its job and to do it good.

And it's open source, which was "release often" last time I checked ;)

Wow, that's the most serious argument ever!

-- 
Wbr, Antony Dovgal
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
On Tue, 2008-03-04 at 20:17 +0100, Hannes Magnusson wrote:

I'll hunt you all down and make you eat 1kg of vegetables each day after the 5.3 release until proper documentation and upgrade guides have been written.

I already eat that many vegetables a day... what's my punishment? :-p (and Pierre promised to handle the php.ini docs.. :D)

--Jani
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
Hi!

Even though I do agree that delaying the release every 2-3 months is bad, I believe this particular case deserves some special treatment.

Why? We have a perfectly working parser now and no immediate need to replace it. I agree that the new parser is faster and better, but we are perfectly capable of living without it for half a year until it's polished, if that proves to be the situation.

Releasing a half-working version just because we have to have releases is total nonsense.

Fully agreed here. That's why I'm against committing the new parser without multibyte support.

So please instead of arguing with me, help Marcus and the others if you don't want the release postponed.

Unfortunately, I do not know Marcus' code and may not have resources to help him right now. Please keep in mind that while I am happy to help whenever I can, I am not under obligation to help on call to any project as soon as anybody wants me to, just because he wants it. That said, if somebody can and does fix the new parser to support MB in reasonable time - I'm all for it.

Making new 5.x releases each year makes no sense to me, so 5.3 seems to be perfect candidate for the next several years if we want to implement something major.

What's wrong with making new 5.x releases each year if needed?

Right, and they want PHP to do its job and to do it good.

Having no multibyte support, which is used by a lot of people, does not qualify as "do its job and to do it good". What qualifies is either 5.3 with the old parser or 5.3 with the new parser, fully compatible. As I believe I already explained about delaying the release etc., I won't repeat myself here.

-- 
Stanislav Malyshev, Zend Software Architect [EMAIL PROTECTED] http://www.zend.com/ (408)253-8829 MSN: [EMAIL PROTECTED]
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
On 04.03.2008 12:38, Marcus Boerger wrote:

This sounds like we are going to make the same mistake over and over and over again. Who is forcing a hard time line on us? Why are we late in the development? I don't get it at all.

Right. Please take more time if needed, no need to rush and release something half-working. If it takes several months to prepare 5.3 release, let it be so. After all, we're not a commercial company that has to roll out a release every couple of months under pressure of share holders and overall competition.

-- 
Wbr, Antony Dovgal
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
Hello Andi, Tuesday, March 4, 2008, 7:51:07 AM, you wrote: Hi Marcus, Johannes, and all, First of all let me say that I have no conceptual problem with replacing the scanner with re2c. If it's cleaner, performs better and a better maintained piece of software (let's hope Marcus doesn't get run over) then we can move to re2c. There are a few important things to consider though: - There is a huge PHP/MySQL community in the far east especially in Japan. You may not hear as much from them because they mostly don't post on our public lists but it's large. They very much depend on multibyte support and it works well for them (I have talked to several people in those communities). Shift-JIS is a matter of fact for those communities. We can't just dump them in PHP 5.3. - We need to make sure that we have a streams story that works and existing functionality is supported by it (sounds like this is almost complete so probably not high risk). - We should make sure we can achieve compatibility including supporting functionality like declare(...) which is used by some including multibyte guys. I haven't heard of a reason why this couldn't be possible with RE2C. I think all the above is doable but we shouldn't ship without accomplishing that 100% compatibility especially telling the non-Latin world that we will stop supporting them. So at the end of the day it all boils down to timing. I have been expecting Johannes to cut a beta any day now (I realize Sun acquisition somewhat postponed his schedule). PHP 5.3 is on a pretty good track to a good stable release cycle. I think re-engineering a core piece of the engine at this point adds considerable risk and would definitely prolong the release cycle. So while I'm supportive of embracing RE2C if we get commitment to reach that 100% compatibility including multibyte support, I don't quite understand the sense of urgency and why we'd want to introduce this risk so late in the development of PHP 5.3. 
This is a risk the release manager shouldn't really be willing to take. Rewriting this multibyte support will require time and interaction with the communities that are currently using it to make sure that it meets their needs. It will not be a trivial project. We can definitely work towards RE2C in parallel and as Stas said the engine hasn't really been changing very much recently to make this hard (we finished our todos for 5.3). We could even branch off PHP 5.4 right after RC1 for PHP 5.3 and therefore reduce the time where this patch would need to be maintained separately (although I think it can already be maintained in a branch). Let's consider all the angles in addition to wanting to get the code in the tree asap.

Andi

This sounds like we are going to make the same mistake over and over and over again. Who is forcing a hard time line on us? Why are we late in the development? I don't get it at all. We haven't done all the steps that were on our radar for 5.3. Now that we have finally found time to address this we should do it. Otherwise the consequence is just that we have to do a 5.4 version immediately. What is the reason for that, who is happier with a 5.3 now? Are we a company that makes money with selling upgrades?

Best regards, Marcus
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
Hi!

Right. Please take more time if needed, no need to rush and release something half-working. If it takes several months to prepare 5.3 release, let it be so.

With this approach we would never release 5.3 - every couple of months somebody would have a cool idea which would only require an initial commit and 2-3 months of work on it in CVS, which delays the release - and then it goes to the next idea. We should cut it off somewhere - not because these ideas are bad - they aren't, but because we have to have releases. The best idea is worth nothing for the users unless it's part of the release. 5.3 is not the last version of PHP, and we have quite a bunch of stuff there already - so I think it makes sense to have a release of what we have or will have soon, all while continuing to develop the ideas for next versions.

After all, we're not a commercial company that has to roll out a release every couple of months under pressure of share holders and overall competition.

If you think that because the PHP project is not a commercial company it doesn't have to adhere to the laws of markets, popularity and users' expectations - you are mistaken. We still have to take into account millions of PHP users, even though they don't pay us money directly. And it's open source, which was "release often" last time I checked ;)

-- 
Stanislav Malyshev, Zend Software Architect [EMAIL PROTECTED] http://www.zend.com/ (408)253-8829 MSN: [EMAIL PROTECTED]
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
Hello Andi, Tuesday, March 4, 2008, 7:51:07 AM, you wrote: Hi Marcus, Johannes, and all, First of all let me say that I have no conceptual problem with replacing the scanner with re2c. If it's cleaner, performs better and a better maintained piece of software (let's hope Marcus doesn't get run over) then we can move to re2c. There are a few important things to consider though: - There is a huge PHP/MySQL community in the far east especially in Japan. You may not hear as much from them because they mostly don't post on our public lists but it's large. They very much depend on multibyte support and it works well for them (I have talked to several people in those communities). Shift-JIS is a matter of fact for those communities. We can't just dump them in PHP 5.3. - We need to make sure that we have a streams story that works and existing functionality is supported by it (sounds like this is almost complete so probably not high risk). - We should make sure we can achieve compatibility including supporting functionality like declare(...) which is used by some including multibyte guys. I haven't heard of a reason why this couldn't be possible with RE2C. I think all the above is doable but we shouldn't ship without accomplishing that 100% compatibility especially telling the non-Latin world that we will stop supporting them. So at the end of the day it all boils down to timing. I have been expecting Johannes to cut a beta any day now (I realize Sun acquisition somewhat postponed his schedule). PHP 5.3 is on a pretty good track to a good stable release cycle. I think re-engineering a core piece of the engine at this point adds considerable risk and would definitely prolong the release cycle. So while I'm supportive of embracing RE2C if we get commitment to reach that 100% compatibility including multibyte support, I don't quite understand the sense of urgency and why we'd want to introduce this risk so late in the development of PHP 5.3. 
This is a risk the release manager shouldn't really be willing to take. Rewriting this multibyte support will require time and interaction with the communities that are currently using it to make sure that it meets their needs. It will not be a trivial project. We can definitely work towards RE2C in parallel and as Stas said the engine hasn't really been changing very much recently to make this hard (we finished our todos for 5.3). We could even branch off PHP 5.4 right after RC1 for PHP 5.3 and therefore reduce the time where this patch would need to be maintained separately (although I think it can already be maintained in a branch). Let's consider all the angles in addition to wanting to get the code in the tree asap. Andi Give me any reason why we need 5.4 at this point? Any single one? Are you having a bet or a deal about 5.3 release date? And what is the deal, you do whatever you think goes in and that's a law? Best regards, Marcus -- PHP Internals - PHP Runtime Development Mailing List To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
Hi!

Improving on that statement: The coolest feature ever is worth absolutely nothing unless it is documented.

I agree with the intent - documentation is *very* important. Even so, people use undocumented features too (probably cursing the lazy developers on the way ;)

BTW, as far as I remember, we have at least 4 undocumented features right now sitting in 5.3 CVS, so if anybody wants to do something cool, that's a good place:
- Nowdocs aren't documented
- .htaccess-like .ini files undocumented
- [HOST=] and [PATH=] .ini sections undocumented
- new version constants undocumented

-- 
Stanislav Malyshev, Zend Software Architect [EMAIL PROTECTED] http://www.zend.com/ (408)253-8829 MSN: [EMAIL PROTECTED]
RE: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
-Original Message-
From: Hannes Magnusson [mailto:[EMAIL PROTECTED]
Sent: Tuesday, March 04, 2008 11:18 AM
To: Stas Malyshev
Cc: Antony Dovgal; Marcus Boerger; Andi Gutmans; internals@lists.php.net
Subject: Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer

Improving on that statement: The coolest feature ever is worth absolutely nothing unless it is documented. Don't care if it's a new language construct, new class, function or method, optional parameter, new syntax in php.ini, errorlevel, dropped warnings or an awesome --enable-zend-multibyte configure switch. If it isn't documented it's totally useless for anyone not reading php-cvs, zend-engine-cvs and this list daily. I'll hunt you all down and make you eat 1kg of vegetables each day after the 5.3 release until proper documentation and upgrade guides have been written. Mark my words my friends, mark my words! ;)

Why do you say it's not documented?
http://www.aconus.com/~oyaji/www/apache_linux_php.htm
http://tinyurl.com/2o8pq2

OK, just kidding, and I agree it would be nice to have it better documented in the mainstream docs. As it applies mostly to the Asian users though (Chinese/Japanese), who usually seek localized docs, it's probably not as good as it should be on php.net.

Andi
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
On Tue, Mar 4, 2008 at 8:38 PM, Andi Gutmans [EMAIL PROTECTED] wrote:

OK just kidding and I agree it would be nice to have it better documented in the mainstream docs. As it applies mostly to the Asian users though (Chinese/Japanese) who usually seek localized docs it's probably not as good as it should be in php.net.

The Japanese docs are 100% up-to-date with the English docs, so they shouldn't have any problem reading our docs. In fact, if you make changes in the en/ tree, Takagi Masahiro will have them translated within 24 hours - even if that change spanned 50 files. Not kidding.

-Hannes
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
On Tue, Mar 4, 2008 at 8:38 PM, Andi Gutmans [EMAIL PROTECTED] wrote: Why do you say it's not documented? http://www.aconus.com/~oyaji/www/apache_linux_php.htm http://tinyurl.com/2o8pq2 According to the latter link, our windows binaries don't enable zend-multibyte, is this true? -Hannes -- PHP Internals - PHP Runtime Development Mailing List To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
On Sun, 2 Mar 2008, Marcus Boerger wrote:

However, we had to drop multibyte support as well as the encoding declare.

Just wondering, why did you have to drop the declare(encoding=...)? It's just ignored in PHP 5.x - and it is useful to have for migrating PHP 5.3 apps to 6. So can you at least make the new parser just ignore this statement?

regards, Derick

-- 
Derick Rethans http://derickrethans.nl | http://ezcomponents.org | http://xdebug.org
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
Hi Derick,

On Mon, 2008-03-03 at 09:28 +0100, Derick Rethans wrote:

On Sun, 2 Mar 2008, Marcus Boerger wrote:

However, we had to drop multibyte support as well as the encoding declare.

Just wondering, why did you have to drop the declare(encoding=...)? It's just ignored in PHP 5.x - and it is useful to have for migrating PHP 5.3 apps to 6. So can you at least make the new parser just ignore this statement?

It is not ignored in PHP 5, as Marcus and I found out while reading the code :-) If you compile with --enable-zend-multibyte you can change the encoding using declare, even multiple times per file, it seems.

johannes
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
Hello Derick,

actually you get a message (E_COMPILE_WARNING) that this is not supported. Maybe we could turn this into an E_NOTICE though.

marcus

Monday, March 3, 2008, 9:28:01 AM, you wrote:

On Sun, 2 Mar 2008, Marcus Boerger wrote:

However, we had to drop multibyte support as well as the encoding declare.

Just wondering, why did you have to drop the declare(encoding=...)? It's just ignored in PHP 5.x - and it is useful to have for migrating PHP 5.3 apps to 6. So can you at least make the new parser just ignore this statement?

regards, Derick

-- 
Derick Rethans http://derickrethans.nl | http://ezcomponents.org | http://xdebug.org

Best regards, Marcus
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
On Mon, 3 Mar 2008, Marcus Boerger wrote:

actually you get a message (E_COMPILE_WARNING) that this is not supported. Maybe we could turn this into an E_NOTICE though.

No, I don't get any warning/notice/whatever with PHP 5.3:

    $ php-5.3dev -derror_reporting=65535
    <?php declare(encoding="utf-8"); echo "foo\n"; ?>
    foo

Please don't break this.

regards, Derick
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
Hello Stanislav,

Monday, March 3, 2008, 5:39:35 AM, you wrote:

Hi!

Were the stream support issues solved?

We completely dropped multibyte support. The reason is that the way we were

I wasn't asking about multibyte (that we discuss below), but about other streams - I think I mentioned it on IRC last time re2c parser was discussed. I remember re2c used mmap, and not all files PHP can run can be mmap'ed. Was it fixed?

Ah, you didn't write that so I got confused. Anyway, what we are doing is the following order:
1) If mmap is supported, then use it
2) If mmap is not supported or does not work then read the whole stream
3) If that is not possible read char by char
The flex based scanner reads in smaller chunks or char by char, so it is more or less always like case 3. Once we have finished the move to re2c, we can support all of those correctly.

The multibyte support also duplicated the encoding tables otherwise available in ext/mbstring or ext/iconv or pecl/intl.

pecl/intl per se doesn't have any encoding tables. ICU does, but that would mean you have to have ICU to run PHP. That might not be a big problem since ICU is supported by IBM (read: good chance more exotic systems would have support), but it is still a dependency on a non-bundled 3rd party library in PHP 5 core. Of course, PHP 6 has this dependency, but we might want to not have such things in 5.x so that you won't have to change your system too much while staying on 5.x.

Are you saying we cannot depend on ICU in PHP 6 and have to redo it completely or what?

Rely on a not supported undocumented feature? I am rather able to build php and rewrite that support.

Being undocumented is nothing to be proud of, however as poorly documented as it is, it is used. I'm all for implementing it in a better way - and having new parser is a good time to do it. That's exactly the reason we shouldn't rush with it but do it right this time.
There's no burning need to have a new parser right now, so we can have some moment to think - ok, how do we want multibyte support there to work? And if we might need some modifications, we'd have time and flexibility to do it, not having the code in 5.3, which was supposed to go to RC in Q1 (ending 1 month from now).

You are free to contribute and make MB support working upfront.

I know I'm free :) However, as much as I understand the eagerness of having it in the source tree, I repeat that I do not think dropping multibyte support in 5.3 is acceptable. Thus, if it is committed right now, 5.3 would have to be deferred until this is resolved. If this is resolved timely for 5.3 - great. If not, we better get it in 5.4 right than in 5.3 wrong.

I don't see a problem with redoing multibyte support in a usable way. Actually we had better redo it anyway, because it is a very bad solution as it is right now. That is, the current solution duplicates the input and uses a flattening filter to always scan an eight-bit input stream. Then, when something needs to get pushed to the output, we recalculate the position in the original input and use that part. Changing to re2c we can do a very easy solution. When requested or detected per BOM, we switch to a second version of the scanner that works on unsigned int and supports the full Unicode character set (the only thing to do for re2c is to switch the input type and, guess what, this is already in production on a lot of systems). Other approaches are to natively support UTF-8 and UTF-16 besides 8 bit and UTF-32. Furthermore, we can apply any kind of filtering correctly on top of the UTF-* scanner. I know there is some work left, but when we do not apply the work now then we basically have two engines. In that case I'll just rewrite the engine completely and replace it.

Best regards, Marcus
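For readers following along, the two mechanisms described in this message (the mmap, then whole-stream, then char-by-char input order, and switching to a wide scanner when a BOM is detected) can be sketched roughly as follows. This is Python for illustration only and not the code in the re2c branch; `load_source`, `detect_bom` and `pick_scanner` are hypothetical names, and the "8-bit" / "32-bit" labels merely stand in for the two scanner variants mentioned above.

```python
import codecs
import io
import mmap
from typing import BinaryIO, Optional, Tuple


def load_source(stream: BinaryIO) -> Tuple[bytes, str]:
    """Return (data, strategy) using the fallback order from the message."""
    # 1) Try mmap. Streams without a real file descriptor, or empty files,
    #    raise here and fall through to the next strategy.
    try:
        with mmap.mmap(stream.fileno(), 0, access=mmap.ACCESS_READ) as view:
            # Copied to bytes only to keep the sketch simple; a real scanner
            # would read from the mapping directly.
            return bytes(view), "mmap"
    except (AttributeError, OSError, ValueError):
        pass
    # 2) Read the whole stream in one go.
    try:
        return stream.read(), "read-all"
    except (OSError, MemoryError):
        pass
    # 3) Last resort: one byte at a time.
    collected = bytearray()
    while True:
        unit = stream.read(1)
        if not unit:
            break
        collected += unit
    return bytes(collected), "byte-by-byte"


# UTF-32 must be tested before UTF-16: the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE).
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def detect_bom(data: bytes) -> Optional[str]:
    """Return the encoding announced by a leading BOM, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def pick_scanner(data: bytes, requested: Optional[str] = None) -> str:
    """Choose the wide scanner when an encoding is requested or detected."""
    return "32-bit" if (requested or detect_bom(data)) else "8-bit"


plain, how = load_source(io.BytesIO(b"<?php echo 1; ?>"))
with_bom = codecs.BOM_UTF8 + plain
print(how, pick_scanner(plain), pick_scanner(with_bom))
```

An in-memory stream has no file descriptor, so the example falls through to the second strategy; a regular file on disk would normally take the mmap path.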
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
Hello Alan, be my hero then :-) Could you generate a few tests for the multibyte support so that we know how it is used right now and what we need to take care of? marcus Monday, March 3, 2008, 12:48:44 AM, you wrote: Can you clarify the Multibyte issues: - I presume this means that it can handle ASCII/UTF8/16 etc. but will not handle things like BIG5/GB encoding in source code - this may be a bit of an issue around here.. Regards Alan Marcus Boerger wrote: RFC: REPLACE THE FLEX-BASED SCANNER WITH AN RE2C [1] BASED LEXER Situation: The current flex-based lexer depends on an outdated and unsupported flex version. Alternatives include either updating to a newer version of flex or using re2c, which we already use for a variety of things (serializing, pdo sql scanning, date/time parsing). While moving towards a newer flex version would be much easier, switching to re2c promises a much faster lexer. Actually, without any specific re2c optimizations we already get around a 20% scanner performance increase. Running the tests gets an overall speedup of 2%. It is arguable whether this is enough, but re2c has more advantages. First of all, re2c allows one to scan any type of input (ASCII, UTF-8, UTF-16, UTF-32). Secondly, it allows for better integration with Lemon [2], which would be the next step. And thirdly we can switch to a reentrant scanner. Current state: Flex has been fully replaced by re2c in Zend. We have also switched to an mmap-based lexer approach for now. However, we had to drop multibyte support as well as the encoding declare. The current state can be checked out from Scott's subversion repository [3] and you can follow the development on his Trac setup [4]. When you want to build php with re2c, then you need to grab re2c from its sourceforge subversion repository [5]. You can also check out the changes in a patch created Sunday 2nd March against a PHP checkout from 14th February [6]. Further steps: Commit this to PHP 5.3. Synch to HEAD. 
Add pecl/intl to 5.3. Discuss/recreate multibyte support with libintl.

Future steps: Replace bison with lemon in PHP 5.4 or HEAD.

Time Frame: Commit to 5.3 between the 5th and the 15th of March. Synch to HEAD a couple of days later. Moving pecl/libintl to ext (depends on the 5.3 RMs decision). After that is done, decide about multibyte support. Along with the commit to the 5.3 branch there will be a new re2c version available.

Marcus Boerger
Nuno Lopes
Scott MacVicar

[1] http://re2c.org/
[2] http://www.hwaci.com/sw/lemon/
[3] svn://whisky.macvicar.net/php-re2c
[4] http://trac.macvicar.net/php-re2c/
[5] https://re2c.svn.sourceforge.net/svnroot/re2c/trunk/re2c
[6] http://php.net/~helly/php-re2c-20080302.diff.txt

Best regards, Marcus
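The RFC quotes two figures: roughly 20% for the scanner alone and 2% for a full test run. One hedged way to relate them is Amdahl's law. The sketch below assumes that "20%" means the scanner runs 1.20 times as fast and "2%" means the whole run is 1.02 times as fast; the RFC does not define either figure that precisely, so the result is only an estimate under those assumptions. The function names are made up for illustration.

```python
def amdahl(share, part_speedup):
    """Amdahl's law: total speedup when only `share` of the runtime gets faster."""
    return 1 / ((1 - share) + share / part_speedup)


def scanner_share(part_speedup, total_speedup):
    """Solve total = 1 / ((1 - f) + f / s) for the fraction f."""
    return (1 - 1 / total_speedup) / (1 - 1 / part_speedup)


if __name__ == "__main__":
    share = scanner_share(1.20, 1.02)
    print(f"implied scanner share of runtime: {share:.3f}")
    print(f"ceiling with an infinitely fast scanner: {amdahl(share, float('inf')):.3f}")
```

Under these assumptions the scanner would account for about 12% of the test run, which would explain why a large scanner gain shows up as a small overall gain.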
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
Hello Derick, ok, for now I changed it to not issue any error at all. marcus

Monday, March 3, 2008, 11:28:31 AM, you wrote:

On Mon, 3 Mar 2008, Marcus Boerger wrote: actually you get a message (E_COMPILE_WARNING) that this is not supported. Maybe we could turn this into an E_NOTICE though.

No, I don't get any warning/notice/whatever with PHP 5.3:

[EMAIL PROTECTED]:~$ php-5.3dev -derror_reporting=65535
<?php declare(encoding="utf-8"); echo "foo\n"; ?>
foo

Please don't break this. regards, Derick

Best regards, Marcus
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
On 03.03.2008, at 00:48, Alan Knowles wrote:

Can you clarify the Multibyte issues: - I presume this means that it can handle ASCII/UTF8/16 etc. but will not handle things like BIG5/GB encoding in source code - this may be a bit of an issue around here..

At first I also thought that this had something to do with ext/mbstring, but since then I have learned that this is not the case. However, this confusion is likely what causes many people to enable zend mb support. So the question to Stas (and Alan and the rest of the world) is whether they really have a script in the wild that actually requires this switch and would break if it were disabled. And if there is such a script, what exactly are the needs, and how can they be met in 5.3 using re2c?

regards, Lukas
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
Hi,

On Sun, 2008-03-02 at 14:47 -0800, Stanislav Malyshev wrote:

Hi! be much easier, switching to re2c promises a much faster lexer. Actually, without any specific re2c optimizations we already get around a 20% scanner

I think 20% faster is very cool. However, as I understand re2c is not a standard tool found everywhere. So what happens if you wanted to use it on some exotic system where re2c is not readily available as maintainer-supported software? Also, flex is available on Windows for example as part of cygwin, while I don't see re2c there. I understand this can be of low importance since we keep generated files in our repositories, but I think we still have to keep it in mind. I also understand the current patch requires a non-release version of re2c - maybe we should have some release version at least until we make PHP depend on it?

We need a change there anyways, flex 2.5.4 is bundled with fewer systems, even my Solaris 20 box has 2.5.33 instead of 2.5.4 by default. And I think changing to something which is maintained by one of our main contributors might be beneficial for us.

Note - pecl/intl does nothing towards multibyte support etc., at least for now. If there are volunteers to change that, it can be discussed, but so far it is for doing entirely other things (locale-dependent functionality mostly). So, I think before the re2c parser can be merged the issue with multibyte compatibility must be solved - otherwise it will make the users that rely on it unable to use newer PHP. As cool as 20% faster is, I think we can't drop support for such a feature, especially not in 5.3.

Commit to 5.3 between the 5th and the 15th of March. Synch to HEAD a couple of days later. Moving pecl/libintl to ext (depends on the 5.3 RMs decision). After that is done, decide about multibyte support. Along with the commit to the 5.3 branch there will be a new re2c version available.

I think we first need to figure out what happens to multibyte support, and not commit anything before we have it figured out. Multibyte support is an important piece of functionality for some PHP users, and it works now. Breaking it without providing any alternative - especially now that we have 5.3 mostly ready for the release cycle, and solving multibyte problems with re2c may take an undefined amount of time, as far as I understand. I do not think it would be acceptable to release 5.3 without multibyte support, so the option here is either to merge it now and have 5.3 waiting until MB is figured out, or try to figure it out before commit and if we can't in a reasonable term, go forward with 5.3 and defer the parser change for 5.4.

Since there's no documentation about zend-multibyte stuff I spent some time searching for other resources about it, but except bug reports I found nothing where it was required. I'm sure there are some but comments like TODO: support widechars in the code give me the impression that it doesn't really work... and I guess many people just enable it since it sounds important, not due to the fact that they really need it. Of course I might be wrong so I'd be interested in use cases for --enable-zend-multibyte stuff. Maybe we can fulfill the needs without the switch. If there are good use cases for that switch I won't like to replace some small engine thingy with a huge external library like ICU. And I doubt that more than just a few people know what it really does - Marcus and I just found out while working on that stuff over the weekend.

Again, while I think the speedup is great and congratulate Marcus, Nuno and Scott on great work, I think we should keep in mind we have a working parser right now and changing it in an incompatible way is very high-risk and should not be taken hastily.

Right, it's great work they did there but a broken scanner would be one of the worst things we might ship. So I'd invite everybody to check out that version from SVN (see Marcus's mail) and test it using the worst stuff you can think of :-)

johannes
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
a few replaces with this file should be a good testcase - probably worth testing
* comments with these characters in them. both /* and //
* string with these characters in them.

lynx -source 'http://smontagu.damowmow.com/genEncodingTest.cgi?family=windowscodepage=950' | grep test | grep -v testcase

I have definitely seen code with Chinese characters in comments and strings and a few times function names and variable names with Chinese characters...

Regards Alan

Marcus Boerger wrote:

Hello Alan, be my hero then :-) Could you generate a few tests for the multibyte support so that we know how it is used right now and what we need to take care of? marcus

Monday, March 3, 2008, 12:48:44 AM, you wrote:

Can you clarify the Multibyte issues: - I presume this means that it can handle ASCII/UTF8/16 etc. but will not handle things like BIG5/GB encoding in source code - this may be a bit of an issue around here.. Regards Alan

Marcus Boerger wrote:

RFC: REPLACE THE FLEX-BASED SCANNER WITH AN RE2C [1] BASED LEXER

Situation: The current flex-based lexer depends on an outdated and unsupported flex version. Alternatives include either updating to a newer version of flex or using re2c, which we already use for a variety of things (serializing, pdo sql scanning, date/time parsing). While moving towards a newer flex version would be much easier, switching to re2c promises a much faster lexer. Actually, without any specific re2c optimizations we already get around a 20% scanner performance increase. Running the tests gets an overall speedup of 2%. It is arguable whether this is enough, but re2c has more advantages. First of all, re2c allows one to scan any type of input (ASCII, UTF-8, UTF-16, UTF-32). Secondly, it allows for better integration with Lemon [2], which would be the next step. And thirdly we can switch to a reentrant scanner.

Current state: Flex has been fully replaced by re2c in Zend. We have also switched to an mmap-based lexer approach for now. However, we had to drop multibyte support as well as the encoding declare. The current state can be checked out from Scott's subversion repository [3] and you can follow the development on his Trac setup [4]. When you want to build php with re2c, then you need to grab re2c from its sourceforge subversion repository [5]. You can also check out the changes in a patch created Sunday 2nd March against a PHP checkout from 14th February [6].

Further steps: Commit this to PHP 5.3. Synch to HEAD. Add pecl/intl to 5.3. Discuss/recreate multibyte support with libintl.

Future steps: Replace bison with lemon in PHP 5.4 or HEAD.

Time Frame: Commit to 5.3 between the 5th and the 15th of March. Synch to HEAD a couple of days later. Moving pecl/libintl to ext (depends on the 5.3 RMs decision). After that is done, decide about multibyte support. Along with the commit to the 5.3 branch there will be a new re2c version available.

Marcus Boerger
Nuno Lopes
Scott MacVicar

[1] http://re2c.org/
[2] http://www.hwaci.com/sw/lemon/
[3] svn://whisky.macvicar.net/php-re2c
[4] http://trac.macvicar.net/php-re2c/
[5] https://re2c.svn.sourceforge.net/svnroot/re2c/trunk/re2c
[6] http://php.net/~helly/php-re2c-20080302.diff.txt

Best regards, Marcus
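Alan's suggestion of testing comments and strings that contain code page 950 (Big5) characters could be approached along these lines. This is only a sketch: it uses Python's built-in `big5` codec rather than the generator URL from the message, the function names are invented here, and it produces a bare PHP script rather than a complete .phpt file. It picks characters whose second byte is 0x5C (a backslash), since those are the ones most likely to confuse a byte-oriented scanner.

```python
def trail_byte_5c_chars(encoding, first=0x4E00, last=0x9FFF):
    """Return CJK characters whose two-byte form in `encoding` ends in byte 0x5C."""
    found = []
    for code_point in range(first, last + 1):
        ch = chr(code_point)
        try:
            encoded = ch.encode(encoding)
        except UnicodeEncodeError:
            continue  # not representable in this encoding
        if len(encoded) == 2 and encoded[1] == 0x5C:
            found.append(ch)
    return found


def make_php_test(chars, encoding):
    """Build a PHP script with the characters in both comment styles and a string."""
    sample = "".join(chars)
    source = (
        "<?php\n"
        f"/* {sample} */\n"
        f"// {sample}\n"
        f"echo '{sample}';\n"
    )
    return source.encode(encoding)


if __name__ == "__main__":
    candidates = trail_byte_5c_chars("big5")
    script = make_php_test(candidates[:3], "big5")
    print(len(candidates), "candidate characters;", len(script), "bytes of test script")
```

The string case ends with such a character on purpose, so the raw bytes contain 0x5C directly before the closing quote.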
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
Hi!

Since there's no documentation about zend-multibyte stuff I spent some time searching for other resources about it, but except bug reports I found nothing where it was required. I'm sure there are some but comments like TODO: support widechars in the code give me the impression that it doesn't really work... and I guess many people just enable it since it

It does work and there are people using it, even though I imagine it can have some bugs. I guess it would be best to talk to the mbstring maintainer on code details, etc.

If there are good use cases for that switch I won't like to replace some small engine thingy with a huge external library like ICU.

The use cases are scripts written in encodings like Shift-JIS, etc.

And I doubt that more than just a few people know what it really does - Marcus and I just found out while working on that stuff over the weekend.

So I guess documentation is important :) Let it be a lesson to us all.

-- Stanislav Malyshev, Zend Software Architect [EMAIL PROTECTED] http://www.zend.com/ (408)253-8829 MSN: [EMAIL PROTECTED]
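To make the Shift-JIS use case concrete, here is a small sketch of the problem reported earlier in the thread with echo 'ソ'. It is not the flex or re2c scanner, only a toy function (the name `scan_end` is made up) that looks for the end of a single-quoted string and treats a backslash as escaping the next unit. Fed raw Shift-JIS bytes it misses the closing quote; fed code points, or UTF-8 bytes, it finds it.

```python
def scan_end(units, start):
    """Return the index of the closing single quote, or -1 if none is found.

    `units` is a sequence of integers (raw bytes or code points) and `start`
    is the index just after the opening quote. A backslash (0x5C) escapes
    the following unit.
    """
    i = start
    while i < len(units):
        if units[i] == 0x5C:      # backslash: skip the escaped unit
            i += 2
        elif units[i] == 0x27:    # closing single quote
            return i
        else:
            i += 1
    return -1


if __name__ == "__main__":
    source = "<?php echo 'ソ'; ?>"
    sjis = source.encode("shift_jis")
    utf8 = source.encode("utf-8")
    code_points = [ord(ch) for ch in source]

    for label, units in (("shift_jis bytes", sjis),
                         ("utf-8 bytes", utf8),
                         ("code points", code_points)):
        start = units.index(0x27) + 1
        print(label, scan_end(units, start))
```

In Shift-JIS the character is stored as 0x83 0x5C, so the byte-wise scan reads the second byte as a backslash and swallows the closing quote.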
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
Hi,

On Mon, Mar 3, 2008 at 7:59 PM, Stanislav Malyshev [EMAIL PROTECTED] wrote:

Just curious who you were answering to... Anyway, to be clear:
1. PHP 6 is a major version with its major feature being Unicode support.
2. PHP 5.x is a same-major branch, where you are not expected to have to change your system in order to upgrade.
3. We do not expect people to take PHP 6 and have absolutely everything work instantly from PHP 5. We try to minimize the upgrade path, but major version upgrades can take some adjustments.
4. We expect people to upgrade from 5.2.x to 5.3.x without changing their systems.
Is it clearer why I think PHP 5.x and 6 are different and why I think ICU dependency in the 5.3 core might be a problem?

It is clearer but it is not a problem. New features may introduce new dependencies. Having a dependency on libicu is fine while we introduce intl and other features related to Unicode or i18n. I would agree if we were talking about 5.2.x.

-- Pierre http://blog.thepimp.net | http://www.libgd.org
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
Hi!

It is clearer but it is not a problem. New features may introduce new dependencies. Having a dependency on libicu is fine while we introduce intl and other features related to Unicode or i18n. I would agree if we were talking about 5.2.x.

pecl/intl is an extension, there's no surprise that you need an external library when you enable an extension. However, adding a dependency in core that you cannot get rid of has a lot of consequences (think distributions, builds on non-Linux systems, etc., etc.).

-- Stanislav Malyshev, Zend Software Architect [EMAIL PROTECTED] http://www.zend.com/ (408)253-8829 MSN: [EMAIL PROTECTED]
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
On Mon, Mar 3, 2008 at 8:48 PM, Stanislav Malyshev [EMAIL PROTECTED] wrote:

Hi! It is clearer but it is not a problem. New features may introduce new dependencies. Having a dependency on libicu is fine while we introduce intl and other features related to Unicode or i18n. I would agree if we were talking about 5.2.x.

pecl/intl is an extension, there's no surprise that you need an external library when you enable an extension. However, adding a dependency in core that you cannot get rid of has a lot of consequences (think distributions, builds on non-Linux systems, etc., etc.).

intl (and related changes) is almost the only reason why one will upgrade to 5.3.x. There is no core (as in zend engine) for 95% of our users. There is a PHP release with default features which can be relied on. That's my feeling and experience on this topic. That being said, icu is so common these days, I really don't see a problem with having it as a dep. If we were asking for some esoteric library, I would worry more, obviously :)

-- Pierre http://blog.thepimp.net | http://www.libgd.org
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
On Mon, 3 Mar 2008, Stanislav Malyshev wrote: 4. We expect people to upgrade from 5.2.x to 5.3.x without changing their systems. Is it clearer why I think PHP 5.x and 6 are different and why I think ICU dependency in the 5.3 core might be a problem? FWIW... I also think that bringing in ICU in 5.3 so late in the cycle - or actually at all in 5.3 - is not such a bright idea. regards, Derick -- Derick Rethans http://derickrethans.nl | http://ezcomponents.org | http://xdebug.org
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
Is it clearer why I think PHP 5.x and 6 are different and why I think ICU dependency in the 5.3 core might be a problem? FWIW... I also think that bringing in ICU in 5.3 so late in the cycle - or actually at all in 5.3 - is not such a bright idea. 'so late in the cycle'? We haven't had a beta rc yet. I agree intl should've been moved into core several weeks ago if that helps any... - Steph
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
No one was considering any such move. Having pecl/intl shipped per default as symlinked into ext would be just as optional as --enable-zend-multibyte or --enable-mbstring are right now. This will be more like bringing in zip to 5.2. However it is completely off-topic as it is just one possible course of action while the other is to stick with mbstring.

Intl and mbstring don't share anything like the same functionality...

- Steph
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
Hello Pierre,

Monday, March 3, 2008, 9:31:37 PM, you wrote:

Hi Marcus, On Mon, Mar 3, 2008 at 9:16 PM, Marcus Boerger [EMAIL PROTECTED] wrote: Hello Stanislav, Monday, March 3, 2008, 8:48:38 PM, you wrote: Hi! It is clearer but it is not a problem. New features may introduce new dependencies. Having a dependency on libicu is fine while we introduce intl and other features related to Unicode or i18n. I would agree if we were talking about 5.2.x.

Bad example, it is not symlinked :) And heh, it would be time to give a break with your zip rant, hmmk? =)

Sorry, this wasn't meant at all as a rant. It is just a recent example where a new extension brought in a new dependency. Though you come with a bundled one, so I actually should have looked for a better one.

Best regards, Marcus
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
Hi!

intl (and related changes) is almost the only reason why one will upgrade to 5.3.x. There is no core (as in zend engine) for 95% of our users.

From NEWS:

- Added and improved PHP syntax and semantics:
  . Added NOWDOC. (Gwynne Raskind, Stas, Dmitry)
  . Added ?: operator. (Marcus)
  . Added support for namespaces. (Dmitry, Stas, Gregory)
  . Added support for Late Static Binding. (Dmitry, Etienne Kneuss)
  . Added support for __callstatic() magic method. (Sara)
  . Added support for dynamic access of static members using $foo::myFunc(). (Etienne Kneuss)
  . Improved checks for callbacks. (Marcus)

And that's not counting extension stuff. I of course value a lot the importance given to intl, but 5.3 IMHO is juicier than just intl :)

-- Stanislav Malyshev, Zend Software Architect [EMAIL PROTECTED] http://www.zend.com/ (408)253-8829 MSN: [EMAIL PROTECTED]
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
Hi!

be much easier, switching to re2c promises a much faster lexer. Actually, without any specific re2c optimizations we already get around a 20% scanner

I think 20% faster is very cool. However, as I understand re2c is not a standard tool found everywhere. So what happens if you wanted to use it on some exotic system where re2c is not readily available as maintainer-supported software? Also, flex is available on Windows for example as part of cygwin, while I don't see re2c there. I understand this can be of low importance since we keep generated files in our repositories, but I think we still have to keep it in mind. I also understand the current patch requires a non-release version of re2c - maybe we should have some release version at least until we make PHP depend on it?

Current state: Flex has been fully replaced by re2c in Zend. We have also switched to an mmap-based lexer approach for now. However, we had to drop multibyte support

Were the stream support issues solved?

as well as the encoding declare. The current state can be checked out from Scott's subversion repository [3] and you can follow the development on his Trac setup [4]. When you want to build php with re2c, then you need to grab re2c from its sourceforge subversion repository [5]. You can also check out the changes in a patch created Sunday 2nd March against a PHP checkout from 14th February [6]. Further steps: Commit this to PHP 5.3. Synch to HEAD. Add pecl/intl to 5.3. Discuss/recreate multibyte support with libintl.

Note - pecl/intl does nothing towards multibyte support etc., at least for now. If there are volunteers to change that, it can be discussed, but so far it is for doing entirely other things (locale-dependent functionality mostly). So, I think before the re2c parser can be merged the issue with multibyte compatibility must be solved - otherwise it will make the users that rely on it unable to use newer PHP. As cool as 20% faster is, I think we can't drop support for such a feature, especially not in 5.3.

Commit to 5.3 between the 5th and the 15th of March. Synch to HEAD a couple of days later. Moving pecl/libintl to ext (depends on the 5.3 RMs decision). After that is done, decide about multibyte support. Along with the commit to the 5.3 branch there will be a new re2c version available.

I think we first need to figure out what happens to multibyte support, and not commit anything before we have it figured out. Multibyte support is an important piece of functionality for some PHP users, and it works now. Breaking it without providing any alternative - especially now that we have 5.3 mostly ready for the release cycle, and solving multibyte problems with re2c may take an undefined amount of time, as far as I understand. I do not think it would be acceptable to release 5.3 without multibyte support, so the option here is either to merge it now and have 5.3 waiting until MB is figured out, or try to figure it out before commit and if we can't in a reasonable term, go forward with 5.3 and defer the parser change for 5.4. Again, while I think the speedup is great and congratulate Marcus, Nuno and Scott on great work, I think we should keep in mind we have a working parser right now and changing it in an incompatible way is very high-risk and should not be taken hastily.

-- Stanislav Malyshev, Zend Software Architect [EMAIL PROTECTED] http://www.zend.com/ (408)253-8829 MSN: [EMAIL PROTECTED]
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
Stanislav Malyshev wrote:

Hi! be much easier, switching to re2c promises a much faster lexer. Actually, without any specific re2c optimizations we already get around a 20% scanner

I think 20% faster is very cool. However, as I understand re2c is not a standard tool found everywhere. So what happens if you wanted to use it on some exotic system where re2c is not readily available as maintainer-supported software? Also, flex is available on Windows for example as part of cygwin, while I don't see re2c there.

I don't think this part is a concern since we have required re2c for quite a while now to build many critical parts of PHP. People who actually need to regenerate the parser files are the same people for whom it is trivial to figure out how to install re2c. And yes, it would of course be good to use a released version of re2c, but I think by the time 5.3 is ready to go the version of re2c we need will be out there. Since it is Marcus' baby, he can just push it out, I don't think this is a stumbling block either. Some of the new stuff in re2c was specifically added to make it easier to write a PHP parser, so I don't think backporting to an older version is really an option.

-Rasmus
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
Hello Stanislav,

Sunday, March 2, 2008, 11:47:57 PM, you wrote:

Hi! be much easier, switching to re2c promises a much faster lexer. Actually, without any specific re2c optimizations we already get around a 20% scanner

I think 20% faster is very cool. However, as I understand re2c is not a standard tool found everywhere. So what happens if you wanted to use it on some exotic system where re2c is not readily available as maintainer-supported software? Also, flex is available on Windows for example as part of cygwin, while I don't see re2c there. I understand this can be of low importance since we keep generated files in our repositories, but I think we still have to keep it in mind. I also understand the current patch requires a non-release version of re2c - maybe we should have some release version at least until we make PHP depend on it?

Well, re2c works on a very large number of systems, can easily be built and comes with a ready-to-download Windows executable. Furthermore all major distributions have re2c packages. Along with storing the generated files in cvs I see no issue at all in these regards.

Current state: Flex has been fully replaced by re2c in Zend. We have also switched to an mmap-based lexer approach for now. However, we had to drop multibyte support

Were the stream support issues solved?

We completely dropped multibyte support. The reason is that, the way we were doing it, we constantly switch between the full original and a recoded duplicate that simply ignores multibyte (or any encoding at all). Once we have finished the move to re2c, we can support all of those correctly. The multibyte support also duplicated the encoding tables otherwise available in ext/mbstring or ext/iconv or pecl/intl.

as well as the encoding declare. The current state can be checked out from Scott's subversion repository [3] and you can follow the development on his Trac setup [4]. When you want to build php with re2c, then you need to grab re2c from its sourceforge subversion repository [5]. You can also check out the changes in a patch created Sunday 2nd March against a PHP checkout from 14th February [6]. Further steps: Commit this to PHP 5.3. Synch to HEAD. Add pecl/intl to 5.3. Discuss/recreate multibyte support with libintl.

Note - pecl/intl does nothing towards multibyte support etc., at least for now. If there are volunteers to change that, it can be discussed, but so far it is for doing entirely other things (locale-dependent functionality mostly).

Yes I know. However pecl/intl brings in a php/icu bridge which we can build on.

So, I think before the re2c parser can be merged the issue with multibyte compatibility must be solved - otherwise it will make the users that rely on it unable to use newer PHP. As cool as 20% faster is, I think we can't drop support for such a feature, especially not in 5.3.

Rely on an unsupported, undocumented feature? I am rather able to build php and rewrite that support.

Commit to 5.3 between the 5th and the 15th of March. Synch to HEAD a couple of days later. Moving pecl/libintl to ext (depends on the 5.3 RMs decision). After that is done, decide about multibyte support. Along with the commit to the 5.3 branch there will be a new re2c version available.

I think we first need to figure out what happens to multibyte support, and not commit anything before we have it figured out. Multibyte support is an important piece of functionality for some PHP users, and it works now. Breaking it without providing any alternative - especially now that we have 5.3 mostly ready for the release cycle, and solving multibyte problems with re2c may take an undefined amount of time, as far as I understand. I do not think it would be acceptable to release 5.3 without multibyte support, so the option here is either to merge it now and have 5.3 waiting until MB is figured out, or try to figure it out before commit and if we can't in a reasonable term, go forward with 5.3 and defer the parser change for 5.4. Again, while I think the speedup is great and congratulate Marcus, Nuno and Scott on great work, I think we should keep in mind we have a working parser right now and changing it in an incompatible way is very high-risk and should not be taken hastily.

You are free to contribute and make MB support working upfront.

Best regards, Marcus
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
Hello Rasmus,

Monday, March 3, 2008, 12:25:52 AM, you wrote:

Stanislav Malyshev wrote: Hi! be much easier, switching to re2c promises a much faster lexer. Actually, without any specific re2c optimizations we already get around a 20% scanner

I think 20% faster is very cool. However, as I understand re2c is not a standard tool found everywhere. So what happens if you wanted to use it on some exotic system where re2c is not readily available as maintainer-supported software? Also, flex is available on Windows for example as part of cygwin, while I don't see re2c there.

I don't think this part is a concern since we have required re2c for quite a while now to build many critical parts of PHP. People who actually need to regenerate the parser files are the same people for whom it is trivial to figure out how to install re2c. And yes, it would of course be good to use a released version of re2c, but I think by the time 5.3 is ready to go the version of re2c we need will be out there. Since it is Marcus' baby, he can just push it out, I don't think this is a stumbling block either. Some of the new stuff in re2c was specifically added to make it easier to write a PHP parser, so I don't think backporting to an older version is really an option.

Right. The current re2c development cycle is solely dedicated to being able to rewrite the PHP scanners. I will update re2c whenever necessary during the remaining development cycle and release a new stable release before we release PHP 5.3.

Best regards, Marcus
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
Hi Stan,

On Sun, Mar 2, 2008 at 11:47 PM, Stanislav Malyshev [EMAIL PROTECTED] wrote:

Hi! be much easier, switching to re2c promises a much faster lexer. Actually, without any specific re2c optimizations we already get around a 20% scanner

I think 20% faster is very cool. However, as I understand re2c is not a standard tool found everywhere. So what happens if you wanted to use it on some exotic system where re2c is not readily available as maintainer-supported software? Also, flex is available on Windows for example as part of cygwin, while I don't see re2c there.

A quick note about this non-problem. re2c works pretty well on Windows and they provide a .exe as far as I remember (much easier than flex, which requires cygwin or gnuwin32, even if both work :). Besides the portability of re2c, we already use it in some extensions (if I remember correctly) and nobody complained.

Cheers, -- Pierre http://blog.thepimp.net | http://www.libgd.org
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
Can you clarify the multibyte issues: I presume this means that it can handle ASCII/UTF8/16 etc. but will not handle things like BIG5/GB encoding in source code? This may be a bit of an issue around here.
Regards, Alan
Marcus Boerger wrote:
RFC: REPLACE THE FLEX-BASED SCANNER WITH AN RE2C [1] BASED LEXER
Situation: The current flex-based lexer depends on an outdated and unsupported flex version. Alternatives include either updating to a newer version of flex or using re2c, which we already use for a variety of things (serializing, pdo sql scanning, date/time parsing). While moving towards a newer flex version would be much easier, switching to re2c promises a much faster lexer. Actually, without any specific re2c optimizations we already get around a 20% scanner performance increase. Running the tests gets an overall speedup of 2%. It is arguable whether this is enough, but re2c has more advantages. First of all, re2c allows one to scan any type of input (ASCII, UTF-8, UTF-16, UTF-32). Secondly, it allows for better integration with Lemon [2], which would be the next step. And thirdly, we can switch to a reentrant scanner.
Current state: Flex has been fully replaced by re2c in Zend. We have also switched to an mmap-based lexer approach for now. However, we had to drop multibyte support as well as the encoding declare. The current state can be checked out from Scott's subversion repository [3] and you can follow the development on his Trac setup [4]. If you want to build PHP with re2c, you need to grab re2c from its sourceforge subversion repository [5]. You can also check out the changes in a patch created Sunday 2nd March against a PHP checkout from 14th February [6].
Further steps: Commit this to PHP 5.3. Synch to HEAD. Add pecl/intl to 5.3. Discuss/recreate multibyte support with libintl.
Future steps: Replace bison with lemon in PHP 5.4 or HEAD.
Time Frame: Commit to 5.3 between the 5th and the 15th of March. Synch to HEAD a couple of days later. Moving pecl/libintl to ext (depends on the 5.3 RMs' decision). After that is done, decide about multibyte support. Along with the commit to the 5.3 branch there will be a new re2c version available.
Marcus Boerger, Nuno Lopes, Scott MacVicar
[1] http://re2c.org/
[2] http://www.hwaci.com/sw/lemon/
[3] svn://whisky.macvicar.net/php-re2c
[4] http://trac.macvicar.net/php-re2c/
[5] https://re2c.svn.sourceforge.net/svnroot/re2c/trunk/re2c
[6] http://php.net/~helly/php-re2c-20080302.diff.txt
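To make the concern behind Alan's question concrete, here is a minimal Python sketch (not Zend engine code) of why a scanner that reads source byte by byte can mis-tokenize scripts in encodings whose trail bytes overlap ASCII. It uses the Shift-JIS case mentioned elsewhere in this thread, where U+30BD is encoded as 0x83 followed by 0x5C (the backslash byte). The function names and the lead-byte ranges are assumptions made for illustration only; the ranges roughly approximate Shift-JIS/CP932 and are not taken from zend_multibyte.c.

```python
def naive_quote_end(src: bytes, start: int) -> int:
    """Find the closing single quote by scanning raw bytes.

    `start` is the index of the opening quote. Returns the index of the
    closing quote, or -1 if the string appears unterminated.
    """
    i = start + 1
    while i < len(src):
        if src[i] == 0x5C:      # backslash: skip the escaped byte
            i += 2
            continue
        if src[i] == 0x27:      # single quote: end of string
            return i
        i += 1
    return -1


def sjis_quote_end(src: bytes, start: int) -> int:
    """Same scan, but skip two-byte Shift-JIS characters as a unit.

    The lead-byte ranges below are an illustrative approximation.
    """
    i = start + 1
    while i < len(src):
        b = src[i]
        if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:  # assumed lead bytes
            i += 2              # consume lead byte and trail byte together
            continue
        if b == 0x5C:
            i += 2
            continue
        if b == 0x27:
            return i
        i += 1
    return -1


# echo 'ソ'; with the character encoded in Shift-JIS (0x83 0x5C)
script = b"echo '\x83\x5c';"
print(naive_quote_end(script, 5))  # -1: the trail byte is taken for an escape
print(sjis_quote_end(script, 5))   # 8: the closing quote is found
```

The byte-wise scan treats the trail byte as an escape character and swallows the closing quote, which is the parse error described for the Shift-JIS script. Any replacement for the dropped multibyte support has to make the scanner encoding-aware or convert the input first.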
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
Hi!
Were the stream support issues solved?
We completely dropped multibyte support. The reason is that the way we were
I wasn't asking about multibyte (which we discuss below), but about other streams - I think I mentioned it on IRC the last time the re2c parser was discussed. I remember re2c used mmap, and not all files PHP can run can be mmap'ed. Was it fixed?
Once we have finished the move to re2c, we can support all of those correctly. The multibyte support also duplicated the encoding tables otherwise available in ext/mbstring or ext/iconv or pecl/intl.
pecl/intl per se doesn't have any encoding tables. ICU does, but that would mean you have to have ICU to run PHP. While that might not be a big problem, since ICU is supported by IBM (read: good chance more exotic systems would have support), it is still a dependency on a non-bundled 3rd party library in PHP 5 core. Of course, PHP 6 has this dependency, but we might want to not have such things in 5.x so that you won't have to change your system too much while staying on 5.x.
Rely on a not supported undocumented feature? I am rather able to build php and rewrite that support.
Being undocumented is nothing to be proud of; however, as poorly documented as it is, it is used. I'm all for implementing it in a better way - and having a new parser is a good time to do it. That's exactly the reason we shouldn't rush with it but do it right this time. There's no burning need to have a new parser right now, so we can take a moment to think - OK, how do we want multibyte support there to work? And if we need some modifications, we'd have the time and flexibility to make them, without having the code in 5.3, which was supposed to go to RC in Q1 (ending 1 month from now).
You are free to contribute and make MB support working upfront.
I know I'm free :) However, as much as I understand the eagerness to have it in the source tree, I repeat that I do not think dropping multibyte support in 5.3 is acceptable. Thus, if it is committed right now, 5.3 would have to be deferred until this is resolved. If this is resolved in time for 5.3 - great. If not, we'd better get it right in 5.4 than wrong in 5.3.
-- Stanislav Malyshev, Zend Software Architect http://www.zend.com/ (408)253-8829
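The stream question above (not every file PHP can run can be mmap'ed) can be illustrated with a short Python sketch. It is a sketch only and does not show how the re2c-based scanner handles or will handle this. `read_script` is a hypothetical helper that tries to map the input and falls back to plain reads when mapping is not possible, for example when the script arrives through a pipe.

```python
import mmap
import os


def read_script(fd: int) -> bytes:
    """Return the full contents behind `fd`.

    Tries mmap first; falls back to read() for inputs that cannot be
    mapped (pipes, sockets, empty files).
    """
    try:
        with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[:]
    except (OSError, ValueError):
        # mmap rejected the descriptor: read it as an ordinary stream.
        chunks = []
        while True:
            chunk = os.read(fd, 65536)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)


if __name__ == "__main__":
    # A pipe cannot be mapped, so this exercises the fallback path.
    r, w = os.pipe()
    os.write(w, b"<?php echo 1;")
    os.close(w)
    print(read_script(r))
    os.close(r)
```

The trade-off is that the fallback buffers the whole script in memory, which keeps the scanner's view of its input identical for both paths at the cost of a copy.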
Re: [PHP-DEV] [RFC] Replace the flex-based scanner with an re2c [1] based lexer
I don't think this part is a concern since we have required re2c for quite a while now to build many critical parts of PHP. People who
OK, great then - the only issue remaining is the multibyte support.
-- Stanislav Malyshev, Zend Software Architect http://www.zend.com/ (408)253-8829