Re: [PHP-DEV] [RFC] Bundling ext/simdjson into core
Hey Máté, On Fri, Jan 8, 2021 at 1:20 PM Máté Kocsis wrote: > In the meanwhile I also realized that there are just too many unknowns > currently with this extension, so I cancelled my initial plans of adding > ext/simdjson as a core extension. > Since the majority of issues revolve around uncertainty/compatibility, is there perhaps a way for using the extension with an "overload `json_encode()` and `json_decode()` mode" (remember mbstring?), so that brave bleeding edge users can experiment with it as an opt-in replacement? That would certainly allow for some more precise field testing over the next few years. Marco Pivetta http://twitter.com/Ocramius http://ocramius.github.com/
Re: [PHP-DEV] [RFC] Bundling ext/simdjson into core
Hi Jakub and Nikita, Thank you very much for your elaborate answers, you made good points. In the meanwhile I also realized that there are just too many unknowns currently with this extension, so I cancelled my initial plans of adding ext/simdjson as a core extension. I also agree that using the simdjson library as a backend would be a very interesting thing, so I'll try to experiment with this approach first. Slight incompatibilities between the backends would be very problematic for sure, so if we go this way then the implementations should be exercised by a definitive test suite like the one Nikita linked. Máté
Re: [PHP-DEV] [RFC] Bundling ext/simdjson into core
On Tue, Dec 29, 2020 at 5:58 PM Máté Kocsis wrote: > Hi Internals, > > I think this will be my last proposal for quite some while :) > But this time, I'd like to propose bundling the > https://github.com/crazyxman/simdjson_php extension > with some major modifications. > > The proposed OO API is included in the description of the > PR that I've just created: https://github.com/php/php-src/pull/6551 > > The main motivation behind this RFC is two-fold: > - the underlying simdjson library (https://github.com/simdjson/simdjson) > which is used by ext/simdjson provides huge performance gains > compared to ext/json (see some benchmark results in the PR) > - we can support new use-cases, most notably the so called "on-demand" > parsing: https://github.com/simdjson/simdjson/blob/master/doc/ondemand.md > (This is not implemented currently) > > Originally, I planned to include the new API in ext/json, but > unfortunately, > simdjson is written is C++, so it would make C++ as a hard dependency, > which was not the case so far. That's why I opted for creating > ext/simdjson. > > Please let me know if you have any feedback. > > Regards: > Máté > Same as the others, I don't think it makes sense to bundle this at this point in time -- though I'm also pretty skeptical about bundling it at all. The end result would be that we have two JSON APIs, one that is always available but slower, and another that is optional but faster. That is really not great. Would make a bit more sense if simdjson was just a backend that could be optionally enabled to speed up the normal JSON API. I'd also be concerned about compatibility -- JSON parsing is a bit of a minefield and many parsers have subtle differences. I remember this comparison page http://seriot.ch/json/pruned_results.png which showed that PHP 7 (which is when we switched to jsond) had one of the only fully conforming JSON implementations. It doesn't seem unlikely that simdjson will have differences somewhere, and it would be unfortunate if there were two JSON APIs with subtly different behavior. I think the main value proposition (even though it is not part of your initial proposal) here would be a streaming JSON API, that does not require parsing or even loading the whole JSON document at once. I think the addition of such an API would be valuable -- but wouldn't it be possible to introduce this based on our existing JSON parser? Nikita
Re: [PHP-DEV] [RFC] Bundling ext/simdjson into core
Hi, On Thu, Dec 31, 2020 at 1:30 AM Máté Kocsis wrote: > Hi Remi and Jakub, > >> > I agree it's too early as the library is young and won't be available in >>> many distros. The PECL path is better in this case IMO as it will allow >>> some time . >> >> > In my opinion, this is a case where making an > exception is worth considering. > > The PECL path doesn't mean that the extension won't be used. From the user point of view, it doesn't really matter that much as distro packages are available even for PECL extensions and it can easily added to Docker image as well. The advantage is that it will give some time for the library to be available in distros and possibly more stable. I think it would actually help with stabilizing the API and introducing new features quickly. It means all the mentioned features could be provided to users in the extension release cycle and not waiting a year for new PHP version. > Should the simdjson library be written in C, > I'd propose to add the new API + parser to > ext/json directly, since ext/simdjson is just a > very small wrapper around the parser, and > not a complex piece of code in itself (compared to other parts of php-src). > > I think this wouldn't be really an option even if it was in C because ext/json is enabled by default so you couldn't have external dependency on such a new library. Otherwise it would mean that you couldn't install PHP if that library is not available. Of course unless the library is bundled but that is not a good idea for the maintanance. > Also, I think the performance benefit of using > the simdjson parser is so major that it would > be a pity if people had to wait for years until > the functionality becomes generally available > as a core extension. As json_encode() and > json_decode() are very easy to use, my guess > is that a 3rd party JSON-related extension > would never get an adoption large enough, > because only those people would install it > who have really reached the limitations of ext/json. > > But this proposal is not about changing json_decode. It introduces a new API that can be in the same way introduce in the PECL extension. As I said above, having that in PECL doesn't mean that it's not available and people have to wait for it. It would be great to actually see what the performance benefit is in the real applications. The benchmarks relies on repeated calls which is not always the way how it's in the application (e.g. due to processor cache). Also it might not be such a huge perf increase for the most apps as the actual parsing is not usually the app bottleneck. That said I think it will bring some considerable improvements anyway in the apps especially in those doing lots of parsing but would be great to see how much it is in reality. > By the way, it has just come to my mind that > our company is also affected by these > limitations. Sometimes we have to parse > very large JSON documents, and in some > cases these can end up being truncated. > Fortunately we only need a specific part of > the data, so someone wrote a partial "parser" > (this is euphemism) tailored for the schema > in question. Rather than having to use > custom hackery, it would be so much better > if PHP would offer partial parsing out of the > box, like what the proposed > JsonParser::getKeyValue() does. > As you mentiened, this could be possible in ondemand API which looks really useful indeed. There are more things that are pretty useful like JSON Pointer, better error reporting and UTF8 validation that could be potentially also re-used in encoder. I think it would be great to have at least some of the features in the extension before it gets to the core. Especially thinking about the error reporting which should no longer depend on global state. One note about the proposed API. As it's not part of the ext/json, it shouldn't be called JsonParser but rather SimdJsonParser to reflect that it's part of the simdjson extension. That's the convention that is used for other exts and it's also less confusing for users because that class won't be available for many users initially - at least until the library is available or extension installed / enabled. The methods also shouldn't be all static but rather instance should be provided that would allow getting errors or using the ondemand mode. > That said, the cost-benefit ratio of having > simdjson in core seems advantageous for me. > > Was thinking that it would be good to consider some kind of plugable >> decoder where another extension could register a parsing callback. >> Something similar to what we have for parser but instead for the whole >> decoding. That would allow to still use current parser in json_decode but >> if simdjson available / configured in ini, then it would used instead and >> would be just faster. Not sure if all options are supported though - for >> example don't see any note about UTF8 substitution >> (JSON_INVALID_UTF8_SUBSTITUTE). >> > > This is a very interesting
Re: [PHP-DEV] [RFC] Bundling ext/simdjson into core
Sorry, I need to add a small correction for my previous email: JsonParser::getKeyValue() wouldn't allow parsing a truncated JSON document. The on-demand parsing API I mentioned in my opening message is what would make such use-case possible. Besides, I accept if you feel it's too early for a new JSON API to be added to the core, so if the majority of stakeholders also share this opinion, I'll try to make an improved version of ext/simdjson available via PECL. Máté
Re: [PHP-DEV] [RFC] Bundling ext/simdjson into core
Hi Remi and Jakub, > I agree it's too early as the library is young and won't be available in >> many distros. The PECL path is better in this case IMO as it will allow >> some time . > > In my opinion, this is a case where making an exception is worth considering. Should the simdjson library be written in C, I'd propose to add the new API + parser to ext/json directly, since ext/simdjson is just a very small wrapper around the parser, and not a complex piece of code in itself (compared to other parts of php-src). Also, I think the performance benefit of using the simdjson parser is so major that it would be a pity if people had to wait for years until the functionality becomes generally available as a core extension. As json_encode() and json_decode() are very easy to use, my guess is that a 3rd party JSON-related extension would never get an adoption large enough, because only those people would install it who have really reached the limitations of ext/json. By the way, it has just come to my mind that our company is also affected by these limitations. Sometimes we have to parse very large JSON documents, and in some cases these can end up being truncated. Fortunately we only need a specific part of the data, so someone wrote a partial "parser" (this is euphemism) tailored for the schema in question. Rather than having to use custom hackery, it would be so much better if PHP would offer partial parsing out of the box, like what the proposed JsonParser::getKeyValue() does. That said, the cost-benefit ratio of having simdjson in core seems advantageous for me. Was thinking that it would be good to consider some kind of plugable > decoder where another extension could register a parsing callback. > Something similar to what we have for parser but instead for the whole > decoding. That would allow to still use current parser in json_decode but > if simdjson available / configured in ini, then it would used instead and > would be just faster. Not sure if all options are supported though - for > example don't see any note about UTF8 substitution > (JSON_INVALID_UTF8_SUBSTITUTE). > This is a very interesting approach, and it reminds me about the hashing registry RFC in certain extent. And you are right, as far as I saw, these flags are not supported by ext/simdjson. But to be honest, I haven't analyzed yet how difficult it would be to have a reasonably full compatibility with ext/json. Cheers: Máté
Re: [PHP-DEV] [RFC] Bundling ext/simdjson into core
On Wed, Dec 30, 2020 at 6:52 AM Remi Collet wrote: > Le 29/12/2020 à 17:57, Máté Kocsis a écrit : > > Hi Internals, > > > > I think this will be my last proposal for quite some while :) > > But this time, I'd like to propose bundling the > > https://github.com/crazyxman/simdjson_php extension > > with some major modifications. > > > > The proposed OO API is included in the description of the > > PR that I've just created: https://github.com/php/php-src/pull/6551 > > > > The main motivation behind this RFC is two-fold: > > - the underlying simdjson library (https://github.com/simdjson/simdjson) > > which is used by ext/simdjson provides huge performance gains > > compared to ext/json (see some benchmark results in the PR) > > - we can support new use-cases, most notably the so called "on-demand" > > parsing: > https://github.com/simdjson/simdjson/blob/master/doc/ondemand.md > > (This is not implemented currently) > > > > Originally, I planned to include the new API in ext/json, but > unfortunately, > > simdjson is written is C++, so it would make C++ as a hard dependency, > > which was not the case so far. That's why I opted for creating > ext/simdjson. > > > > Please let me know if you have any feedback. > > the library seems young and very active, and so probably not available > in most distributions. > > I would prefer to use the common way > > - propose as pecl extension > - wait for maturity (extension AND library) > - propose for merge in php-src > > IMHO: too early for this one. > > I agree it's too early as the library is too young and won't be available in many distros. The PECL path is better in this case IMO as it will allow some time . That said the simdjson lib looks pretty impressive. Definitely looks more advanced than the ext json parser that I wrote more than 6 years ago. In my defese, I had no grant for that. :) Anyway really cool to see what can be done and how much it's possible to speed it up. Was thinking that it would be good to consider some kind of plugable decoder where another extension could register a parsing callback. Something similar to what we have for parser but instead for the whole decoding. That would allow to still use current parser in json_decode but if simdjson available / configured in ini, then it would used instead and would be just faster. Not sure if all options are supported though - for example don't see any note about UTF8 substitution (JSON_INVALID_UTF8_SUBSTITUTE). Cheers Jakub
Re: [PHP-DEV] [RFC] Bundling ext/simdjson into core
Le 29/12/2020 à 17:57, Máté Kocsis a écrit : Hi Internals, I think this will be my last proposal for quite some while :) But this time, I'd like to propose bundling the https://github.com/crazyxman/simdjson_php extension with some major modifications. The proposed OO API is included in the description of the PR that I've just created: https://github.com/php/php-src/pull/6551 The main motivation behind this RFC is two-fold: - the underlying simdjson library (https://github.com/simdjson/simdjson) which is used by ext/simdjson provides huge performance gains compared to ext/json (see some benchmark results in the PR) - we can support new use-cases, most notably the so called "on-demand" parsing: https://github.com/simdjson/simdjson/blob/master/doc/ondemand.md (This is not implemented currently) Originally, I planned to include the new API in ext/json, but unfortunately, simdjson is written is C++, so it would make C++ as a hard dependency, which was not the case so far. That's why I opted for creating ext/simdjson. Please let me know if you have any feedback. the library seems young and very active, and so probably not available in most distributions. I would prefer to use the common way - propose as pecl extension - wait for maturity (extension AND library) - propose for merge in php-src IMHO: too early for this one. Remi Regards: Máté -- PHP Internals - PHP Runtime Development Mailing List To unsubscribe, visit: https://www.php.net/unsub.php
[PHP-DEV] [RFC] Bundling ext/simdjson into core
Hi Internals, I think this will be my last proposal for quite some while :) But this time, I'd like to propose bundling the https://github.com/crazyxman/simdjson_php extension with some major modifications. The proposed OO API is included in the description of the PR that I've just created: https://github.com/php/php-src/pull/6551 The main motivation behind this RFC is two-fold: - the underlying simdjson library (https://github.com/simdjson/simdjson) which is used by ext/simdjson provides huge performance gains compared to ext/json (see some benchmark results in the PR) - we can support new use-cases, most notably the so called "on-demand" parsing: https://github.com/simdjson/simdjson/blob/master/doc/ondemand.md (This is not implemented currently) Originally, I planned to include the new API in ext/json, but unfortunately, simdjson is written is C++, so it would make C++ as a hard dependency, which was not the case so far. That's why I opted for creating ext/simdjson. Please let me know if you have any feedback. Regards: Máté