Re: [PHP-DEV] Why are serialized strings wrapped in double quotes? (s::"")
On Tue, Feb 6, 2024, at 21:19, Sanford Whiteman wrote:
> I'd like a little background on something we've long accepted: why
> does the serialization format need double quotes around a string, even
> though the byte length is explicit?

> Instead we need to be aware of the leading and trailing " in our state
> machine but I'm not sure what the advantage is.

Dunno why, but it has made my life much easier. I've seen many situations
where serialized data was converted from CP1252 to UTF-8. Then the byte
length of the string no longer matches the declared length, and
unserialization leads to an error condition. Without the quotes, possibly
many cases would go undetected.

> Was this just to make strings look more 'stringy', even though the
> format isn't meant to be human-readable?

In my mind the format is a nice pragmatic middle between reasonably
efficient, reasonably robust, too feature-complete (too many
allowed_classes) and somewhat human readable. At least enough for
incidental debugging or manual tinkering.
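A minimal sketch of the failure mode described in this message, written in Python only to make the byte handling explicit. The helper name closing_quote_ok is hypothetical (it is not part of PHP or any library), and the check assumes a payload that holds exactly one string token. It shows that after a naive CP1252 to UTF-8 conversion the closing `";` is no longer where the declared length says it should be.

```python
def closing_quote_ok(payload: bytes) -> bool:
    """Check one s:<len>:"<bytes>"; token: is '";' found right after <len> bytes?"""
    head, sep, rest = payload.partition(b':"')
    if not sep or not head.startswith(b"s:"):
        return False
    length = int(head[2:])               # declared byte length
    return rest[length:length + 2] == b'";'


text = "café"
cp1252_bytes = text.encode("cp1252")     # 4 bytes: 'é' is a single byte
original = b's:%d:"%s";' % (len(cp1252_bytes), cp1252_bytes)

# Transcode the whole payload without touching the length prefix:
# the prefix still says 4, but the string now occupies 5 bytes.
converted = original.decode("cp1252").encode("utf-8")

print(closing_quote_ok(original))
print(closing_quote_ok(converted))
```

The first check passes and the second fails, because skipping 4 bytes lands in the middle of the two-byte UTF-8 sequence for 'é' rather than on the closing quote.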
Re: [PHP-DEV] Why are serialized strings wrapped in double quotes? (s::"")
On Fri, Feb 9, 2024 at 8:13 AM Michał Marcin Brzuchalski wrote:
>
> On Thu, Feb 8, 2024 at 20:10, Sanford Whiteman wrote:
>
> > Hi Michał,
> >
> > Thursday, February 8, 2024, 2:58:52 AM, you wrote:
> > ...
> > >O3:Foo:5{s4:date;O17:DateTimeImmutable:3{s4:date;s26:2024-02-08
> > >08:41:10.009742;s13:timezone_type;i:3;s8:timezone;s16:Europe/Amsterdam}s6:*foo;s11:Foo
> > >bar
> > >baz;s8:Foobar;i:123456789;s3:tbl;a4{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s3:baz}s8:*color;E12:Color:Yellow}
> > >
> > >This is still readable by humans and keep the size/length in all places
> > >where needed.
> >
> > Amazing. To my eyes it's more readable too.
>
> Just wondering: while null is encoded as just N, the booleans are encoded
> with b:0 or b:1. I can imagine these could also be just T and F.
>
> > Here's another one: leading numeral *implies* Integer 'i' (so only
> > 'd', 'b' and 's' are necessary). Or maybe that goes too far.
>
> I got there too (you can spot it in the very first link), but I also
> believe this goes too far.
>
> Everything above already goes far beyond what you initially asked, and I
> know that. I just like to share what I find.
>
> Cheers,
> Michał Marcin Brzuchalski

If I recall correctly, there is also a `\0` (null character) hiding in
the serialized string. It's incredibly annoying when copy/pasting, as
sometimes it gets stripped out. It might be worth removing as well.

Robert Landers
Software Engineer
Utrecht NL

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
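For context on where the NUL bytes come from: serialize() stores the names of non-public properties in a mangled form, NUL * NUL name for protected and NUL ClassName NUL name for private. That matches the lengths in the compact example quoted above, where the visible `*foo` is declared as 6 bytes and `Foobar` as 8. A small Python sketch of the rule follows; mangled_name is a hypothetical helper for illustration, not a PHP or library function.

```python
def mangled_name(prop: str, visibility: str, class_name: str = "") -> bytes:
    """Return the property name as it appears inside a serialized object."""
    name = prop.encode("utf-8")
    if visibility == "public":
        return name
    if visibility == "protected":
        return b"\0*\0" + name
    if visibility == "private":
        return b"\0" + class_name.encode("utf-8") + b"\0" + name
    raise ValueError("unknown visibility: %r" % visibility)


for prop, vis, cls in [("foo", "protected", ""),
                       ("bar", "private", "Foo"),
                       ("color", "protected", "")]:
    raw = mangled_name(prop, vis, cls)
    # Dropping the NULs gives what a terminal or a copy/paste usually shows.
    visible = raw.replace(b"\0", b"").decode("utf-8")
    print(visible, len(raw))
```

Because the declared length counts the NUL bytes, a copy/paste that strips them leaves every affected length prefix off by two, which is why such payloads stop unserializing.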
Re: [PHP-DEV] Why are serialized strings wrapped in double quotes? (s::"")
On Thu, Feb 8, 2024 at 20:10, Sanford Whiteman wrote:

> Hi Michał,
>
> Thursday, February 8, 2024, 2:58:52 AM, you wrote:
> ...
> >O3:Foo:5{s4:date;O17:DateTimeImmutable:3{s4:date;s26:2024-02-08
> >08:41:10.009742;s13:timezone_type;i:3;s8:timezone;s16:Europe/Amsterdam}s6:*foo;s11:Foo
> >bar
> >baz;s8:Foobar;i:123456789;s3:tbl;a4{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s3:baz}s8:*color;E12:Color:Yellow}
> >
> >This is still readable by humans and keep the size/length in all places
> >where needed.
>
> Amazing. To my eyes it's more readable too.

Just wondering: while null is encoded as just N, the booleans are encoded
with b:0 or b:1. I can imagine these could also be just T and F.

> Here's another one: leading numeral *implies* Integer 'i' (so only
> 'd', 'b' and 's' are necessary). Or maybe that goes too far.

I got there too (you can spot it in the very first link), but I also
believe this goes too far.

Everything above already goes far beyond what you initially asked, and I
know that. I just like to share what I find.

Cheers,
Michał Marcin Brzuchalski
Re: [PHP-DEV] Why are serialized strings wrapped in double quotes? (s::"")
Hi Michał,

Thursday, February 8, 2024, 2:58:52 AM, you wrote:

>You inspired me to play with serialization format to spot even more
>unnecessary chars https://3v4l.org/DLh1U
>From my PoV there are more candidates to reduce and still keep the safety,
>for eg:
>removing leading ':' before array/object and trailing ';' inside brackets,
>you reduce by 2 bytes
>
>a:4:{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s:3:"baz";}
>
>Could be simply
>
>a:4{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s:3:baz}
>
>This example saves 4 bytes: double-quotes, one ; and :
>
>If you go further all types that require size/length also don't need extra
>double-colon meaning:
>a:4 could become a4
>s:3 could become s3
>
>The same could apply to O: and E:
>
>O3:Foo:5{s4:date;O17:DateTimeImmutable:3{s4:date;s26:2024-02-08
>08:41:10.009742;s13:timezone_type;i:3;s8:timezone;s16:Europe/Amsterdam}s6:*foo;s11:Foo
>bar
>baz;s8:Foobar;i:123456789;s3:tbl;a4{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s3:baz}s8:*color;E12:Color:Yellow}
>
>This is still readable by humans and keep the size/length in all places
>where needed.

Amazing. To my eyes it's more readable too.

Here's another one: leading numeral *implies* Integer 'i' (so only
'd', 'b' and 's' are necessary). Or maybe that goes too far.

>Interestingly when an array is serialized as object property it is not
>followed by ; in field list https://3v4l.org/4p6ve
>
>O:3:"Foo":2:{s:3:"foo";a:3:{i:0;i:1;i:1;i:2;i:2;i:3;}s:3:"bar";s:3:"baz";}
>
>Missing ; between }s was a surprise to me.

Yeah, that almost seems like a bug that unserialize() tolerates.

-- S.
Re: [PHP-DEV] Why are serialized strings wrapped in double quotes? (s::"")
Hi Sandy,

On Tue, Feb 6, 2024 at 21:19, Sanford Whiteman wrote:

> Howdy all, haven't posted in ages but good to see the list going strong.
>
> I'd like a little background on something we've long accepted: why
> does the serialization format need double quotes around a string, even
> though the byte length is explicit?
>
> Example:
>
> s:5:"hello";
>
> All else being equal I would think we could have just
>
> s:5:hello;
>
> and skip forward 5 bytes. Instead we need to be aware of the leading
> and trailing " in our state machine but I'm not sure what the
> advantage is.

You inspired me to play with the serialization format to spot even more
unnecessary chars: https://3v4l.org/DLh1U

From my PoV there are more candidates to reduce while still keeping the
safety, e.g. removing the ':' before the '{' of an array/object and the
trailing ';' before the '}' saves 2 bytes.

a:4:{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s:3:"baz";}

could be simply

a:4{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s:3:baz}

This example saves 4 bytes: the two double quotes, one ; and one :

If you go further, the types that require a size/length also don't need
the extra colon after the type letter, meaning:
a:4 could become a4
s:3 could become s3

The same could apply to O: and E:

O3:Foo:5{s4:date;O17:DateTimeImmutable:3{s4:date;s26:2024-02-08 08:41:10.009742;s13:timezone_type;i:3;s8:timezone;s16:Europe/Amsterdam}s6:*foo;s11:Foo bar baz;s8:Foobar;i:123456789;s3:tbl;a4{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s3:baz}s8:*color;E12:Color:Yellow}

This is still readable by humans and keeps the size/length in all places
where needed. My attached example is poor but shows up to ~20% size
reduction.

Interestingly, when an array is serialized as an object property it is
not followed by ; in the field list: https://3v4l.org/4p6ve

O:3:"Foo":2:{s:3:"foo";a:3:{i:0;i:1;i:1;i:2;i:2;i:3;}s:3:"bar";s:3:"baz";}

The missing ; between } and s was a surprise to me.

Best regards,
Michał Marcin Brzuchalski
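The byte counts in this message can be checked directly. A short Python sketch follows, using only the array strings that appear in the message; the variable names are made up for illustration, and the strings are pure ASCII, so character count equals byte count.

```python
current = 'a:4:{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s:3:"baz";}'
no_quotes = 'a:4{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s:3:baz}'
compact = 'a4{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s3:baz}'

# Step 1: drop ':' before '{', the quotes, and the ';' before '}'.
saved_step1 = len(current) - len(no_quotes)
# Step 2: additionally drop the ':' after the 'a' and 's' type letters.
saved_total = len(current) - len(compact)

print(len(current), saved_step1, saved_total)
print("%.1f%% smaller" % (100.0 * saved_total / len(current)))
```

For this small array the first step saves the 4 bytes mentioned in the message, and the fully compact form saves 6 of 48 bytes. The relative saving depends on how many strings and nested containers a payload holds, so it will differ for other data.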
Re: [PHP-DEV] Why are serialized strings wrapped in double quotes? (s::"")
Nice work, Jim.

>I enjoy spelunking in the history of the project, so I did some digging. It
>looks to me like Kris didn't quite get the history correct. Boris did propose
>a form of serialization first, but it looks like what became serialize() and
>unserialize() came into the project another way.
>
>https://marc.info/?l=php-general&m=90222513234434&w=2
>
>The serialize() and unserialize() functions were first added in PHP 3.0.5
>with that same encoding for strings that you're asking about. Here is the
>original proposal for adding the functions from Jani Lehtimäki:
>
>https://news-web.php.net/php.dev/1444
>
>They were originally conceived as var_save() and var_load() and operated on
>files, but you can see the file format uses the same string encoding, although
>it used single quotes.
>
>It was committed to CVS by Stig here, but unfortunately the emails to the list
>didn't include newly-added files.
>
>https://news-web.php.net/php.dev/1540

Huh. So the quotes may have just stuck around from eval()-related
approaches without being officially discussed.

In the grand scheme, even if you're wasting 2 bytes for every string,
that could be a tiny % on average.

The format's fascinating because it unmistakably *works*, and binary
igbinary/msgpack aside, it's a pretty good byte-stream encoding. If
you take Sergey's results it's way faster than JSON, at least when
it's PHP doing the unserialization:

https://grechin.org/2021/04/06/php-json-encode-vs-serialize-performance-comparison.html

-- S.
Re: [PHP-DEV] Why are serialized strings wrapped in double quotes? (s::"")
On Tue, Feb 6, 2024, at 12:19 PM, Sanford Whiteman wrote:
> Howdy all, haven't posted in ages but good to see the list going strong.
>
> I'd like a little background on something we've long accepted: why
> does the serialization format need double quotes around a string, even
> though the byte length is explicit?

I enjoy spelunking in the history of the project, so I did some digging. It
looks to me like Kris didn't quite get the history correct. Boris did propose
a form of serialization first, but it looks like what became serialize() and
unserialize() came into the project another way.

https://marc.info/?l=php-general&m=90222513234434&w=2

The serialize() and unserialize() functions were first added in PHP 3.0.5
with that same encoding for strings that you're asking about. Here is the
original proposal for adding the functions from Jani Lehtimäki:

https://news-web.php.net/php.dev/1444

They were originally conceived as var_save() and var_load() and operated on
files, but you can see the file format uses the same string encoding, although
it used single quotes.

It was committed to CVS by Stig here, but unfortunately the emails to the list
didn't include newly-added files.

https://news-web.php.net/php.dev/1540

I'm not sure if the old CVS history is preserved somewhere, but based on what
appeared in 3.0.5, that format probably goes back to the beginning and it
doesn't look like there was any on-list discussion about it.

Jim
Re: [PHP-DEV] Why are serialized strings wrapped in double quotes? (s::"")
> I don't have the historical context, but I'm assuming that's it. PHP's
> serialization format is not efficient, and I don't think that was ever
> the primary focus.

Thanks Ilija. That'll have to suffice unless someone remembers a
specific decision (searching all the old Internals posts, nothing came
up). Most of my readers are pretty junior but I hate to say something
that conflicts with their intuition.

-- S.

On Wed, Feb 7, 2024 at 7:28 AM Ilija Tovilo wrote:
>
> Hi Sandy
>
> On Tue, Feb 6, 2024 at 9:19 PM Sanford Whiteman wrote:
> >
> > I'd like a little background on something we've long accepted: why
> > does the serialization format need double quotes around a string, even
> > though the byte length is explicit?
> >
> > Example:
> >
> > s:5:"hello";
> >
> > All else being equal I would think we could have just
> >
> > s:5:hello;
> >
> > Was this just to make strings look more 'stringy', even though the
> > format isn't meant to be human-readable?
>
> I don't have the historical context, but I'm assuming that's it. PHP's
> serialization format is not efficient, and I don't think that was ever
> the primary focus. If you need something more efficient, you can try
> https://github.com/igbinary/igbinary which is aimed to be a drop-in
> replacement.
>
> Ilija
Re: [PHP-DEV] Why are serialized strings wrapped in double quotes? (s::"")
Hi Sandy

On Tue, Feb 6, 2024 at 9:19 PM Sanford Whiteman wrote:
>
> I'd like a little background on something we've long accepted: why
> does the serialization format need double quotes around a string, even
> though the byte length is explicit?
>
> Example:
>
> s:5:"hello";
>
> All else being equal I would think we could have just
>
> s:5:hello;
>
> Was this just to make strings look more 'stringy', even though the
> format isn't meant to be human-readable?

I don't have the historical context, but I'm assuming that's it. PHP's
serialization format is not efficient, and I don't think that was ever
the primary focus. If you need something more efficient, you can try
https://github.com/igbinary/igbinary which aims to be a drop-in
replacement.

Ilija
[PHP-DEV] Why are serialized strings wrapped in double quotes? (s::"")
Howdy all, haven't posted in ages but good to see the list going strong.

I'd like a little background on something we've long accepted: why
does the serialization format need double quotes around a string, even
though the byte length is explicit?

Example:

s:5:"hello";

All else being equal I would think we could have just

s:5:hello;

and skip forward 5 bytes. Instead we need to be aware of the leading
and trailing " in our state machine but I'm not sure what the
advantage is.

Was this just to make strings look more 'stringy', even though the
format isn't meant to be human-readable?

I read (the archive of) Kris's blog post:

https://web.archive.org/web/20170813190508/http://blog.koehntopp.info/index.php/2407-php-understanding-unserialize/

but that didn't shed any light. Zigzagging through the source wasn't
getting me there as fast as asking someone who was there from the
beginning.

The reason for my question is I'm writing a blog post about a SaaS app
that (don't gasp/laugh) returns serialize() format from one of its
APIs. In discussing why the PHP format can make sense vs. JSON, I
wanted to point to the faster parsing you get with length-prefixed
strings. Then I started wondering about why we have both the length
prefix and the extra quotes.

Thanks,

Sandy
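The "skip forward N bytes" parsing described in this message can be sketched as below. This is a minimal Python illustration of reading a single string token, not PHP's actual unserialize() implementation; read_string is a hypothetical name. The quotes show up as two extra checks, one before and one after the jump.

```python
from typing import Tuple


def read_string(data: bytes, pos: int = 0) -> Tuple[bytes, int]:
    """Read one s:<len>:"<bytes>"; token at pos; return (value, next_pos)."""
    if data[pos:pos + 2] != b"s:":
        raise ValueError("expected 's:' at offset %d" % pos)
    colon = data.index(b":", pos + 2)      # end of the length digits
    length = int(data[pos + 2:colon])      # declared byte length
    if data[colon + 1:colon + 2] != b'"':
        raise ValueError("expected opening quote at offset %d" % (colon + 1))
    start = colon + 2
    end = start + length                   # jump ahead; the content is never scanned
    if data[end:end + 2] != b'";':
        raise ValueError("expected closing quote and ';' at offset %d" % end)
    return data[start:end], end + 2


value, next_pos = read_string(b's:5:"hello";')
# Quotes and semicolons inside the value need no escaping,
# because the length prefix says where the value ends.
tricky, _ = read_string(b's:7:"a";"b;c";')
print(value, next_pos)
print(tricky)
```

In this sketch the quotes contribute nothing to finding the end of the string; they only act as a consistency check on the declared length, which is the property discussed elsewhere in the thread.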