Re: [PHP-DEV] Why are serialized strings wrapped in double quotes? (s::"")

2024-02-09 Thread Casper Langemeijer
On Tue, Feb 6, 2024, at 21:19, Sanford Whiteman wrote:
> I'd like a little background on something we've long accepted: why
> does the serialization format need double quotes around a string, even
> though the byte length is explicit?

> Instead we need to be aware of the leading and trailing " in our state
> machine but I'm not sure what the advantage is.

Dunno why, but it has made my life much easier. I've seen many situations where
serialized data was converted from CP1252 to UTF-8. Then the byte length of a
string no longer matches its length prefix and unserialization leads to an
error condition. Without the quotes, many such cases would possibly go undetected.
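
A minimal sketch of that failure mode, assuming a single "é" stored as the CP1252 byte 0xE9. The charset conversion is simulated with str_replace() rather than a real conversion library, so treat it as an illustration only:

```php
<?php
// "café" with é as the single CP1252 byte 0xE9: 4 bytes in total.
$cp1252     = "caf\xE9";
$serialized = serialize($cp1252);          // s:4:"caf\xE9";

// Simulated CP1252 -> UTF-8 conversion of the stored blob:
// 0xE9 becomes the two bytes 0xC3 0xA9, but the length prefix still says 4.
$converted = str_replace("\xE9", "\xC3\xA9", $serialized);

// unserialize() reads 4 bytes, then expects the closing quote
// and finds 0xA9 instead, so it gives up.
$result = @unserialize($converted);        // @ hides the notice/warning

var_dump($result);                         // bool(false)
```

Here the closing quote is what turns a silently wrong length into a hard failure.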

> Was this just to make strings look more 'stringy', even though the
> format isn't meant to be human-readable?

In my mind the format is a nice pragmatic middle between reasonably efficient, 
reasonably robust, too feature-complete (too many allowed_classes) and somewhat 
human readable. At least enough for incidental debugging or manual tinkering.
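
For context on the allowed_classes remark: unserialize() takes an options array (since PHP 7.0) that restricts which classes may be instantiated. A minimal sketch, with Foo as a made-up class:

```php
<?php
class Foo { public $bar = 'baz'; }

$payload = serialize(new Foo());   // O:3:"Foo":1:{s:3:"bar";s:3:"baz";}

// With allowed_classes => false no class from the payload is instantiated;
// objects come back as __PHP_Incomplete_Class placeholders instead.
$safe = unserialize($payload, ['allowed_classes' => false]);

var_dump($safe instanceof Foo);                      // bool(false)
var_dump($safe instanceof __PHP_Incomplete_Class);   // bool(true)
```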

Re: [PHP-DEV] Why are serialized strings wrapped in double quotes? (s::"")

2024-02-09 Thread Robert Landers
On Fri, Feb 9, 2024 at 8:13 AM Michał Marcin Brzuchalski wrote:
>
> Thu, 8 Feb 2024 at 20:10, Sanford Whiteman wrote:
>
> > Hi Michał,
> >
> > Thursday, February 8, 2024, 2:58:52 AM, you wrote:
> > ...
> > >O3:Foo:5{s4:date;O17:DateTimeImmutable:3{s4:date;s26:2024-02-08 08:41:10.009742;s13:timezone_type;i:3;s8:timezone;s16:Europe/Amsterdam}s6:*foo;s11:Foo
> > >bar
> > >baz;s8:Foobar;i:123456789;s3:tbl;a4{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s3:baz}s8:*color;E12:Color:Yellow}
> > >
> > >This is still readable by humans and keep the size/length in all places
> > >where needed.
> >
> > Amazing. To my eyes it's more readable too.
> >
>
> Just wondering: while null is encoded as just N, the booleans are encoded
> as b:0 or b:1.
> I can imagine these could also be just T and F.
>
>
> > Here's another one: leading numeral *implies* Integer 'i' (so only
> > 'd', 'b' and 's' are necessary). Or maybe that goes too far.
>
>
> I was there too (you can spot it in the very first link), but I also believe
> this goes too far.
>
> All of the above already goes far beyond what you initially asked, and I know that.
> I just like to share what I find.
>
> Cheers,
> Michał Marcin Brzuchalski

If I recall correctly, there is also a `\0` (null character) hiding in
the serialized string as well. It's incredibly annoying when
copy/pasting as sometimes it gets stripped out. It might be worth
removing as well.
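
A small sketch of where those NUL bytes come from, assuming a made-up class Foo with one protected and one private property. The NULs are replaced with a visible `\0` only for printing:

```php
<?php
class Foo {
    protected $foo = 'x';
    private $bar = 1;
}

$s = serialize(new Foo());

// Protected names are stored as "\0*\0name", private ones as "\0Class\0name",
// which is why the declared lengths (6 and 8) look too long once the NULs are lost.
$visible = str_replace("\0", '\0', $s);
echo $visible, "\n";
// O:3:"Foo":2:{s:6:"\0*\0foo";s:1:"x";s:8:"\0Foo\0bar";i:1;}

echo substr_count($s, "\0"), "\n";   // 4
```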

Robert Landers
Software Engineer
Utrecht NL

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php



Re: [PHP-DEV] Why are serialized strings wrapped in double quotes? (s::"")

2024-02-08 Thread Michał Marcin Brzuchalski
Thu, 8 Feb 2024 at 20:10, Sanford Whiteman wrote:

> Hi Michał,
>
> Thursday, February 8, 2024, 2:58:52 AM, you wrote:
> ...
> >O3:Foo:5{s4:date;O17:DateTimeImmutable:3{s4:date;s26:2024-02-08 08:41:10.009742;s13:timezone_type;i:3;s8:timezone;s16:Europe/Amsterdam}s6:*foo;s11:Foo
> >bar
> >baz;s8:Foobar;i:123456789;s3:tbl;a4{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s3:baz}s8:*color;E12:Color:Yellow}
> >
> >This is still readable by humans and keep the size/length in all places
> >where needed.
>
> Amazing. To my eyes it's more readable too.
>

Just wondering: while null is encoded as just N, the booleans are encoded
as b:0 or b:1.
I can imagine these could also be just T and F.


> Here's another one: leading numeral *implies* Integer 'i' (so only
> 'd', 'b' and 's' are necessary). Or maybe that goes too far.


I was there too (you can spot it in the very first link), but I also believe
this goes too far.

All of the above already goes far beyond what you initially asked, and I know that.
I just like to share what I find.

Cheers,
Michał Marcin Brzuchalski


Re: [PHP-DEV] Why are serialized strings wrapped in double quotes? (s::"")

2024-02-08 Thread Sanford Whiteman
Hi Michał,

Thursday, February 8, 2024, 2:58:52 AM, you wrote:

>You inspired me to play with serialization format to spot even more
>unnecessary chars https://3v4l.org/DLh1U
>From my PoV there are more candidates to reduce and still keep the safety,
>for eg:
>removing leading ':' before array/object and trailing ';' inside brackets,
>you reduce by 2 bytes
>
>a:4:{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s:3:"baz";}
>
>Could be simply
>
>a:4{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s:3:baz}
>
>This example saves 4 bytes: double-quotes, one ; and :
>
>If you go further all types that require size/length also don't need extra
>double-colon meaning:
>a:4 could become a4
>s:3 could become s3
>
>The same could apply to O: and E:
>
>O3:Foo:5{s4:date;O17:DateTimeImmutable:3{s4:date;s26:2024-02-08 08:41:10.009742;s13:timezone_type;i:3;s8:timezone;s16:Europe/Amsterdam}s6:*foo;s11:Foo
>bar
>baz;s8:Foobar;i:123456789;s3:tbl;a4{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s3:baz}s8:*color;E12:Color:Yellow}
>
>This is still readable by humans and keep the size/length in all places
>where needed.

Amazing. To my eyes it's more readable too.

Here's another one: leading numeral *implies* Integer 'i' (so only
'd', 'b' and 's' are necessary). Or maybe that goes too far.


>Interestingly when an array is serialized as object property it is not
>followed by ; in field list https://3v4l.org/4p6ve
>
>O:3:"Foo":2:{s:3:"foo";a:3:{i:0;i:1;i:1;i:2;i:2;i:3;}s:3:"bar";s:3:"baz";}
>
>Missing ; between }s was a surprise to me.

Yeah, that almost seems like a bug that unserialize() tolerates.

— S.




Re: [PHP-DEV] Why are serialized strings wrapped in double quotes? (s::"")

2024-02-07 Thread Michał Marcin Brzuchalski
Hi Sandy,

Tue, 6 Feb 2024 at 21:19, Sanford Whiteman wrote:

> Howdy all, haven't posted in ages but good to see the list going strong.
>
> I'd like a little background on something we've long accepted: why
> does the serialization format need double quotes around a string, even
> though the byte length is explicit?
>
> Example:
>
>   s:5:"hello";
>
> All else being equal I would think we could have just
>
>   s:5:hello;
>
> and skip forward 5 bytes. Instead we need to be aware of the leading
> and trailing " in our state machine but I'm not sure what the
> advantage is.
>

You inspired me to play with serialization format to spot even more
unnecessary chars https://3v4l.org/DLh1U
From my PoV there are more candidates for reduction that still keep the safety,
e.g.:
removing the ':' before the opening brace of an array/object and the trailing
';' inside the braces saves 2 bytes

a:4:{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s:3:"baz";}

Could be simply

a:4{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s:3:baz}

This example saves 4 bytes: double-quotes, one ; and :
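
The byte counts can be checked with a quick sketch. Both strings are copied as literals from the message above; the compact form is only a proposal, so today's unserialize() is expected to reject it:

```php
<?php
// Literal strings (no call to serialize(), so the float formatting
// does not depend on the serialize_precision setting).
$current  = 'a:4:{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s:3:"baz";}';
$proposed = 'a:4{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s:3:baz}';

$saved = strlen($current) - strlen($proposed);
printf("%d -> %d bytes, %d saved\n", strlen($current), strlen($proposed), $saved);
// 48 -> 44 bytes, 4 saved

// The existing parser accepts only the current form.
var_dump(unserialize($current) === [123, true, 1.1, 'baz']);   // bool(true)
var_dump(@unserialize($proposed));                             // bool(false)
```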

If you go further, all types that require a size/length also don't need the
extra colon, meaning:
a:4 could become a4
s:3 could become s3

The same could apply to O: and E:

O3:Foo:5{s4:date;O17:DateTimeImmutable:3{s4:date;s26:2024-02-08 08:41:10.009742;s13:timezone_type;i:3;s8:timezone;s16:Europe/Amsterdam}s6:*foo;s11:Foo
bar
baz;s8:Foobar;i:123456789;s3:tbl;a4{i:0;i:123;i:1;b:1;i:2;d:1.1;i:3;s3:baz}s8:*color;E12:Color:Yellow}

This is still readable by humans and keeps the size/length in all places
where needed.
My attached example is poor but shows up to ~20% size reduction.

Interestingly, when an array is serialized as an object property, it is not
followed by ; in the field list https://3v4l.org/4p6ve

O:3:"Foo":2:{s:3:"foo";a:3:{i:0;i:1;i:1;i:2;i:2;i:3;}s:3:"bar";s:3:"baz";}

Missing ; between }s was a surprise to me.
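
A short sketch that reproduces the same shape with plain arrays. The output suggests that scalars end in ';' while arrays and objects end in '}', at the top level as well as when nested:

```php
<?php
// Scalars are terminated by ';', compound values (arrays, objects) by '}'.
$scalar = serialize(1);
$array  = serialize([1]);
echo $scalar, "\n";   // i:1;
echo $array, "\n";    // a:1:{i:0;i:1;}

// So a nested array is followed directly by the next key, with no ';' in between.
$nested = serialize(['foo' => [1, 2, 3], 'bar' => 'baz']);
echo $nested, "\n";
// a:2:{s:3:"foo";a:3:{i:0;i:1;i:1;i:2;i:2;i:3;}s:3:"bar";s:3:"baz";}
```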

Best regards,
Michał Marcin Brzuchalski


Re: [PHP-DEV] Why are serialized strings wrapped in double quotes? (s::"")

2024-02-07 Thread Sanford Whiteman
Nice work, Jim.

>I enjoy spelunking in the history of the project, so I did some digging. It
>looks to me like Kris didn't quite get the history correct. Boris did propose
>a form of serialization first, but it looks like what became serialize() and
>unserialize() came into the project another way.
>
>https://marc.info/?l=php-general&m=90222513234434&w=2
>
>The serialize() and unserialize() functions were first added in PHP 3.0.5
>with that same encoding for strings that you're asking about. Here is the
>original proposal for adding the functions from Jani Lehtimäki:
>
>https://news-web.php.net/php.dev/1444
>
>They were originally conceived as var_save() and var_load() and operated on
>files, but you can see the file format uses the same string encoding, although 
>it used single quotes.
>
>It was committed to CVS by Stig here, but unfortunately the emails to the list 
>didn't include newly-added files.
>
>https://news-web.php.net/php.dev/1540

Huh. So the quotes may have just stuck around from eval()-related approaches
without being officially discussed. In the grand scheme, even if you're wasting 2
bytes for every string, that could be a tiny % on average.

The format's fascinating because it unmistakably *works*, and binary
igbinary/msgpack aside, it's a pretty good byte-stream encoding. If you take
Sergey's results it's way faster than JSON, at least when it's PHP doing the
unserialization:
https://grechin.org/2021/04/06/php-json-encode-vs-serialize-performance-comparison.html

— S.




Re: [PHP-DEV] Why are serialized strings wrapped in double quotes? (s::"")

2024-02-07 Thread Jim Winstead
On Tue, Feb 6, 2024, at 12:19 PM, Sanford Whiteman wrote:
> Howdy all, haven't posted in ages but good to see the list going strong.
>
> I'd like a little background on something we've long accepted: why
> does the serialization format need double quotes around a string, even
> though the byte length is explicit?

I enjoy spelunking in the history of the project, so I did some digging. It 
looks to me like Kris didn't quite get the history correct. Boris did propose a 
form of serialization first, but it looks like what became serialize() and 
unserialize() came into the project another way.

https://marc.info/?l=php-general&m=90222513234434&w=2

The serialize() and unserialize() functions were first added in PHP 3.0.5 with 
that same encoding for strings that you're asking about. Here is the original 
proposal for adding the functions from Jani Lehtimäki:

https://news-web.php.net/php.dev/1444

They were originally conceived as var_save() and var_load() and operated on 
files, but you can see the file format uses the same string encoding, although 
it used single quotes.

It was committed to CVS by Stig here, but unfortunately the emails to the list 
didn't include newly-added files.

https://news-web.php.net/php.dev/1540

I'm not sure if the old CVS history is preserved somewhere, but based on what 
appeared in 3.0.5, that format probably goes back to the beginning and it 
doesn't look like there was any on-list discussion about it.

Jim




Re: [PHP-DEV] Why are serialized strings wrapped in double quotes? (s::"")

2024-02-07 Thread Sanford Whiteman
> I don't have the historical context, but I'm assuming that's it. PHP's
> serialization format is not efficient, and I don't think that was ever
> the primary focus.

Thanks Ilija. That'll have to suffice unless someone remembers a specific
decision (I searched all the old Internals posts and nothing came up). Most of my
readers are pretty junior, but I hate to say something that conflicts with
their intuition.

— S.

On Wed, Feb 7, 2024 at 7:28 AM Ilija Tovilo wrote:
>
> Hi Sandy
>
> On Tue, Feb 6, 2024 at 9:19 PM Sanford Whiteman wrote:
> >
> > I'd like a little background on something we've long accepted: why
> > does the serialization format need double quotes around a string, even
> > though the byte length is explicit?
> >
> > Example:
> >
> >   s:5:"hello";
> >
> > All else being equal I would think we could have just
> >
> >   s:5:hello;
> >
> > Was this just to make strings look more 'stringy', even though the
> > format isn't meant to be human-readable?
>
> I don't have the historical context, but I'm assuming that's it. PHP's
> serialization format is not efficient, and I don't think that was ever
> the primary focus. If you need something more efficient, you can try
> https://github.com/igbinary/igbinary which is aimed to be a drop-in
> replacement.
>
> Ilija




Re: [PHP-DEV] Why are serialized strings wrapped in double quotes? (s::"")

2024-02-07 Thread Ilija Tovilo
Hi Sandy

On Tue, Feb 6, 2024 at 9:19 PM Sanford Whiteman wrote:
>
> I'd like a little background on something we've long accepted: why
> does the serialization format need double quotes around a string, even
> though the byte length is explicit?
>
> Example:
>
>   s:5:"hello";
>
> All else being equal I would think we could have just
>
>   s:5:hello;
>
> Was this just to make strings look more 'stringy', even though the
> format isn't meant to be human-readable?

I don't have the historical context, but I'm assuming that's it. PHP's
serialization format is not efficient, and I don't think that was ever
the primary focus. If you need something more efficient, you can try
https://github.com/igbinary/igbinary which is aimed to be a drop-in
replacement.
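
A hedged sketch of what "drop-in" could look like in practice. igbinary is a PECL extension, so the example checks for igbinary_serialize() and falls back to the built-in functions when it is not loaded; the sample data is made up:

```php
<?php
// Guard on the extension being present; otherwise use the built-in format.
$useIgbinary = function_exists('igbinary_serialize');

$data = ['date' => '2024-02-08', 'tbl' => [123, true, 1.1, 'baz']];

$blob = $useIgbinary ? igbinary_serialize($data) : serialize($data);
$back = $useIgbinary ? igbinary_unserialize($blob) : unserialize($blob);

var_dump($back === $data);   // bool(true)
printf("%s: %d bytes\n", $useIgbinary ? 'igbinary' : 'serialize', strlen($blob));
```

The two formats are not interchangeable on disk: a blob written by one has to be read back by the matching function.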

Ilija




[PHP-DEV] Why are serialized strings wrapped in double quotes? (s::"")

2024-02-06 Thread Sanford Whiteman
Howdy all, haven't posted in ages but good to see the list going strong.

I'd like a little background on something we've long accepted: why
does the serialization format need double quotes around a string, even
though the byte length is explicit?

Example:

  s:5:"hello";

All else being equal I would think we could have just

  s:5:hello;

and skip forward 5 bytes. Instead we need to be aware of the leading
and trailing " in our state machine but I'm not sure what the
advantage is.
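
To make the state-machine point concrete, here is a minimal sketch of a reader for a single string token. parseSerializedString() is a hypothetical helper written for this illustration, not how unserialize() is implemented internally:

```php
<?php
/**
 * Hypothetical helper: parse one s:<len>:"<bytes>"; token at the start of $input.
 * Returns the payload, or null if the token is malformed.
 */
function parseSerializedString(string $input): ?string
{
    // Read the declared byte length.
    if (!preg_match('/^s:(\d+):/', $input, $m)) {
        return null;
    }
    $len   = (int) $m[1];
    $start = strlen($m[0]);

    // Need room for: opening quote + payload + closing quote + ';'.
    if (strlen($input) < $start + $len + 3) {
        return null;
    }
    if ($input[$start] !== '"') {
        return null;
    }

    // The payload is taken by length, never by scanning for a delimiter,
    // so it may itself contain quotes or semicolons.
    $payload = substr($input, $start + 1, $len);

    // The two trailing bytes act as a cheap consistency check on the length.
    if (substr($input, $start + 1 + $len, 2) !== '";') {
        return null;
    }
    return $payload;
}

var_dump(parseSerializedString('s:5:"hello";'));   // string(5) "hello"
var_dump(parseSerializedString('s:5:"he"lo";'));   // string(5) "he"lo"
var_dump(parseSerializedString('s:4:"hello";'));   // NULL
```

In this sketch the quotes add two comparisons per string and never affect where the payload ends; their only effect is to reject input whose length prefix is wrong.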

Was this just to make strings look more 'stringy', even though the
format isn't meant to be human-readable?

I read (the archive of) Kris's blog post:

  https://web.archive.org/web/20170813190508/http://blog.koehntopp.info/index.php/2407-php-understanding-unserialize/

but that didn't shed any light. Zigzagging through the source wasn't
getting me there as fast as someone who was there from the beginning.

The reason for my question is I'm writing a blog post about a SaaS app
that (don't gasp/laugh) returns serialize() format from one of its
APIs. In discussing why the PHP format can make sense vs. JSON, I
wanted to point to the faster parsing you get with length-prefixed
strings. Then I started wondering about why we have both the length
prefix and the extra quotes.

Thanks,

Sandy
