Re: [PHP-DEV] [RFC] Data Classes

Rob Landers Sun, 24 Nov 2024 10:12:56 -0800

On Sat, Nov 23, 2024, at 23:10, Ilija Tovilo wrote:
> Hi Rob
> 
> On Sat, Nov 23, 2024 at 2:12 PM Rob Landers <rob@bottled.codes> wrote:
> >
> > Born from the Records RFC (https://wiki.php.net/rfc/records) discussion, I 
> > would like to introduce to you a competing RFC: Data Classes 
> > (https://wiki.php.net/rfc/dataclass).
> 
> As others have pointed out, your RFC is very similar to my proposal
> for struct. I don't quite understand the reason to compete and race
> each other to the finish line. Combined efforts are usually better.


This isn't a race or competition, and I meant "competing with records" and not 
"competing with Ilija." Though, I see how it could be interpreted that way.

To be honest, I don't like this RFC. I think records and/or structs are the 
right answer (dedicated behavior vs. trying to stuff more behavior into 
classes). As mentioned elsewhere in the thread, the behavior here is more of an 
emergent property than any actual thinking through it. So, the fact that it is 
similar to your proposal is merely coincidental; it wasn't planned that way 
(nor did I catch it before submitting the RFC). 

This mailing list is the only real interaction I have with other php-src devs 
(and the occasional PR to php-src), so please forgive me if I don't notice 
something. Yes, that isn't a real excuse, but not having other people to bounce 
ideas off of usually means I'll make an idiot out of myself, eventually.

That being said, there are a ton of edge cases I've uncovered where everything 
breaks down in this RFC:
 • PHP references are "evil", so I've discovered.
 • "new" generates some strange opcodes—I'm currently investigating this—that 
make dealing with value types difficult. I'm trying to solve this in some way 
that doesn't require changing the generated opcodes, but that might be 
impossible. This solution would allow using "new" with records and solve the 
constructor problem in this RFC.
 • Did I mention PHP references are evil?
In any case, I'd much rather help with your structs proposal, as the more I 
work on this, the more I don't like it.

> 
> One of the bigger differences between our proposals is the addition of
> mutating methods in my proposal compared to yours. You show the
> following example in your RFC:
> 
> ```php
> data class Rectangle {
>     public function __construct(public int $width, public int $height) {}
> 
>     public function resize(int $width, int $height): static {
>         $this->height = $height;
>         $this->width = $width;
>         return $this;
>     }
> }
> ```
> 
> The resize method here modifies the instance and thus implicitly
> creates a copy. That's _fine_ for such a small structure. However,
> note that this still leads to the performance issues we have
> previously discussed for growable data structures.
> 
> ```php
> data class Vector {
>     public function append(mixed $value): static {
>         /* Internal implementation, $values is some underlying storage. */
>         $this->values[] = $value;
>         return $this;
>     }
> }
> ```
> 
> Calling `$vector->append(42);` will increase the refcount of
> `$vector`, and cause separation on `$this->values[] = ...;`. If
> `$vector->values` is a big storage, cloning will be very expensive.
> Hence, appending becomes an O(n) operation (because each element in
> the vector is copied to the new structure), and hence appending to an
> array in a loop will tank your performance.  That's the reason for the
> introduction of the `$vector->append!(42)` syntax in my proposal. It
> separates the value at call-site when necessary, and avoids separation
> on `$this` in methods altogether.

As I mentioned to Larry in the records discussion, the biggest problem with a 
"data class" is that it really needs dedicated syntax to be done properly (such 
as the ones in structs, and I plan to remove some syntax features from records 
since I got a lot of negative feedback about the syntax there; shout out to 
reddit). I don't think that would belong to a general class but rather a 
dedicated type. There really isn't a "one size fits all" solution here.

> 
> There might be some general confusion on the performance issue. In one
> of your e-mails in the last thread, you have mentioned:
> 
> > Like Ilija mentioned in their email, there are significant performance 
> > optimizations to be had here that are simply not possible using regular 
> > (readonly) classes. I didn't go into detail as to how it works because it 
> > feels like an implementation detail, but I will spend some time distilling 
> > this and its consequences, into the RFC, over the coming days. As a simple 
> > illustration, there can be significant memory usage improvements:
> >
> > 100,000 arrays: https://3v4l.org/Z4CcV
> > 100,000 readonly classes: https://3v4l.org/1vhNp
> 
> First off, the array example only uses less memory because [1, 2] is a
> constant array. When you make it dynamic, they will become way less
> efficient than objects. https://3v4l.org/pETM9
> 
> But this is not the point I was trying to make either. Rather, when it
> comes to immutable, growable data structures, every mutation becomes
> an extremely expensive operation because the entire data structure,
> including its underlying storage, needs to be copied. For example:
> 
> https://3v4l.org/BEsYT
> 
> ```php
> class Vector {
>     private $values;
> 
>     public function populate() {
>         $this->values = range(1, 1_000_000);
>     }
> 
>     public function appendMutable() {
>         $this->values[] = 100_000_001;
>     }
> 
>     public function appendImmutable() {
>         $new = clone $this;
>         $this->values[] = 100_000_001;
>     }
> }
> ```
> 
> > appendMutable(): float(8.106231689453125E-6)
> > appendImmutable(): float(0.012187957763671875)
> 
> That's a factor of 1 500 difference for an array containing 1 million
> numbers. Obviously, concrete numbers will vary, but the problem grows
> the bigger the array becomes.
> 
> Ilija
> 

I'm not quite focused on actual performance (yet), but I understand your point. 
To that, I have an idea I've been playing around with to have "layered 
hashmaps" where a copy/clone is just a layer on the original (made immutable) 
hashmap, thus having a near zero cost for copies. There's still extra 
indirection involved, so it potentially could be worse 🤷. I would think an 
average case O(1) would be faster than a copy, at least for large maps, but 
wall-clock time and Big O don't always correlate. I also note that the 
benchmarks in the repo are based on operations, not time, so it is quite 
difficult to show that a feature that increases the number of operations 
decreases the overall time. In other words, you may add 100 operations with a 
strong cache locality and remove 10 with poor cache locality -- quality vs. 
quantity, an age-old problem. Anyway, there's no way to tell the exact 
performance characteristics until I finish the implementation. I'd love to 
discuss this more, as it's actually pretty neat and interesting to solve in a 
performant way.

It could also be like my zend_string refactor I spent a few months on over the 
summer, where the performance isn't affected (at least for wall-clock time), 
and the only added benefits showing when you perform lots of string 
modifications. Other benefits won't make sense without already having an RFC I 
was working on (that is still in the draft stage). So, for now, that branch 
will just gather dust.

Things like this are largely why I didn't put too much information about the 
performance characteristics of records in my RFC. It's really an implementation 
detail, and I didn't want people's decisions to be tied to an implementation 
detail that could change drastically. In other words, maybe the performance 
would be poor now, but I'm confident it can be improved later. IMHO, we should 
focus on whether we want the feature in the first place rather than worrying 
about how fast or slow it is. Many improvements don't make sense for the sake 
of improving them, and it isn't until there is a problem to be solved (such as 
a poor performing feature) that the improvement becomes worthwhile.

— Rob

Re: [PHP-DEV] [RFC] Data Classes

Reply via email to