peter-toth edited a comment on pull request #31682:
URL: https://github.com/apache/spark/pull/31682#issuecomment-788950105
> correct me if I'm wrong: pickler recursively serializes the input and
> applies the cache. The input is a row of `(c1, c2)`, but pickler recursively
> serializes the row of `c1` and `c2`, and causes a problem because of the cache.
You are right that caching plays an important role in this issue. But IMO
cache lookup by reference can't cause issues if we use immutable objects. The
issue here is that Pyrolite 4.21 introduced cache lookup by value, and some of
our data structures (`GenericRowWithSchema`) behave oddly when compared
with `.equals()`...
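To make the difference between the two lookup modes concrete, here is a toy sketch. It is not Pyrolite's code: `Row`, `memoEntries` and the use of plain maps as a "memo" are hypothetical stand-ins, and the only assumption carried over from the discussion is a row type whose `.equals()` ignores the schema.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

public class MemoLookupSketch {
    // Hypothetical row type: equals()/hashCode() look at the values only,
    // so two rows with different field names can compare equal.
    static final class Row {
        final String[] fieldNames; // stands in for the schema
        final Object[] values;

        Row(String[] fieldNames, Object... values) {
            this.fieldNames = fieldNames;
            this.values = values;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Row && Arrays.equals(values, ((Row) o).values);
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(values);
        }
    }

    // Registers each row in the memo unless an "equal" key is already there,
    // and returns the number of distinct memo entries afterwards.
    static int memoEntries(Map<Object, Integer> memo, Row... rows) {
        for (Row r : rows) {
            memo.putIfAbsent(r, memo.size());
        }
        return memo.size();
    }

    public static void main(String[] args) {
        Row a = new Row(new String[] {"c1"}, 1);
        Row b = new Row(new String[] {"c2"}, 1);

        // Lookup by reference: the two rows stay distinct.
        System.out.println(memoEntries(new IdentityHashMap<>(), a, b));
        // Lookup by value: b is treated as "already seen" because a.equals(b).
        System.out.println(memoEntries(new HashMap<>(), a, b));
    }
}
```

With the identity-based map both rows get their own entry; with the value-based map the second row is collapsed into the first even though its field names differ.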
> Then I think it's not realistic to make one pickler instance to handle
> data with the same schema. Turning off `valueCompare` may be the only choice.
Agreed, this looks to be the easiest way to fix the issue now. I've
already modified this PR to revert the previous changes and add
`valueCompare=false`.
I would argue though that `valueCompare=false` is not the only way to fix
it. If `GenericRowWithSchema` took the schema into account in its
`.equals()`, that would also fix it.
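As a rough illustration of that alternative, the sketch below shows a row type whose `.equals()` and `.hashCode()` include the schema. `RowWithSchema` and its `fieldNames` array are hypothetical simplifications, not Spark's actual `GenericRowWithSchema`; the point is only that rows with equal values but different schemas no longer compare equal, so a value-based cache would keep them apart.

```java
import java.util.Arrays;

public class SchemaAwareEqualsSketch {
    // Hypothetical simplified row: the schema is reduced to an array of field names.
    static final class RowWithSchema {
        final String[] fieldNames;
        final Object[] values;

        RowWithSchema(String[] fieldNames, Object... values) {
            this.fieldNames = fieldNames;
            this.values = values;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof RowWithSchema)) {
                return false;
            }
            RowWithSchema other = (RowWithSchema) o;
            // Both the schema and the values must match.
            return Arrays.equals(fieldNames, other.fieldNames)
                && Arrays.equals(values, other.values);
        }

        @Override
        public int hashCode() {
            // Kept consistent with equals(): schema and values both contribute.
            return 31 * Arrays.hashCode(fieldNames) + Arrays.hashCode(values);
        }
    }

    public static void main(String[] args) {
        RowWithSchema a = new RowWithSchema(new String[] {"c1"}, 1);
        RowWithSchema b = new RowWithSchema(new String[] {"c2"}, 1);
        RowWithSchema c = new RowWithSchema(new String[] {"c1"}, 1);

        System.out.println(a.equals(b)); // same values, different schema
        System.out.println(a.equals(c)); // same values, same schema
    }
}
```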
> To evaluate the severity of the problem, it seems only an issue when there
> are nested struct types?
Yes.