[
https://issues.apache.org/jira/browse/SPARK-35108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325252#comment-17325252
]
Thomas Graves commented on SPARK-35108:
---------------------------------------
This seems like a correctness issue in some cases, so I am marking it as such until we
investigate. [~hyukjin.kwon] [~cloud_fan]
> Pickle produces incorrect key labels for GenericRowWithSchema (data
> corruption)
> -------------------------------------------------------------------------------
>
> Key: SPARK-35108
> URL: https://issues.apache.org/jira/browse/SPARK-35108
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.1, 3.0.2
> Reporter: Robert Joseph Evans
> Priority: Blocker
> Labels: correctness
> Attachments: test.py, test.sh
>
>
> I think this also shows up for all versions of Spark that pickle the data
> when doing a collect from python.
> When you do a collect in Python, the JVM side does a collect and converts the
> UnsafeRows into GenericRowWithSchema instances before it sends them to the
> Pickler. The Pickler, by default, tries to dedupe objects using the object's
> hashCode and .equals. But .equals and .hashCode for
> GenericRowWithSchema only look at the data, not the schema. Yet when we
> pickle a row, the keys from the schema are written out.
> This can result in data corruption in cases where a row has the same number
> of elements as a struct within the row (or as a sub-struct within another
> struct). If the data also happens to be the same, the two objects compare
> equal, get deduped, and the keys for the resulting row or struct
> can be wrong.
> My repro case is a bit convoluted, but it does happen.
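The dedupe pitfall described above can be sketched in plain Python. This is illustrative only: `RowWithSchema` here is a hypothetical stand-in for Spark's GenericRowWithSchema, and the dict plays the role of the Pickler's equality-based memo table; it is not Spark code.

```python
# Illustrative sketch of the bug: a Row-like class whose __eq__/__hash__
# compare only the values, ignoring the schema, as the report describes
# for GenericRowWithSchema.
class RowWithSchema:
    def __init__(self, values, field_names):
        self.values = tuple(values)
        self.field_names = tuple(field_names)

    # Equality and hash look only at the data, not the schema.
    def __eq__(self, other):
        return isinstance(other, RowWithSchema) and self.values == other.values

    def __hash__(self):
        return hash(self.values)


# Two rows with identical data but different schemas.
outer = RowWithSchema((1, 2), ("a", "b"))
inner = RowWithSchema((1, 2), ("x", "y"))

# A dedupe cache keyed by equality (standing in for the Pickler's memo)
# collapses both rows onto whichever was seen first, so the second row's
# schema ("x", "y") is silently replaced by the first row's ("a", "b").
memo = {}
for row in (outer, inner):
    memo.setdefault(row, row)

deduped = memo[inner]
print(deduped.field_names)  # -> ('a', 'b'), not ('x', 'y')
```

Because pickling a row writes out the schema's keys, the deduped object carries the wrong key labels, which is exactly the corruption the issue reports.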
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]