Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/5445#discussion_r28200422
--- Diff: python/pyspark/sql/types.py ---
@@ -997,12 +997,13 @@ def _restore_object(dataType, obj):
# same object in most cases.
k = id(dataType)
cls = _cached_cls.get(k)
- if cls is None:
+ if cls is None or cls.__datatype is not dataType:
--- End diff ---
After some debugging, I found that this check is necessary. For example:
In the first batch, we create DataType `d1` and build a class `Row` for it.
`Row.__datatype__` points to `d1`, and the class is cached under both `id(d1)` and `d1` as keys:
```
id(d1) -> Row
d1 -> Row
Row.__datatype__ -> d1
```
In the second batch, it creates `d2`, which is equal to `d1`. The lookup by
`id(d2)` misses, but the lookup with `d2` as the key finds `Row`, so the cache becomes
```
id(d1) -> Row
d1 -> Row
id(d2) -> Row
Row.__datatype__ -> d2
```
At this point, `d1` can be collected by the GC, and `id(d1)` can be recycled for
some other object.
Later, there could be an object `dd` that gets the same id that `d1` had. The
lookup then finds `Row` via `id(dd)`, even though `dd` is a different struct type
than `d1`, and the assert fails.
Finally, I think that the current approach is the correct one.
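For reference, a simplified model of the lookup this check guards (not the
actual pyspark code: `_restore_object_model` and `_create_row_cls` are
hypothetical names, and the sketch only keeps the id-based key):
```
_cached_cls = {}

def _create_row_cls(dataType):
    # Hypothetical factory: build a Row-like class bound to this dataType.
    return type("Row", (tuple,), {"__datatype": dataType})

def _restore_object_model(dataType, obj):
    # Key by id(dataType) for speed: within one batch, dataType is usually
    # the very same object, so this is a cheap dict hit.
    k = id(dataType)
    cls = _cached_cls.get(k)
    # The id alone is not trustworthy: once the DataType that produced the
    # cached class has been garbage collected, its id can be recycled for a
    # different DataType. The identity check rejects such stale hits.
    if cls is None or cls.__datatype is not dataType:
        cls = _create_row_cls(dataType)
        _cached_cls[k] = cls
    return cls(obj)
```
Keying by id() is only a speed optimization for the common case where the same
DataType object is seen again within a batch; the identity re-check is what
keeps it correct.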