Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5445#discussion_r28200422
  
    --- Diff: python/pyspark/sql/types.py ---
    @@ -997,12 +997,13 @@ def _restore_object(dataType, obj):
         # same object in most cases.
         k = id(dataType)
         cls = _cached_cls.get(k)
    -    if cls is None:
    +    if cls is None or cls.__datatype is not dataType:
    --- End diff ---
    
    After some debugging, I found that this check is necessary. For example:
    
    At first, we create DataType `d1` in the first batch and create a class Row; `Row.__datatype__` will point to d1, and the class will be cached under both id(d1) and d1 as keys:
    
    ```
    id(d1) -> Row
    d1 -> Row
    Row.__datatype__ -> d1
    ```
    
    In the second batch, it will create d2, which is equal to d1. A lookup by id(d2) will miss, but a lookup with d2 as the key will find Row, so the cache becomes:
    ```
    id(d1) -> Row
    d1 -> Row
    id(d2) -> Row
    Row.__datatype__ -> d2
    ```
    At this point, d1 could be collected by the GC, and id(d1) could be recycled for another object.
    
    Later, there could be an object dd that happens to get the same id that d1 had. The lookup will then find Row via id(dd), but dd is different from d1 (a different struct type), so the assert will fail.
    
    Finally, I think that the current approach is the correct one.
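    
    For reference, a simplified sketch of the guarded lookup (not the actual `_restore_object` implementation; `make_cls` is a hypothetical stand-in for building the Row class):
    
    ```
    _cached_cls = {}
    
    def _restore_object_sketch(dataType, make_cls):
        k = id(dataType)
        cls = _cached_cls.get(k)
        # Reuse the cached class only if it was built for this exact object;
        # a recycled id() could otherwise return a class for a different struct type.
        if cls is None or cls.__datatype is not dataType:
            cls = make_cls(dataType)
            cls.__datatype = dataType
            _cached_cls[k] = cls
        return cls
    ```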


