HyukjinKwon edited a comment on issue #24519: [SPARK-27612][PYTHON] Use 
Python's default protocol instead of highest protocol
URL: https://github.com/apache/spark/pull/24519#issuecomment-488945188
 
 
   I can almost confirm that this is a bug in Pyrolite.
   
   1. Python lists are chunked via `BatchedSerializer` at `parallelize` (in 
`createDataFrame`), for instance, as below (each line is each batch):
   
       ```bash
       [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], 
[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
       [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], 
[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
       ...
       [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
       ```
   
       Here is opcodes for each batch (separated by `===..`) with protocol 4:
       <details>
   
       ```
       ==============================
           0: \x80 PROTO      4
           2: \x95 FRAME      47
          11: ]    EMPTY_LIST
          12: \x94 MEMOIZE    (as 0)
          13: (    MARK
          14: ]        EMPTY_LIST
          15: \x94     MEMOIZE    (as 1)
          16: (        MARK
          17: K            BININT1    1
          19: K            BININT1    2
          21: K            BININT1    3
          23: K            BININT1    4
          25: e            APPENDS    (MARK at 16)
          26: \x85     TUPLE1
          27: \x94     MEMOIZE    (as 2)
          28: h        BINGET     1
          30: \x85     TUPLE1
          31: \x94     MEMOIZE    (as 3)
          32: h        BINGET     1
          34: \x85     TUPLE1
          35: \x94     MEMOIZE    (as 4)
          36: h        BINGET     1
          38: \x85     TUPLE1
          39: \x94     MEMOIZE    (as 5)
          40: h        BINGET     1
          42: \x85     TUPLE1
          43: \x94     MEMOIZE    (as 6)
          44: h        BINGET     1
          46: \x85     TUPLE1
          47: \x94     MEMOIZE    (as 7)
          48: h        BINGET     1
          50: \x85     TUPLE1
          51: \x94     MEMOIZE    (as 8)
          52: h        BINGET     1
          54: \x85     TUPLE1
          55: \x94     MEMOIZE    (as 9)
          56: e        APPENDS    (MARK at 13)
          57: .    STOP
       highest protocol among opcodes = 4
       ==============================
       ==============================
           0: \x80 PROTO      4
           2: \x95 FRAME      47
          11: ]    EMPTY_LIST
          12: \x94 MEMOIZE    (as 0)
          13: (    MARK
          14: ]        EMPTY_LIST
          15: \x94     MEMOIZE    (as 1)
          16: (        MARK
          17: K            BININT1    1
          19: K            BININT1    2
          21: K            BININT1    3
          23: K            BININT1    4
          25: e            APPENDS    (MARK at 16)
          26: \x85     TUPLE1
          27: \x94     MEMOIZE    (as 2)
          28: h        BINGET     1
          30: \x85     TUPLE1
          31: \x94     MEMOIZE    (as 3)
          32: h        BINGET     1
          34: \x85     TUPLE1
          35: \x94     MEMOIZE    (as 4)
          36: h        BINGET     1
          38: \x85     TUPLE1
          39: \x94     MEMOIZE    (as 5)
          40: h        BINGET     1
          42: \x85     TUPLE1
          43: \x94     MEMOIZE    (as 6)
          44: h        BINGET     1
          46: \x85     TUPLE1
          47: \x94     MEMOIZE    (as 7)
          48: h        BINGET     1
          50: \x85     TUPLE1
          51: \x94     MEMOIZE    (as 8)
          52: h        BINGET     1
          54: \x85     TUPLE1
          55: \x94     MEMOIZE    (as 9)
          56: e        APPENDS    (MARK at 13)
          57: .    STOP
       highest protocol among opcodes = 4
       ==============================
       ...
       ==============================
           0: \x80 PROTO      4
           2: \x95 FRAME      31
          11: ]    EMPTY_LIST
          12: \x94 MEMOIZE    (as 0)
          13: (    MARK
          14: ]        EMPTY_LIST
          15: \x94     MEMOIZE    (as 1)
          16: (        MARK
          17: K            BININT1    1
          19: K            BININT1    2
          21: K            BININT1    3
          23: K            BININT1    4
          25: e            APPENDS    (MARK at 16)
          26: \x85     TUPLE1
          27: \x94     MEMOIZE    (as 2)
          28: h        BINGET     1
          30: \x85     TUPLE1
          31: \x94     MEMOIZE    (as 3)
          32: h        BINGET     1
          34: \x85     TUPLE1
          35: \x94     MEMOIZE    (as 4)
          36: h        BINGET     1
          38: \x85     TUPLE1
          39: \x94     MEMOIZE    (as 5)
          40: e        APPENDS    (MARK at 13)
          41: .    STOP
       highest protocol among opcodes = 4
       ==============================
       ```
       </details>
   
   2. Those batches become binary in an RDD at `parallelize` - so far results 
look fine
   
   3. `rdd._to_java_object_rdd` -> `SerDeUtil.pythonToJava` -> `Unpickler.loads`
   
   4. Here, Pryolite `Unpickler.loads` converts each Python object batch into 
Java objects
   
   5. The converted Java objects look a bit odd from Pryolite as below (see the 
last three lists):
   
       ```
       [1, 2, 3, 4]
       [1, 2, 3, 4]
       ...
       [[[1, 2, 3, 4]], [[1, 2, 3, 4]]]
       [[[1, 2, 3, 4]], [[1, 2, 3, 4]]]
       [[[1, 2, 3, 4]], [[1, 2, 3, 4]]]
       ```
   
   6. Last three nested lists are ignored via `EvaluatePython.makeFromJava` and 
becomes `null` -> `None`.
   
   7. So the Jenkins test was failed at 
https://github.com/apache/spark/pull/24519#issuecomment-488913729 as below:
   
       ```
       Traceback (most recent call last):
         File 
"/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/tests/test_serde.py",
 line 134, in test_int_array_serialization
           self.assertEqual(len(list(filter(lambda r: None in r.value, 
df.collect()))), 0)
       AssertionError: 3 != 0
       ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to