HyukjinKwon edited a comment on issue #24519: [SPARK-27612][PYTHON] Use 
Python's default protocol instead of highest protocol
URL: https://github.com/apache/spark/pull/24519#issuecomment-488945188
 
 
   I can almost confirm that this is a bug in Pyrolite.
   
   1. Python lists are chunked via `BatchedSerializer` at `parallelize`, for 
instance, as below (each line is each batch):
   
       ```bash
       [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], 
[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
       [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], 
[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
       ...
       [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
       ```
   
       Here is opcodes for each batch (separated by `===..`) with protocol 4:
       <details>
   
       ```
       ==============================
           0: \x80 PROTO      4
           2: \x95 FRAME      47
          11: ]    EMPTY_LIST
          12: \x94 MEMOIZE    (as 0)
          13: (    MARK
          14: ]        EMPTY_LIST
          15: \x94     MEMOIZE    (as 1)
          16: (        MARK
          17: K            BININT1    1
          19: K            BININT1    2
          21: K            BININT1    3
          23: K            BININT1    4
          25: e            APPENDS    (MARK at 16)
          26: \x85     TUPLE1
          27: \x94     MEMOIZE    (as 2)
          28: h        BINGET     1
          30: \x85     TUPLE1
          31: \x94     MEMOIZE    (as 3)
          32: h        BINGET     1
          34: \x85     TUPLE1
          35: \x94     MEMOIZE    (as 4)
          36: h        BINGET     1
          38: \x85     TUPLE1
          39: \x94     MEMOIZE    (as 5)
          40: h        BINGET     1
          42: \x85     TUPLE1
          43: \x94     MEMOIZE    (as 6)
          44: h        BINGET     1
          46: \x85     TUPLE1
          47: \x94     MEMOIZE    (as 7)
          48: h        BINGET     1
          50: \x85     TUPLE1
          51: \x94     MEMOIZE    (as 8)
          52: h        BINGET     1
          54: \x85     TUPLE1
          55: \x94     MEMOIZE    (as 9)
          56: e        APPENDS    (MARK at 13)
          57: .    STOP
       highest protocol among opcodes = 4
       ==============================
       ==============================
           0: \x80 PROTO      4
           2: \x95 FRAME      47
          11: ]    EMPTY_LIST
          12: \x94 MEMOIZE    (as 0)
          13: (    MARK
          14: ]        EMPTY_LIST
          15: \x94     MEMOIZE    (as 1)
          16: (        MARK
          17: K            BININT1    1
          19: K            BININT1    2
          21: K            BININT1    3
          23: K            BININT1    4
          25: e            APPENDS    (MARK at 16)
          26: \x85     TUPLE1
          27: \x94     MEMOIZE    (as 2)
          28: h        BINGET     1
          30: \x85     TUPLE1
          31: \x94     MEMOIZE    (as 3)
          32: h        BINGET     1
          34: \x85     TUPLE1
          35: \x94     MEMOIZE    (as 4)
          36: h        BINGET     1
          38: \x85     TUPLE1
          39: \x94     MEMOIZE    (as 5)
          40: h        BINGET     1
          42: \x85     TUPLE1
          43: \x94     MEMOIZE    (as 6)
          44: h        BINGET     1
          46: \x85     TUPLE1
          47: \x94     MEMOIZE    (as 7)
          48: h        BINGET     1
          50: \x85     TUPLE1
          51: \x94     MEMOIZE    (as 8)
          52: h        BINGET     1
          54: \x85     TUPLE1
          55: \x94     MEMOIZE    (as 9)
          56: e        APPENDS    (MARK at 13)
          57: .    STOP
       highest protocol among opcodes = 4
       ==============================
       ...
       ==============================
           0: \x80 PROTO      4
           2: \x95 FRAME      31
          11: ]    EMPTY_LIST
          12: \x94 MEMOIZE    (as 0)
          13: (    MARK
          14: ]        EMPTY_LIST
          15: \x94     MEMOIZE    (as 1)
          16: (        MARK
          17: K            BININT1    1
          19: K            BININT1    2
          21: K            BININT1    3
          23: K            BININT1    4
          25: e            APPENDS    (MARK at 16)
          26: \x85     TUPLE1
          27: \x94     MEMOIZE    (as 2)
          28: h        BINGET     1
          30: \x85     TUPLE1
          31: \x94     MEMOIZE    (as 3)
          32: h        BINGET     1
          34: \x85     TUPLE1
          35: \x94     MEMOIZE    (as 4)
          36: h        BINGET     1
          38: \x85     TUPLE1
          39: \x94     MEMOIZE    (as 5)
          40: e        APPENDS    (MARK at 13)
          41: .    STOP
       highest protocol among opcodes = 4
       ==============================
       ```
       </details>
   
   2. Those batches become binary in an RDD at `parallelize` - so far results 
look fine
   
   3. `rdd._to_java_object_rdd` -> `SerDeUtil.pythonToJava` -> `Unpickler.loads`
   
   4. Here, Pryolite `Unpickler.loads` converts each Python object batch into 
Java objects
   
   5. The converted Java objects look a bit odd from Pryolite as below:
   
       ```
       [1, 2, 3, 4]
       [1, 2, 3, 4]
       ...
       [[[1, 2, 3, 4]], [[1, 2, 3, 4]]]
       [[[1, 2, 3, 4]], [[1, 2, 3, 4]]]
       [[[1, 2, 3, 4]], [[1, 2, 3, 4]]]
       ```
   
   6. Nested last three list are ignored via `EvaluatePython.makeFromJava` and 
becomes `null` -> `None`.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to