HyukjinKwon edited a comment on issue #24519: [SPARK-27612][PYTHON] Use Python's default protocol instead of highest protocol URL: https://github.com/apache/spark/pull/24519#issuecomment-488945188 I can almost confirm that this is a bug in Pyrolite. 1. Python lists are chunked via `BatchedSerializer` at `parallelize`, for instance, as below (each line is each batch): ```bash [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]] [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]] ... [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]] ``` Here is opcodes for each batch (separated by `===..`) with protocol 4: <details> ``` ============================== 0: \x80 PROTO 4 2: \x95 FRAME 47 11: ] EMPTY_LIST 12: \x94 MEMOIZE (as 0) 13: ( MARK 14: ] EMPTY_LIST 15: \x94 MEMOIZE (as 1) 16: ( MARK 17: K BININT1 1 19: K BININT1 2 21: K BININT1 3 23: K BININT1 4 25: e APPENDS (MARK at 16) 26: \x85 TUPLE1 27: \x94 MEMOIZE (as 2) 28: h BINGET 1 30: \x85 TUPLE1 31: \x94 MEMOIZE (as 3) 32: h BINGET 1 34: \x85 TUPLE1 35: \x94 MEMOIZE (as 4) 36: h BINGET 1 38: \x85 TUPLE1 39: \x94 MEMOIZE (as 5) 40: h BINGET 1 42: \x85 TUPLE1 43: \x94 MEMOIZE (as 6) 44: h BINGET 1 46: \x85 TUPLE1 47: \x94 MEMOIZE (as 7) 48: h BINGET 1 50: \x85 TUPLE1 51: \x94 MEMOIZE (as 8) 52: h BINGET 1 54: \x85 TUPLE1 55: \x94 MEMOIZE (as 9) 56: e APPENDS (MARK at 13) 57: . STOP highest protocol among opcodes = 4 ============================== ============================== 0: \x80 PROTO 4 2: \x95 FRAME 47 11: ] EMPTY_LIST 12: \x94 MEMOIZE (as 0) 13: ( MARK 14: ] EMPTY_LIST 15: \x94 MEMOIZE (as 1) 16: ( MARK 17: K BININT1 1 19: K BININT1 2 21: K BININT1 3 23: K BININT1 4 25: e APPENDS (MARK at 16) 26: \x85 TUPLE1 27: \x94 MEMOIZE (as 2) 28: h BINGET 1 30: \x85 TUPLE1 31: \x94 MEMOIZE (as 3) 32: h BINGET 1 34: \x85 TUPLE1 35: \x94 MEMOIZE (as 4) 36: h BINGET 1 38: \x85 TUPLE1 39: \x94 MEMOIZE (as 5) 40: h BINGET 1 42: \x85 TUPLE1 43: \x94 MEMOIZE (as 6) 44: h BINGET 1 46: \x85 TUPLE1 47: \x94 MEMOIZE (as 7) 48: h BINGET 1 50: \x85 TUPLE1 51: \x94 MEMOIZE (as 8) 52: h BINGET 1 54: \x85 TUPLE1 55: \x94 MEMOIZE (as 9) 56: e APPENDS (MARK at 13) 57: . STOP highest protocol among opcodes = 4 ============================== ... ============================== 0: \x80 PROTO 4 2: \x95 FRAME 31 11: ] EMPTY_LIST 12: \x94 MEMOIZE (as 0) 13: ( MARK 14: ] EMPTY_LIST 15: \x94 MEMOIZE (as 1) 16: ( MARK 17: K BININT1 1 19: K BININT1 2 21: K BININT1 3 23: K BININT1 4 25: e APPENDS (MARK at 16) 26: \x85 TUPLE1 27: \x94 MEMOIZE (as 2) 28: h BINGET 1 30: \x85 TUPLE1 31: \x94 MEMOIZE (as 3) 32: h BINGET 1 34: \x85 TUPLE1 35: \x94 MEMOIZE (as 4) 36: h BINGET 1 38: \x85 TUPLE1 39: \x94 MEMOIZE (as 5) 40: e APPENDS (MARK at 13) 41: . STOP highest protocol among opcodes = 4 ============================== ``` </details> 2. Those batches become binary in an RDD at `parallelize` - so far results look fine 3. `rdd._to_java_object_rdd` -> `SerDeUtil.pythonToJava` -> `Unpickler.loads` 4. Here, Pryolite `Unpickler.loads` converts each Python object batch into Java objects 5. The converted Java objects look a bit odd from Pryolite as below: ``` [1, 2, 3, 4] [1, 2, 3, 4] ... [[[1, 2, 3, 4]], [[1, 2, 3, 4]]] [[[1, 2, 3, 4]], [[1, 2, 3, 4]]] [[[1, 2, 3, 4]], [[1, 2, 3, 4]]] ``` 6. Nested last three list are ignored via `EvaluatePython.makeFromJava` and becomes `null` -> `None`.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
