WangGuangxin opened a new pull request #26548: [SPARK-][SQL] RecordBinaryComparator should check endianness when compared by long
URL: https://github.com/apache/spark/pull/26548

### What changes were proposed in this pull request?

This PR makes the two comparison paths in `RecordBinaryComparator`, comparing 8 bytes at a time and comparing byte by byte, produce *consistent* results by checking the *endianness* of the platform. Without this check, `Repartition+Shuffle` can still produce incorrect results in a YARN cluster whose machines have different architectures (some support unaligned access while others do not).

### Why are the changes needed?

If the architecture supports unaligned access, or both offsets are 8-byte aligned, `RecordBinaryComparator` compares 8 bytes at a time by reading them as a `long`. The related code is:

```
if (Platform.unaligned() || (((leftOff + i) % 8 == 0) && ((rightOff + i) % 8 == 0))) {
  while (i <= leftLen - 8) {
    long v1 = Platform.getLong(leftObj, leftOff + i);
    long v2 = Platform.getLong(rightObj, rightOff + i);
    if (v1 != v2) {
      // On little-endian platforms, reverse the bytes so that the long
      // comparison follows the same order as the byte-by-byte comparison.
      if (ByteOrder.nativeOrder().equals(ByteOrder.LITTLE_ENDIAN)) {
        v1 = Long.reverseBytes(v1);
        v2 = Long.reverseBytes(v2);
      }
      return v1 > v2 ? 1 : -1;
    }
    i += 8;
  }
}
```

Otherwise, it compares byte by byte. The related code is:

```
while (i < leftLen) {
  // Mask with 0xff so that each byte is compared as an unsigned value.
  final int v1 = Platform.getByte(leftObj, leftOff + i) & 0xff;
  final int v2 = Platform.getByte(rightObj, rightOff + i) & 0xff;
  if (v1 != v2) {
    return v1 > v2 ? 1 : -1;
  }
  i += 1;
}
```

However, on a little-endian machine, the result of comparing as `long` values and the result of comparing byte by byte may differ. If the machines in a YARN cluster have different architectures (some capable of unaligned access while others are not), then the relative order of two records after sorting is nondeterministic, which leads to the same problem as in https://issues.apache.org/jira/browse/SPARK-23207.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Tested manually in our cluster. I am not sure how to cover this in a unit test (UT).
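One possible starting point for a unit test is to simulate both byte orders on a single machine. The sketch below is illustrative and is not part of the patch. The class `EndianCompareSketch` and its methods are hypothetical. `java.nio.ByteBuffer` with an explicit `ByteOrder` stands in for `Platform.getLong`, and the inputs are assumed to have equal lengths that are multiples of 8. The corrected path uses `Long.compareUnsigned`, which is an assumption of this sketch rather than what the code above does: a signed `>` on the reversed values could still disagree with the unsigned byte-wise order when the leading bytes differ in their high bit.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianCompareSketch {
    // Reference ordering: unsigned byte-by-byte comparison, as in the fallback loop.
    public static int compareBytewise(byte[] a, byte[] b) {
        for (int i = 0; i < a.length; i++) {
            int v1 = a[i] & 0xff;
            int v2 = b[i] & 0xff;
            if (v1 != v2) {
                return v1 > v2 ? 1 : -1;
            }
        }
        return 0;
    }

    // Word-at-a-time comparison. Reading through a ByteBuffer with the given
    // order simulates what a native 8-byte read returns on such a machine.
    public static int compareByLong(byte[] a, byte[] b, ByteOrder order, boolean fixEndianness) {
        ByteBuffer ba = ByteBuffer.wrap(a).order(order);
        ByteBuffer bb = ByteBuffer.wrap(b).order(order);
        for (int i = 0; i + 8 <= a.length; i += 8) {
            long v1 = ba.getLong(i);
            long v2 = bb.getLong(i);
            if (v1 != v2) {
                if (!fixEndianness) {
                    // Naive path: compare the raw longs as signed values.
                    return v1 > v2 ? 1 : -1;
                }
                if (order == ByteOrder.LITTLE_ENDIAN) {
                    // Restore the in-memory byte order before comparing.
                    v1 = Long.reverseBytes(v1);
                    v2 = Long.reverseBytes(v2);
                }
                // Assumption of this sketch: unsigned comparison, to match
                // the unsigned byte-wise ordering above.
                return Integer.signum(Long.compareUnsigned(v1, v2));
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        byte[] a = {1, 0, 0, 0, 0, 0, 0, 2};
        byte[] b = {2, 0, 0, 0, 0, 0, 0, 1};
        System.out.println("bytewise:        " + compareBytewise(a, b));
        System.out.println("naive LE long:   " + compareByLong(a, b, ByteOrder.LITTLE_ENDIAN, false));
        System.out.println("checked LE long: " + compareByLong(a, b, ByteOrder.LITTLE_ENDIAN, true));
        System.out.println("checked BE long: " + compareByLong(a, b, ByteOrder.BIG_ENDIAN, true));
    }
}
```

For these two inputs the byte-wise comparison and the naive little-endian `long` comparison disagree, while the checked variants agree with the byte-wise result under both byte orders. A real test would still need to call `RecordBinaryComparator` itself; this sketch only isolates the ordering logic.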
