WangGuangxin commented on issue #26548: [SPARK-29918][SQL] RecordBinaryComparator should check endianness when compared by long
URL: https://github.com/apache/spark/pull/26548#issuecomment-554554331

> Is https://issues.apache.org/jira/browse/SPARK-23207 actually the same issue? that was marked fixed.
>
> (Edited below to fix my example)
>
> So, hm, do we not also have a subtle problem with the signed nature of bytes and longs? Putting aside endianness issues, if you compare two longs starting with bytes like:
>
> 1000 0000 ...
> 0000 0000 ...
>
> The first is less than the second, because it's clearly negative while the other isn't.
> But comparing byte by byte, we'd consider the first one to be greater than the second.
> Would this also be an issue between aligned and unaligned machines?
>
> That said, I also don't know how common it is to mix this hardware? (I don't even know which machines do and don't support aligned access.) Is it common?

https://issues.apache.org/jira/browse/SPARK-23207 is the initial commit that addressed the `Shuffle + Repartition` case. It introduced `RecordBinaryComparator` as a local sort comparator so that the partition output stays consistent even when a stage is rerun after a failure.

This PR makes sure that `RecordBinaryComparator` always produces the same result, even across different architectures.

To clarify, suppose a map task's input has two records, A and B, each 8 bytes long:

>A: 00000001 00000000 * 6 00000000
>B: 00000000 00000000 * 6 00000001

In the first run, suppose the task runs on a *little endian* machine where `Platform.unaligned()` is true. A and B are then compared by reading each as a long, so A = 1 and B = 72057594037927936L, and A < B.

In a later rerun caused by a shuffle fetch failure, suppose the task runs on a machine where `Platform.unaligned()` is false. A and B are then compared byte by byte from low address to high address, so A > B.

This inconsistency between runs can still cause incorrect data.
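The A and B example can be reproduced outside Spark with a minimal sketch. It is not the `RecordBinaryComparator` code: the class `EndiannessDemo` and its method names are hypothetical, and `ByteBuffer` with an explicit `ByteOrder.LITTLE_ENDIAN` stands in for the unaligned long read on a little-endian machine. The last method shows one possible way to make the long-based comparison agree with the byte-wise one (reverse the bytes, then compare as unsigned); treat it as an assumption about the shape of a fix, not as the PR's actual change.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical demo class; not part of Spark.
final class EndiannessDemo {

    // Simulates the "read 8 bytes as a long" path on a little-endian machine,
    // followed by a signed comparison of the two longs.
    static int compareAsLittleEndianLong(byte[] a, byte[] b) {
        long la = ByteBuffer.wrap(a).order(ByteOrder.LITTLE_ENDIAN).getLong();
        long lb = ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN).getLong();
        return Long.compare(la, lb);
    }

    // Simulates the byte-wise path: compare unsigned bytes from low to high address.
    static int compareByteWise(byte[] a, byte[] b) {
        for (int i = 0; i < a.length; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return 0;
    }

    // Assumed fix sketch: reverse the little-endian longs so the lowest address
    // becomes the most significant byte, then compare as unsigned values.
    static int compareAsLongConsistently(byte[] a, byte[] b) {
        long la = ByteBuffer.wrap(a).order(ByteOrder.LITTLE_ENDIAN).getLong();
        long lb = ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN).getLong();
        return Long.compareUnsigned(Long.reverseBytes(la), Long.reverseBytes(lb));
    }

    public static void main(String[] args) {
        byte[] a = {1, 0, 0, 0, 0, 0, 0, 0}; // record A from the example
        byte[] b = {0, 0, 0, 0, 0, 0, 0, 1}; // record B from the example

        System.out.println("as little-endian long: " + compareAsLittleEndianLong(a, b)); // negative: A < B
        System.out.println("byte-wise:             " + compareByteWise(a, b));           // positive: A > B
        System.out.println("consistent long:       " + compareAsLongConsistently(a, b)); // positive: A > B
    }
}
```

The first two methods disagree on the order of A and B, which is the rerun inconsistency described above. The third agrees with the byte-wise result because the unsigned comparison also sidesteps the sign-bit question raised in the quoted comment.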
