WangGuangxin commented on issue #26548: [SPARK-29918][SQL] 
RecordBinaryComparator should check endianness when compared by long
URL: https://github.com/apache/spark/pull/26548#issuecomment-554554331
 
 
   > Is https://issues.apache.org/jira/browse/SPARK-23207 actually the same 
issue? that was marked fixed.
   > 
   > (Edited below to fix my example)
   > 
   > So, hm, do we not also have a subtle problem with the signed nature of 
bytes and longs? Putting aside endianness issues, if you compare two longs 
starting with bytes like:
   > 
   > 1000 0000 ...
   > 0000 0000 ...
   > 
   > The first is less than the second, because it's clearly negative while the 
other isn't.
   > But comparing byte by byte, we'd consider the first one to be greater than 
the second.
   > Would this also be an issue between aligned and unaligned machines?
   > 
   > That said, I also don't know how common it is to mix this hardware? (I 
don't even know which machines do and don't support aligned access.) Is it 
common?
   
   https://issues.apache.org/jira/browse/SPARK-23207 is the initial commit that 
addresses the `Shuffle + Repartition` case. `RecordBinaryComparator` was 
introduced in that commit as a local sort comparator to make sure the partition 
output stays consistent even after a rerun caused by a stage failure.
   What this PR does is make sure that `RecordBinaryComparator` always 
produces the same result regardless of the machine architecture.
   
   To clarify, suppose a map task's input has two records, A and B, each 
8 bytes long (bytes listed from low address to high address):
   >A:  00000001  00000000 * 6  00000000
   >B:  00000000 00000000 * 6  00000001
   
   In the first run, suppose the task runs on a *little endian* machine whose 
`Platform.unaligned()` is true. A and B are then compared by reading each as a 
long, so A = 1 and B = 72057594037927936L (2^56), and therefore A < B.
   
   But in a later rerun caused by a shuffle fetch failure, suppose the task runs 
on a machine whose `Platform.unaligned()` is false. A and B are then compared 
byte by byte from low address to high address, so A > B, because the first 
byte of A (1) is greater than the first byte of B (0).
   
   This inconsistency between the original run and the rerun can still cause 
incorrect data.
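
The scenario above can be reproduced with a small standalone sketch. This is not Spark's code: the class `EndianCompareSketch` and its two helper methods are hypothetical names for illustration, `ByteBuffer` with an explicit `ByteOrder` stands in for a native 8-byte read on a machine of that endianness, and the byte-wise path treats bytes as unsigned (an assumption that does not affect this example).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class EndianCompareSketch {
    // Hypothetical helper: read each 8-byte record as one long in the given
    // byte order, then compare the two longs (signed comparison).
    static int compareAsLong(byte[] a, byte[] b, ByteOrder order) {
        long la = ByteBuffer.wrap(a).order(order).getLong();
        long lb = ByteBuffer.wrap(b).order(order).getLong();
        return Long.compare(la, lb);
    }

    // Hypothetical helper: compare byte by byte from low address to high
    // address, treating each byte as unsigned (assumption for this sketch).
    static int compareByteWise(byte[] a, byte[] b) {
        for (int i = 0; i < a.length; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        // Records A and B from the example, low address first.
        byte[] a = {1, 0, 0, 0, 0, 0, 0, 0};
        byte[] b = {0, 0, 0, 0, 0, 0, 0, 1};

        System.out.println("long, little endian: "
            + compareAsLong(a, b, ByteOrder.LITTLE_ENDIAN));
        System.out.println("long, big endian:    "
            + compareAsLong(a, b, ByteOrder.BIG_ENDIAN));
        System.out.println("byte-wise:           "
            + compareByteWise(a, b));
    }
}
```

In this sketch the little-endian long read orders A before B, while the byte-wise comparison orders A after B. The big-endian long read agrees with the byte-wise result for these two records, which is why the byte order has to be taken into account when comparing by long.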
   
   

