gatorsmile opened a new pull request #24032: [SPARK-27097] [CHERRY-PICK 2.4] 
Avoid embedding platform-dependent offsets literally in whole-stage generated 
code
URL: https://github.com/apache/spark/pull/24032
 
 
   ## What changes were proposed in this pull request?
   
   Spark SQL performs whole-stage code generation to speed up query execution. 
There are two steps to it:
   - Java source code is generated from the physical query plan on the driver. 
A single version of the source code is generated from a query plan, and sent to 
all executors.
     - It's compiled to bytecode on the driver to catch compilation errors before sending it to the executors, but currently only the generated source code gets sent to the executors; the driver-side bytecode compilation is for fail-fast only.
   - Executors receive the generated source code and compile to bytecode, then 
the query runs like a hand-written Java program.
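   The generate-compile-run flow above can be sketched with the JDK's `javax.tools` compiler standing in for the compiler Spark actually uses (Janino); the class name `Generated` and its source string here are illustrative only:
   ```java
   import javax.tools.JavaCompiler;
   import javax.tools.ToolProvider;
   import java.net.URL;
   import java.net.URLClassLoader;
   import java.nio.file.Files;
   import java.nio.file.Path;

   public class CodegenSketch {
       public static void main(String[] args) throws Exception {
           // Stand-in for the generated source code shipped from the driver.
           String source = "public class Generated {"
                         + "  public static long answer() { return 42L; }"
                         + "}";
           // Write the source to a temp directory and compile it to bytecode,
           // as an executor would on receipt.
           Path dir = Files.createTempDirectory("codegen");
           Path file = dir.resolve("Generated.java");
           Files.write(file, source.getBytes("UTF-8"));
           JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
           int status = compiler.run(null, null, null, file.toString());
           if (status != 0) throw new IllegalStateException("compilation failed");
           // Load the freshly compiled class and invoke the generated method.
           try (URLClassLoader loader =
                    new URLClassLoader(new URL[] { dir.toUri().toURL() })) {
               Class<?> cls = Class.forName("Generated", true, loader);
               long result = (Long) cls.getMethod("answer").invoke(null);
               System.out.println(result);
           }
       }
   }
   ```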
   
   In this model, there's an implicit assumption about the driver and executors 
being run on similar platforms. Some code paths accidentally embedded 
platform-dependent object layout information into the generated code, such as:
   ```java
   Platform.putLong(buffer, /* offset */ 24, /* value */ 1);
   ```
   This code expects a field to be at offset +24 within the `buffer` object, and writes a value to that field.
   But whole-stage code generation is performed on the driver, so platform-dependent information embedded this way comes from the driver's JVM. If the object layout differs significantly between the driver and the executors, the generated code can read from or write to the wrong offsets on the executors, causing all kinds of data corruption.
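   The platform dependence of such offsets can be observed directly with `sun.misc.Unsafe`, the JDK facility that Spark's `Platform` class wraps. The base offset printed below depends on the JVM's object layout (e.g. whether compressed oops are enabled), so two differently configured JVMs can legitimately disagree on it:
   ```java
   import java.lang.reflect.Field;
   import sun.misc.Unsafe;

   public class OffsetSketch {
       public static void main(String[] args) throws Exception {
           // Grab the Unsafe singleton reflectively (it is not a public API).
           Field f = Unsafe.class.getDeclaredField("theUnsafe");
           f.setAccessible(true);
           Unsafe unsafe = (Unsafe) f.get(null);

           // The base offset of byte[] element storage depends on the JVM's
           // object header layout, so a driver and an executor running with
           // different JVM settings can see different values here.
           long base = unsafe.arrayBaseOffset(byte[].class);
           System.out.println("byte[] base offset on this JVM: " + base);

           // Writing at the offset that is correct *for this JVM* round-trips.
           byte[] buffer = new byte[32];
           unsafe.putLong(buffer, base, 1L);
           System.out.println("read back: " + unsafe.getLong(buffer, base));
       }
   }
   ```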
   
   One code pattern that leads to such problems is the use of `Platform.XXX` constants in generated code, e.g. `Platform.BYTE_ARRAY_OFFSET`.
   
   Bad:
   ```scala
   val baseOffset = Platform.BYTE_ARRAY_OFFSET
   // codegen template:
   s"Platform.putLong($buffer, $baseOffset, $value);"
   ```
   This will embed the value of `Platform.BYTE_ARRAY_OFFSET` on the driver into 
the generated code.
   
   Good:
   ```scala
   val baseOffset = "Platform.BYTE_ARRAY_OFFSET"
   // codegen template:
   s"Platform.putLong($buffer, $baseOffset, $value);"
   ```
   This emits the offset symbolically, as `Platform.putLong(buffer, Platform.BYTE_ARRAY_OFFSET, value)`, so the correct value is picked up on each executor's JVM at runtime.
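   The difference between the two templates can be made concrete with a small Java sketch of the string building (the driver-side offset value `16` below is just an example):
   ```java
   public class TemplateSketch {
       // Bad: interpolates the driver's numeric offset into the generated
       // source, freezing a platform-dependent value at codegen time.
       static String bad(String buffer, long driverOffset, String value) {
           return "Platform.putLong(" + buffer + ", " + driverOffset + ", " + value + ");";
       }

       // Good: interpolates the constant's *name*, so each executor resolves
       // Platform.BYTE_ARRAY_OFFSET against its own JVM at class-load time.
       static String good(String buffer, String value) {
           return "Platform.putLong(" + buffer + ", Platform.BYTE_ARRAY_OFFSET, " + value + ");";
       }

       public static void main(String[] args) {
           System.out.println(bad("buffer", 16, "value"));
           System.out.println(good("buffer", "value"));
       }
   }
   ```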
   
   Caveat: these offset constants are declared as runtime-initialized `static 
final` in Java, so they're not compile-time constants from the Java language's 
perspective. It does lead to a slightly increased size of the generated code, 
but this is necessary for correctness.
   
   NOTE: there can be other patterns that generate platform-dependent code on the driver that is invalid on the executors. For example, if endianness differs between the driver and the executors and some generated code makes strong assumptions about endianness, that would also be problematic.
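   For illustration, the byte-order sensitivity can be seen with `java.nio.ByteBuffer`: the same `int` produces different byte sequences under the two orders, so code generated assuming one order would misread data written under the other:
   ```java
   import java.nio.ByteBuffer;
   import java.nio.ByteOrder;

   public class EndianSketch {
       public static void main(String[] args) {
           // Serialize the int 1 under both byte orders.
           ByteBuffer big = ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN);
           big.putInt(0, 1);
           ByteBuffer little = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
           little.putInt(0, 1);

           // The low-order byte lands at opposite ends of the buffer.
           System.out.println("big-endian bytes: "
               + big.get(0) + " " + big.get(1) + " " + big.get(2) + " " + big.get(3));
           System.out.println("little-endian bytes: "
               + little.get(0) + " " + little.get(1) + " " + little.get(2) + " " + little.get(3));
           System.out.println("native order of this host: " + ByteOrder.nativeOrder());
       }
   }
   ```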
   
   ## How was this patch tested?
   
   Added a new test suite, `WholeStageCodegenSparkSubmitSuite`. This suite needs to set the driver's extraJavaOptions to force the driver and the executors to use different Java object layouts, so it's run as an actual SparkSubmit job.
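   As a hedged sketch (not the suite's actual wiring), one way to force such a layout mismatch is to toggle HotSpot's `UseCompressedOops` flag differently on the driver and the executors; the class and jar names below are hypothetical:
   ```shell
   # Driver runs with compressed oops disabled, executors with it enabled,
   # so Platform offsets differ between the two JVMs.
   spark-submit \
     --conf "spark.driver.extraJavaOptions=-XX:-UseCompressedOops" \
     --conf "spark.executor.extraJavaOptions=-XX:+UseCompressedOops" \
     --class org.example.MyQueryJob \
     my-app.jar
   ```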
   
   Authored-by: Kris Mok <[email protected]>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
