[
https://issues.apache.org/jira/browse/SPARK-27097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063909#comment-17063909
]
angerszhu commented on SPARK-27097:
-----------------------------------
[~irashid] to be honest, I ran into this problem recently.
[~dbtsai] I have some questions.
We run a self-developed Thrift server program that uses Spark as the compute
engine, started with the following JVM options:
{color:#e14141}-Xmx64g {color}
{color:#e14141}-Djava.library.path=/home/hadoop/hadoop/lib/native {color}
{color:#e14141}-Djavax.security.auth.useSubjectCredsOnly=false {color}
{color:#e14141}-Dcom.sun.management.jmxremote.port=9021 {color}
{color:#e14141}-Dcom.sun.management.jmxremote.authenticate=false {color}
{color:#e14141}-Dcom.sun.management.jmxremote.ssl=false {color}
{color:#e14141}-XX:MaxPermSize=1024m -XX:PermSize=256m
-XX:MaxDirectMemorySize=8192m -XX:-TraceClassUnloading {color}
{color:#e14141}-XX:+UseCompressedOops -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:+CMSClassUnloadingEnabled -XX:+UseCMSCompactAtFullCollection
-XX:CMSFullGCsBeforeCompaction=0 -XX:+CMSParallelRemarkEnabled
-XX:+DisableExplicitGC -XX:+PrintTenuringDistribution
-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75
-Xnoclassgc -XX:+PrintGCDetails -XX:+PrintGCDateStamps {color}
With these options, {color:#347eec}Platform.BYTE_ARRAY_OFFSET{color} is 24,
whereas when we start a normal Spark Thrift server it is 16. This mismatch
causes strange data corruption.
After a few days of investigation I traced the problem to Spark *codegen*, and
this PR fixes it for us, but I can't find evidence for why
Platform.BYTE_ARRAY_OFFSET is 24 under the options above. Testing locally, with
{color:#e14141}-XX:+UseCompressedOops{color} (compressed pointers enabled) the
offset is 16, and with {color:#e14141}-XX:-UseCompressedOops{color} (compressed
pointers disabled) it is 24; that difference is easy to understand. But I don't
know why the options above produce 24, since I am not an expert on JVM
internals.
Could you give me some advice or pointers on how to find the root cause?
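For reference, the offset can be checked directly on any JVM with a minimal
sketch like the following (the class name is illustrative; it reaches
sun.misc.Unsafe via reflection, the same mechanism Spark's Platform class
uses):

```java
import java.lang.reflect.Field;

import sun.misc.Unsafe;

// Prints the base offset of byte[] on the current JVM -- the same value
// Spark's Platform.BYTE_ARRAY_OFFSET is initialized to.
public class ByteArrayOffset {
    public static long byteArrayBaseOffset() throws Exception {
        // Spark's Platform obtains Unsafe the same way: via the private
        // "theUnsafe" field, since Unsafe.getUnsafe() rejects user code.
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);
        return unsafe.arrayBaseOffset(byte[].class);
    }

    public static void main(String[] args) throws Exception {
        // Typically 16 with compressed oops, 24 without (64-bit HotSpot).
        System.out.println("byte[] base offset = " + byteArrayBaseOffset());
    }
}
```

One lead worth verifying with -XX:+PrintFlagsFinal: HotSpot disables compressed
oops when the maximum heap size exceeds roughly 32 GB, so with -Xmx64g the
byte[] base offset can be 24 even though -XX:+UseCompressedOops appears on the
command line.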
> Avoid embedding platform-dependent offsets literally in whole-stage generated
> code
> ----------------------------------------------------------------------------------
>
> Key: SPARK-27097
> URL: https://issues.apache.org/jira/browse/SPARK-27097
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0, 2.1.3, 2.2.3, 2.3.4, 2.4.0
> Reporter: Xiao Li
> Assignee: Kris Mok
> Priority: Critical
> Labels: correctness
> Fix For: 2.4.1
>
>
> Avoid embedding platform-dependent offsets literally in whole-stage generated
> code.
> Spark SQL performs whole-stage code generation to speed up query execution.
> There are two steps to it:
> 1. Java source code is generated from the physical query plan on the driver. A
> single version of the source code is generated from a query plan and sent to
> all executors. It's compiled to bytecode on the driver to catch compilation
> errors before sending to executors, but currently only the generated source
> code gets sent to the executors; the bytecode compilation is for fail-fast
> only.
> 2. Executors receive the generated source code and compile it to bytecode,
> then the query runs like a hand-written Java program.
> In this model, there's an implicit assumption about the driver and executors
> being run on similar platforms. Some code paths accidentally embedded
> platform-dependent object layout information into the generated code, such as:
> {code:java}
> Platform.putLong(buffer, /* offset */ 24, /* value */ 1);
> {code}
> This code expects a field to be at offset +24 of the buffer object, and sets
> a value to that field.
> But whole-stage code generation generally uses platform-dependent information
> from the driver. If the object layout is significantly different on the
> driver and executors, the generated code can be reading/writing to wrong
> offsets on the executors, causing all kinds of data corruption.
> One code pattern that leads to such problems is the use of Platform.XXX
> constants in generated code, e.g. Platform.BYTE_ARRAY_OFFSET.
> Bad:
> {code:java}
> val baseOffset = Platform.BYTE_ARRAY_OFFSET
> // codegen template:
> s"Platform.putLong($buffer, $baseOffset, $value);"
> {code}
> This embeds the driver's value of Platform.BYTE_ARRAY_OFFSET into the
> generated code.
> Good:
> {code:java}
> val baseOffset = "Platform.BYTE_ARRAY_OFFSET"
> // codegen template:
> s"Platform.putLong($buffer, $baseOffset, $value);"
> {code}
> This generates the offset symbolically -- Platform.putLong(buffer,
> Platform.BYTE_ARRAY_OFFSET, value) -- which picks up the correct value on the
> executors.
> Caveat: these offset constants are declared as runtime-initialized static
> final in Java, so they're not compile-time constants from the Java language's
> perspective. It does lead to a slightly increased size of the generated code,
> but this is necessary for correctness.
> NOTE: there can be other patterns that generate platform-dependent code on
> the driver which is invalid on the executors, e.g. if the endianness differs
> between the driver and the executors and some generated code makes strong
> assumptions about endianness, that would also be problematic.
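To make the endianness caveat above concrete, here is a minimal, self-contained
sketch (illustrative only, not Spark code) showing the same eight bytes
decoding to different long values under different byte orders:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Decodes the same raw bytes under both byte orders and reports the
// platform's native order, which is what raw-memory reads would follow.
public class EndianCheck {
    public static long decode(byte[] bytes, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).getLong();
    }

    public static void main(String[] args) {
        byte[] bytes = {0, 0, 0, 0, 0, 0, 0, 1};
        System.out.println("native order : " + ByteOrder.nativeOrder());
        // big-endian reads these bytes as 1; little-endian as 1L << 56
        System.out.println("big-endian   : " + decode(bytes, ByteOrder.BIG_ENDIAN));
        System.out.println("little-endian: " + decode(bytes, ByteOrder.LITTLE_ENDIAN));
    }
}
```

Generated code that hard-codes one interpretation of raw bytes would therefore
read different values on a driver and an executor whose native byte orders
differ.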
--
This message was sent by Atlassian Jira
(v8.3.4#803005)