[
https://issues.apache.org/jira/browse/SPARK-12778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ted Yu updated SPARK-12778:
---------------------------
Description:
In Platform.java, methods of Java Unsafe are called directly without
considering endianness.
In the thread 'Tungsten in a mixed endian environment', Adam Roberts reported data
corruption when "spark.sql.tungsten.enabled" is set to true in a mixed-endian
environment.
Platform.java should take endianness into account.
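One way to read "take endianness into account" is that every node converts to an agreed byte order around each raw load and store. The following is only a sketch of that idea under stated assumptions: the class EndianAwareLongs, the methods convert and roundTrip, and the choice of little-endian as the agreed order are all hypothetical, and a ByteBuffer stands in for the native-order memory that Unsafe would touch. It is not Spark's Platform API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration only; not part of Spark's Platform.java.
class EndianAwareLongs {
    // Assumption: all nodes agree to lay out shared data in little-endian order.
    static final ByteOrder WIRE_ORDER = ByteOrder.LITTLE_ENDIAN;

    // Swaps bytes only when the machine's order differs from the agreed order.
    // The operation is its own inverse, so it serves both writes and reads.
    static long convert(long value, ByteOrder machineOrder) {
        return machineOrder.equals(WIRE_ORDER) ? value : Long.reverseBytes(value);
    }

    // Simulates a raw native-order store on one machine followed by a raw
    // native-order load on another, with conversion applied on both sides.
    static long roundTrip(long value, ByteOrder writer, ByteOrder reader) {
        ByteBuffer memory = ByteBuffer.allocate(Long.BYTES);
        memory.order(writer).putLong(0, convert(value, writer));
        return convert(memory.order(reader).getLong(0), reader);
    }

    public static void main(String[] args) {
        // A big-endian writer and a little-endian reader now agree on the value.
        System.out.println(roundTrip(19L, ByteOrder.BIG_ENDIAN, ByteOrder.LITTLE_ENDIAN));
    }
}
```

With the conversion on both sides, the value survives regardless of which side is big endian; on a machine whose order already matches the agreed order the conversion is a no-op.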
Below is a copy of Adam's report:
I've been experimenting with DataFrame operations in a mixed endian environment
- a big endian master with little endian workers. With tungsten enabled I'm
encountering data corruption issues.
For example, with this simple test code:
{code}
import org.apache.spark.SparkContext
import org.apache.spark._
import org.apache.spark.sql.SQLContext

object SimpleSQL {
  def main(args: Array[String]): Unit = {
    // The master URL is the only argument; stop here rather than fail on args(0).
    if (args.length != 1) {
      println("Not enough args, you need to specify the master url")
      sys.exit(1)
    }
    val masterURL = args(0)
    println("Setting up Spark context at: " + masterURL)
    val sparkConf = new SparkConf
    val sc = new SparkContext(masterURL, "Unsafe endian test", sparkConf)

    println("Performing SQL tests")
    val sqlContext = new SQLContext(sc)
    println("SQL context set up")

    // Load the sample data and display it unmodified.
    val df = sqlContext.read.json("/tmp/people.json")
    df.show()

    println("Selecting everyone's age and adding one to it")
    df.select(df("name"), df("age") + 1).show()

    println("Showing all people over the age of 21")
    df.filter(df("age") > 21).show()

    // This aggregation is the one that shows corrupted values.
    println("Counting people by age")
    df.groupBy("age").count().show()
  }
}
{code}
Instead of getting
+----+-----+
| age|count|
+----+-----+
|null|    1|
|  19|    1|
|  30|    1|
+----+-----+
I get the following with my mixed endian setup:
+-------------------+-----------------+
|                age|            count|
+-------------------+-----------------+
|               null|                1|
|1369094286720630784|72057594037927936|
|                 30|                1|
+-------------------+-----------------+
and on another run:
+-------------------+-----------------+
|                age|            count|
+-------------------+-----------------+
|                  0|72057594037927936|
|                 19|                1|
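
The corrupted values are consistent with 64-bit integers being written in one byte order and read in the other: 72057594037927936 is 2^56, which is the value 1 with its eight bytes reversed, and 1369094286720630784 is 19 reversed the same way. The sketch below reproduces both numbers. It is an illustration only: the class EndianMismatchDemo and the method misread are hypothetical names, and a ByteBuffer with an explicit byte order stands in for the raw Unsafe accesses on the two kinds of machine.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical demo class; it simulates the mismatch with ByteBuffer, not Unsafe.
class EndianMismatchDemo {
    // Stores a long in big-endian order and loads it back as little-endian,
    // which is what happens when raw memory crosses between such machines.
    static long misread(long value) {
        ByteBuffer memory = ByteBuffer.allocate(Long.BYTES);
        memory.order(ByteOrder.BIG_ENDIAN).putLong(0, value);
        return memory.order(ByteOrder.LITTLE_ENDIAN).getLong(0);
    }

    public static void main(String[] args) {
        System.out.println(misread(1L));   // prints 72057594037927936
        System.out.println(misread(19L));  // prints 1369094286720630784
    }
}
```

The same result comes from Long.reverseBytes applied to 1 and 19, which suggests the count of 1 and the age of 19 crossed the master/worker boundary without any byte-order conversion.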
was:
In Platform.java, methods of Java Unsafe are called directly without
considering endianness.
In the thread 'Tungsten in a mixed endian environment', Adam Roberts reported data
corruption when "spark.sql.tungsten.enabled" is set to true in a mixed-endian
environment.
Platform.java should take endianness into account.
> Use of Java Unsafe should take endianness into account
> ------------------------------------------------------
>
> Key: SPARK-12778
> URL: https://issues.apache.org/jira/browse/SPARK-12778
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Reporter: Ted Yu