How big a deal is this use case in a heterogeneous endianness environment? If we do want to fix it, we should do it right before Spark shuffles data, to minimize the performance penalty, i.e. turn big-endian encoded data into little-endian encoded data before it goes on the wire. This is a pretty involved change, and given the other things that might break across heterogeneous endianness environments, I am not sure it is high priority enough to warrant review bandwidth right now.
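A minimal sketch of the kind of normalization described above, assuming we pick little-endian as the wire format; the class and method names here are hypothetical illustrations, not Spark's actual shuffle code:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: write values in a fixed (little-endian) wire order
// before shuffle, and read them back the same way, so the result is
// independent of each host's native byte order.
public class EndianNormalize {
    // Encode v as little-endian bytes, whatever the native order is.
    static byte[] toWireFormat(long v) {
        return ByteBuffer.allocate(Long.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(v)
                .array();
    }

    // Decode little-endian bytes back to a long on any host.
    static long fromWireFormat(byte[] bytes) {
        return ByteBuffer.wrap(bytes)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getLong();
    }

    public static void main(String[] args) {
        long v = 19L;
        // Round-trips correctly on both big- and little-endian hosts.
        System.out.println(fromWireFormat(toWireFormat(v))); // prints 19
    }
}
```

The cost is a byte swap on big-endian hosts only; on little-endian hosts the conversion is a no-op, which is why doing it at the shuffle boundary rather than on every in-memory access keeps the penalty small.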
On Tue, Jan 12, 2016 at 7:30 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> I logged SPARK-12778 where endian awareness in Platform.java should help
> in a mixed endian set up.
>
> There could be other parts of the code base which are related.
>
> Cheers
>
> On Tue, Jan 12, 2016 at 7:01 AM, Adam Roberts <arobe...@uk.ibm.com> wrote:
>
>> Hi all, I've been experimenting with DataFrame operations in a mixed
>> endian environment - a big endian master with little endian workers. With
>> tungsten enabled I'm encountering data corruption issues.
>>
>> For example, with this simple test code:
>>
>> import org.apache.spark.SparkContext
>> import org.apache.spark._
>> import org.apache.spark.sql.SQLContext
>>
>> object SimpleSQL {
>>   def main(args: Array[String]): Unit = {
>>     if (args.length != 1) {
>>       println("Not enough args, you need to specify the master url")
>>     }
>>     val masterURL = args(0)
>>     println("Setting up Spark context at: " + masterURL)
>>     val sparkConf = new SparkConf
>>     val sc = new SparkContext(masterURL, "Unsafe endian test", sparkConf)
>>
>>     println("Performing SQL tests")
>>
>>     val sqlContext = new SQLContext(sc)
>>     println("SQL context set up")
>>     val df = sqlContext.read.json("/tmp/people.json")
>>     df.show()
>>     println("Selecting everyone's age and adding one to it")
>>     df.select(df("name"), df("age") + 1).show()
>>     println("Showing all people over the age of 21")
>>     df.filter(df("age") > 21).show()
>>     println("Counting people by age")
>>     df.groupBy("age").count().show()
>>   }
>> }
>>
>> Instead of getting
>>
>> +----+-----+
>> | age|count|
>> +----+-----+
>> |null|    1|
>> |  19|    1|
>> |  30|    1|
>> +----+-----+
>>
>> I get the following with my mixed endian set up:
>>
>> +-------------------+-----------------+
>> |                age|            count|
>> +-------------------+-----------------+
>> |               null|                1|
>> |1369094286720630784|72057594037927936|
>> |                 30|                1|
>> +-------------------+-----------------+
>>
>> and on another run:
>>
>> +-------------------+-----------------+
>> |                age|            count|
>> +-------------------+-----------------+
>> |                  0|72057594037927936|
>> |                 19|                1|
>>
>> Is Spark expected to work in such an environment? If I turn off tungsten
>> (sparkConf.set("spark.sql.tungsten.enabled", "false")), in 20 runs I don't
>> see any problems.
>>
>> Unless stated otherwise above:
>> IBM United Kingdom Limited - Registered in England and Wales with number
>> 741598.
>> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
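For what it's worth, the corrupted values in the output above are consistent with an endianness bug: each one is exactly the byte-swapped form of the expected 64-bit value. A quick check (my own observation, not stated in the thread):

```java
// The garbage values in the mixed-endian output are the expected values
// read with the wrong byte order: 19 is 0x13, which byte-swapped as a
// 64-bit long becomes 0x1300000000000000; a count of 1 becomes
// 0x0100000000000000.
public class ByteSwapCheck {
    public static void main(String[] args) {
        System.out.println(Long.reverseBytes(19L)); // prints 1369094286720630784
        System.out.println(Long.reverseBytes(1L));  // prints 72057594037927936
    }
}
```

Those are precisely the "age" and "count" values in the corrupted tables, which points at raw long fields being interpreted with the reader's native byte order rather than the writer's.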