[ 
https://issues.apache.org/jira/browse/SPARK-17706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Dmitriev updated SPARK-17706:
------------------------------------
    Summary: DataFrame losing string data in yarn mode  (was: dataframe losing 
string data in yarn mode)

> DataFrame losing string data in yarn mode
> -----------------------------------------
>
>                 Key: SPARK-17706
>                 URL: https://issues.apache.org/jira/browse/SPARK-17706
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, YARN
>    Affects Versions: 1.5.0
>         Environment: RedHat 6.6, CDH 5.5.2
>            Reporter: Andrey Dmitriev
>
> For some reason, when I add a new column, append a string to existing data/columns, 
> or create a new DataFrame from code, it misinterprets string data, so the 
> show() function doesn't work properly, and filters (such as withColumn, where, 
> when, etc.) don't work either.
> Here is example code:
> {code}
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.{Row, SQLContext}
> import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
> object MissingValue {
>   // Render a string's UTF-8 bytes as hex, e.g. "ABC" -> "41-42-43"
>   def hex(str: String): String = str
>     .getBytes("UTF-8")
>     .map(f => Integer.toHexString(f & 0xFF).toUpperCase)
>     .mkString("-")
>   def main(args: Array[String]) {
>     val conf = new SparkConf().setAppName("MissingValue")
>     val sc = new SparkContext(conf)
>     sc.setLogLevel("WARN")
>     val sqlContext = new SQLContext(sc)
>     import sqlContext.implicits._
>     val list = List((101,"ABC"),(102,"BCD"),(103,"CDE"))
>     val rdd = sc.parallelize(list).map(f => Row(f._1,f._2))
>     val schema = StructType(
>       StructField("COL1",IntegerType,true)
>       ::StructField("COL2",StringType,true)
>       ::Nil
>     )
>     val df = sqlContext.createDataFrame(rdd,schema)
>     df.show()
>     val str = df.first().getString(1)
>     println(s"${str} == ${hex(str)}")
>     sc.stop()
>   }
> }
> {code}
> When I run it in local mode, everything works as expected:
> {code}
>     +----+----+
>     |COL1|COL2|
>     +----+----+
>     | 101| ABC|
>     | 102| BCD|
>     | 103| CDE|
>     +----+----+
>     
>     ABC == 41-42-43
> {code}
> But if I run the same code in yarn-client mode, it produces:
> {code}
>     +----+----+
>     |COL1|COL2|
>     +----+----+
>     | 101| ^E^@^@|
>     | 102| ^E^@^@|
>     | 103| ^E^@^@|
>     +----+----+
>     ^E^@^@ == 5-0-0
> {code}
> This problem exists only for string values; the first column (Integer) is fine.
> Also, if I create an RDD from the DataFrame, everything is fine, i.e.  
> {{df.rdd.take(1).apply(0).getString(1)}} returns the correct value.
> I'm using Spark 1.5.0 from CDH 5.5.2.
> It seems that this happens when the difference between driver memory and 
> executor memory is too high ({{--driver-memory xxG --executor-memory yyG}}), 
> i.e. when I decrease executor memory or increase driver memory, the 
> problem disappears.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
