[ 
https://issues.apache.org/jira/browse/SPARK-16542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiang Gao updated SPARK-16542:
------------------------------
    Description: 
This is a bug about types that results in an array of nulls when creating a 
DataFrame using Python.

Python's array.array has richer types than Python's built-in list, e.g. we can 
have array('f', [1, 2, 3]) and array('d', [1, 2, 3]). The code in Spark SQL does 
not take this into account, so you can get an array of null values when a row 
contains an array('f').

A minimal snippet to reproduce this:
{{from pyspark import SparkContext}}
{{from pyspark.sql import SQLContext, Row}}
{{from array import array}}

{{sc = SparkContext()}}
{{sqlContext = SQLContext(sc)}}

{{row1 = Row(floatarray=array('f', [1, 2, 3]), doublearray=array('d', [1, 2, 3]))}}
{{rows = sc.parallelize([row1])}}
{{df = sqlContext.createDataFrame(rows)}}
{{df.show()}}

which produces the output:
{{+---------------+------------------+}}
{{|    doublearray|        floatarray|}}
{{+---------------+------------------+}}
{{|[1.0, 2.0, 3.0]|[null, null, null]|}}
{{+---------------+------------------+}}
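Until this is fixed, one possible workaround (a minimal sketch, not part of the reporter's code; the helper name is hypothetical) is to convert the array.array to a plain Python list of floats before building the Row, so Spark's type inference sees ordinary float values regardless of the array's typecode:

```python
from array import array

def as_plain_floats(arr):
    """Hypothetical workaround: turn an array.array of any numeric typecode
    into a plain list of Python floats, sidestepping the typecode handling."""
    return [float(x) for x in arr]

# Both 'f' (C float) and 'd' (C double) arrays become the same plain list,
# so the typecode can no longer affect DataFrame creation.
floats = as_plain_floats(array('f', [1, 2, 3]))
doubles = as_plain_floats(array('d', [1, 2, 3]))
print(floats)   # [1.0, 2.0, 3.0]
print(doubles)  # [1.0, 2.0, 3.0]
```

With the row built as {{Row(floatarray=as_plain_floats(array('f', [1, 2, 3])))}}, {{df.show()}} should display the values instead of nulls, since plain Python floats are inferred as DoubleType.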



> Bug: types that result in an array of nulls when creating a DataFrame using 
> Python
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-16542
>                 URL: https://issues.apache.org/jira/browse/SPARK-16542
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>            Reporter: Xiang Gao
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
