GitHub user zasdfgbnm opened a pull request:
https://github.com/apache/spark/pull/18444
[SPARK-16542][SQL][PYSPARK] Fix bugs about types that result an array of
null when creating DataFrame using python
## What changes were proposed in this pull request?
This reopens https://github.com/apache/spark/pull/14198, with
merge conflicts resolved.
@ueshin Could you please take a look at my code?
Fix type-inference bugs that result in an array of nulls when creating a
DataFrame from Python.
Python's `array.array` has a richer set of element types than Python itself, e.g. we can have
`array('f',[1,2,3])` and `array('d',[1,2,3])`. The code in spark-sql and pyspark
didn't take this into account, so you could get an
array of null values when you have `array('f')` in your rows.
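To make the distinction concrete, here is a small sketch that uses only the standard library (no Spark involved). The variable names are illustrative and not taken from the patch. It shows that the typecode, rather than the Python type of the elements, is what records the precision:

```python
from array import array

float_arr = array('f', [0.1])   # C float, single precision
double_arr = array('d', [0.1])  # C double, same precision as Python's float

# Elements read back from either array are plain Python floats,
# so inspecting the elements alone cannot tell the two arrays apart.
element_types = (type(float_arr[0]), type(double_arr[0]))

# The typecode is what records the underlying C type.
typecodes = (float_arr.typecode, double_arr.typecode)

# Single precision cannot represent 0.1 as closely as double precision,
# so the value read back from the 'f' array differs from the literal.
float_roundtrips = float_arr[0] == 0.1
double_roundtrips = double_arr[0] == 0.1
```

Any inference that looks only at `type(element)` would therefore treat both arrays as arrays of doubles.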
A simple script that reproduces this bug:
```
from pyspark import SparkContext
from pyspark.sql import SQLContext,Row,DataFrame
from array import array
sc = SparkContext()
sqlContext = SQLContext(sc)
row1 = Row(floatarray=array('f',[1,2,3]), doublearray=array('d',[1,2,3]))
rows = sc.parallelize([ row1 ])
df = sqlContext.createDataFrame(rows)
df.show()
```
which produces the output:
```
+---------------+------------------+
|    doublearray|        floatarray|
+---------------+------------------+
|[1.0, 2.0, 3.0]|[null, null, null]|
+---------------+------------------+
```
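The approach named in the commit log for this PR (inferring the element type from `array.typecode`) can be pictured with the sketch below. It is a hypothetical, Spark-free illustration: `TYPECODE_TO_TYPE_NAME` and `infer_element_type_name` are made-up names, the mapping covers only the two typecodes from the reproduction, and the actual patch may well be structured differently.

```python
from array import array

# Hypothetical lookup table, for illustration only. The values are just the
# names of Spark SQL types as strings, so no Spark installation is needed.
TYPECODE_TO_TYPE_NAME = {
    'f': 'FloatType',
    'd': 'DoubleType',
}

def infer_element_type_name(arr):
    """Return a type name for the array's elements based on its typecode."""
    try:
        return TYPECODE_TO_TYPE_NAME[arr.typecode]
    except KeyError:
        # Typecodes outside the table are rejected instead of guessed.
        raise TypeError("unsupported array typecode: %r" % arr.typecode)

print(infer_element_type_name(array('f', [1, 2, 3])))  # FloatType
print(infer_element_type_name(array('d', [1, 2, 3])))  # DoubleType
```

Raising on unknown typecodes, instead of falling back to a default, is a design choice of this sketch: it surfaces unsupported inputs early instead of silently producing nulls.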
## How was this patch tested?
New test case added
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/zasdfgbnm/spark fix_array_infer
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18444.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18444
----
commit a127486d59528eae452dcbcc2ccfb68fdd7769b7
Author: Xiang Gao <[email protected]>
Date: 2016-07-09T00:58:14Z
use array.typecode to infer type
Python's array supports more types than Python itself; for example,
Python only has float, while array supports 'f' (float) and 'd' (double).
Switching to array.typecode helps Spark make a better inference.
For example, for the code:
from pyspark.sql.types import _infer_type
from array import array
a = array('f',[1,2,3,4,5,6])
_infer_type(a)
we get ArrayType(DoubleType,true) before the change,
but ArrayType(FloatType,true) after the change.
commit 70131f3b81575edf9073d5be72553730d6316bd6
Author: Xiang Gao <[email protected]>
Date: 2016-07-09T06:21:31Z
Merge branch 'master' into fix_array_infer
commit 505e819f415c2f754b5147908516ace6f6ddfe78
Author: Xiang Gao <[email protected]>
Date: 2016-07-13T12:53:18Z
sync with upstream
commit 05979ca6eabf723cf3849ec2bf6f6e9de26cb138
Author: Xiang Gao <[email protected]>
Date: 2016-07-14T08:07:12Z
add case (c: Float, FloatType) to fromJava
commit 5cd817a4e7ec68a693ee2a878a2e36b09b1965b6
Author: Xiang Gao <[email protected]>
Date: 2016-07-14T08:09:25Z
sync with upstream
commit cd2ec6bc707fb6e7255b3a6a6822c3667866c63c
Author: Xiang Gao <[email protected]>
Date: 2016-10-17T02:44:48Z
add test for array in dataframe
commit 527d969067e447f8bff6004570c27130346cdf76
Author: Xiang Gao <[email protected]>
Date: 2016-10-17T03:13:47Z
merge with upstream/master
commit 82223c02082793b899c7eeca70f7bbfcea516c28
Author: Xiang Gao <[email protected]>
Date: 2016-10-17T03:35:47Z
set unsigned types and Py_UNICODE as unsupported
commit 0a967e280b3250bf7217e61905ad28f010c4ed40
Author: Xiang Gao <[email protected]>
Date: 2016-10-17T17:46:35Z
fix code style
commit 2059435b45ed1f6337a4f935adcd029084cfec91
Author: Xiang Gao <[email protected]>
Date: 2016-10-18T00:11:05Z
fix the same problem for byte and short
commit 58b120c4d207d9332e6dcde20109651ad8e17190
Author: Xiang Gao <[email protected]>
Date: 2017-06-28T01:28:03Z
sync with upstream
----