[ https://issues.apache.org/jira/browse/SPARK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498593#comment-14498593 ]

Davies Liu commented on SPARK-6857:
-----------------------------------

It's not good that we use array or numpy.array as part of the API, but we can't 
change that now. I'd suggest using Vector as part of the API in ml, and 
supporting easy and fast conversion to/from numpy.array.

numpy/scipy are only useful for mllib/ml; it's better to keep them out of the 
scope of SQL.
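
For reference, a minimal sketch of the numpy round trip that the existing 
pyspark.mllib.linalg.Vectors API already allows (variable names here are just 
for illustration; the exact shape of an ml-side Vector API is still open):

{code}
import numpy
from pyspark.mllib.linalg import Vectors

arr = numpy.array([1.0, 2.0, 3.0])

# numpy.ndarray -> DenseVector: Vectors.dense accepts array-like input
vec = Vectors.dense(arr)

# DenseVector -> numpy.ndarray
back = vec.toArray()
{code}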

> Python SQL schema inference should support numpy types
> ------------------------------------------------------
>
>                 Key: SPARK-6857
>                 URL: https://issues.apache.org/jira/browse/SPARK-6857
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib, PySpark, SQL
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>
> **UPDATE**: Closing this JIRA since the better fix will be improved UDT support.  
> See the discussion in the comments.
> If you try to use SQL's schema inference to create a DataFrame out of a list 
> or RDD of numpy types (such as numpy.float64), SQL will not recognize the 
> numpy types.  It would be handy if it did.
> E.g.:
> {code}
> import numpy
> from collections import namedtuple
> from pyspark.sql import SQLContext
> MyType = namedtuple('MyType', 'x')
> myValues = map(lambda x: MyType(x), numpy.random.randint(100, size=10))
> sqlContext = SQLContext(sc)
> data = sqlContext.createDataFrame(myValues)
> {code}
> The above code fails with:
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 331, in createDataFrame
>     return self.inferSchema(data, samplingRatio)
>   File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 205, in inferSchema
>     schema = self._inferSchema(rdd, samplingRatio)
>   File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 160, in _inferSchema
>     schema = _infer_schema(first)
>   File "/Users/josephkb/spark/python/pyspark/sql/types.py", line 660, in _infer_schema
>     fields = [StructField(k, _infer_type(v), True) for k, v in items]
>   File "/Users/josephkb/spark/python/pyspark/sql/types.py", line 637, in _infer_type
>     raise ValueError("not supported type: %s" % type(obj))
> ValueError: not supported type: <type 'numpy.int64'>
> {code}
> But if we cast to a native int (not a numpy type) first, it's OK:
> {code}
> myNativeValues = map(lambda x: MyType(int(x.x)), myValues)
> data = sqlContext.createDataFrame(myNativeValues) # OK
> {code}
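
A slightly more general sketch of that workaround: numpy scalar types expose 
.item(), which returns the corresponding native Python value, so numpy-valued 
fields can be converted before schema inference (the to_native helper below is 
hypothetical, just for illustration):

{code}
# Hypothetical helper: convert numpy scalar fields to native Python values
# before schema inference; numpy scalars provide .item() for this.
def to_native(row):
    return MyType(row.x.item())

myNativeValues = map(to_native, myValues)
data = sqlContext.createDataFrame(myNativeValues)  # OK
{code}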


