[
https://issues.apache.org/jira/browse/SPARK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joseph K. Bradley updated SPARK-6857:
-------------------------------------
Description:
**UPDATE**: Closing this JIRA since the better fix will be improved UDT support.
See discussion in the comments.
If you try to use SQL's schema inference to create a DataFrame out of a list or
RDD of numpy types (such as numpy.float64), SQL will not recognize the numpy
types. It would be handy if it did.
E.g.:
{code}
import numpy
from collections import namedtuple
from pyspark.sql import SQLContext
MyType = namedtuple('MyType', 'x')
myValues = map(lambda x: MyType(x), numpy.random.randint(100, size=10))
sqlContext = SQLContext(sc)
data = sqlContext.createDataFrame(myValues)
{code}
The above code fails with:
{code}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 331, in createDataFrame
    return self.inferSchema(data, samplingRatio)
  File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 205, in inferSchema
    schema = self._inferSchema(rdd, samplingRatio)
  File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 160, in _inferSchema
    schema = _infer_schema(first)
  File "/Users/josephkb/spark/python/pyspark/sql/types.py", line 660, in _infer_schema
    fields = [StructField(k, _infer_type(v), True) for k, v in items]
  File "/Users/josephkb/spark/python/pyspark/sql/types.py", line 637, in _infer_type
    raise ValueError("not supported type: %s" % type(obj))
ValueError: not supported type: <type 'numpy.int64'>
{code}
But if we first cast to Python's built-in int (rather than a numpy type), it works:
{code}
myNativeValues = map(lambda x: MyType(int(x.x)), myValues)
data = sqlContext.createDataFrame(myNativeValues) # OK
{code}
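More generally, the cast-to-int workaround can be applied to any numpy scalar before schema inference: numpy scalar types share the base class numpy.generic, and their .item() method returns the nearest built-in Python type (numpy.int64 -> int, numpy.float64 -> float, etc.). A sketch of such a helper (the name to_native is illustrative, not part of any API):

```python
import numpy

def to_native(value):
    # numpy scalars (int64, float64, ...) all inherit from numpy.generic;
    # .item() converts them to the closest built-in Python type.
    # Anything else is passed through unchanged.
    if isinstance(value, numpy.generic):
        return value.item()
    return value

# Example: a row mixing numpy scalars and a plain string
row = [numpy.int64(42), numpy.float64(3.5), "label"]
native_row = [to_native(v) for v in row]
# native_row now holds int, float, str -- types schema inference accepts
```

Mapping such a helper over each field of every record before calling createDataFrame avoids the "not supported type" error without changing the data values.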
> Python SQL schema inference should support numpy types
> ------------------------------------------------------
>
> Key: SPARK-6857
> URL: https://issues.apache.org/jira/browse/SPARK-6857
> Project: Spark
> Issue Type: Improvement
> Components: MLlib, PySpark, SQL
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)