[ https://issues.apache.org/jira/browse/SPARK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498508#comment-14498508 ]
Joseph K. Bradley commented on SPARK-6857:
------------------------------------------
[~davies] Yes, that's OK with me. It's a bit inconsistent:
* In MLlib, we want to encourage users to use numpy and scipy types, rather
than the mllib.linalg.* types.
* In SQL, it's better if users use Python types or mllib.linalg.* types (for
which UDTs handle the conversion).
Perhaps the best fix will be better UDTs: If we can register any type (such as
numpy.array) with the SQLContext as a UDT, then users will be able to use numpy
and scipy types everywhere. I hope we can add that support before too long.
> Python SQL schema inference should support numpy types
> ------------------------------------------------------
>
> Key: SPARK-6857
> URL: https://issues.apache.org/jira/browse/SPARK-6857
> Project: Spark
> Issue Type: Improvement
> Components: MLlib, PySpark, SQL
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
>
> If you try to use SQL's schema inference to create a DataFrame out of a list
> or RDD of numpy types (such as numpy.float64), SQL will not recognize the
> numpy types. It would be handy if it did.
> E.g.:
> {code}
> import numpy
> from collections import namedtuple
> from pyspark.sql import SQLContext
> MyType = namedtuple('MyType', 'x')
> myValues = map(lambda x: MyType(x), numpy.random.randint(100, size=10))
> sqlContext = SQLContext(sc)
> data = sqlContext.createDataFrame(myValues)
> {code}
> The above code fails with:
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 331, in createDataFrame
>     return self.inferSchema(data, samplingRatio)
>   File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 205, in inferSchema
>     schema = self._inferSchema(rdd, samplingRatio)
>   File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 160, in _inferSchema
>     schema = _infer_schema(first)
>   File "/Users/josephkb/spark/python/pyspark/sql/types.py", line 660, in _infer_schema
>     fields = [StructField(k, _infer_type(v), True) for k, v in items]
>   File "/Users/josephkb/spark/python/pyspark/sql/types.py", line 637, in _infer_type
>     raise ValueError("not supported type: %s" % type(obj))
> ValueError: not supported type: <type 'numpy.int64'>
> {code}
> But if we cast to int (not numpy types) first, it's OK:
> {code}
> myNativeValues = map(lambda x: MyType(int(x.x)), myValues)
> data = sqlContext.createDataFrame(myNativeValues) # OK
> {code}
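
The cast-to-int workaround in the issue description can be generalized so that it does not depend on knowing each field's type in advance. The following is only a minimal sketch of that idea, not part of Spark or of any proposed fix: the helper names `to_native` and `to_native_row` are hypothetical, and it assumes that anything exposing a callable `.tolist()` (as numpy scalars and arrays do) should be converted to built-in Python objects before schema inference.

```python
from collections import namedtuple

# Same row type as in the issue description.
MyType = namedtuple('MyType', 'x')


def to_native(value):
    """Return a built-in Python equivalent of a numpy scalar or array.

    Assumption: duck typing on .tolist() is good enough here. numpy
    scalars and arrays both provide it and return plain Python objects;
    any value without a callable .tolist() is returned unchanged.
    """
    tolist = getattr(value, 'tolist', None)
    return tolist() if callable(tolist) else value


def to_native_row(row):
    """Rebuild a namedtuple row with every field passed through to_native."""
    return type(row)(*[to_native(v) for v in row])


# Hypothetical usage, following the variable names in the issue:
#   myNativeValues = [to_native_row(r) for r in myValues]
#   data = sqlContext.createDataFrame(myNativeValues)
```

This only sidesteps the inference error on the Python side; it does not make SQL's schema inference recognize numpy types, which is what the issue asks for.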