[ https://issues.apache.org/jira/browse/SPARK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498651#comment-14498651 ]
Joseph K. Bradley commented on SPARK-6857:
------------------------------------------

Based on past discussions with [~mengxr], ML should use numpy and scipy types, rather than re-implementing all of that functionality. "Supporting numpy and scipy types in SQL" does not actually mean having numpy or scipy code in SQL. It would mean:
* Extending UDTs so users can register their own UDTs with the SQLContext.
* Adding UDTs for numpy and scipy types in MLlib.
* Allowing users to import or call something which registers those MLlib UDTs with SQL.

> Python SQL schema inference should support numpy types
> ------------------------------------------------------
>
>                 Key: SPARK-6857
>                 URL: https://issues.apache.org/jira/browse/SPARK-6857
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib, PySpark, SQL
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>
> **UPDATE**: Closing this JIRA since a better fix will be better UDT support. See discussion in comments.
>
> If you try to use SQL's schema inference to create a DataFrame out of a list or RDD of numpy types (such as numpy.float64), SQL will not recognize the numpy types. It would be handy if it did.
> E.g.:
> {code}
> import numpy
> from collections import namedtuple
> from pyspark.sql import SQLContext
> MyType = namedtuple('MyType', 'x')
> myValues = map(lambda x: MyType(x), numpy.random.randint(100, size=10))
> sqlContext = SQLContext(sc)
> data = sqlContext.createDataFrame(myValues)
> {code}
> The above code fails with:
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 331, in createDataFrame
>     return self.inferSchema(data, samplingRatio)
>   File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 205, in inferSchema
>     schema = self._inferSchema(rdd, samplingRatio)
>   File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 160, in _inferSchema
>     schema = _infer_schema(first)
>   File "/Users/josephkb/spark/python/pyspark/sql/types.py", line 660, in _infer_schema
>     fields = [StructField(k, _infer_type(v), True) for k, v in items]
>   File "/Users/josephkb/spark/python/pyspark/sql/types.py", line 637, in _infer_type
>     raise ValueError("not supported type: %s" % type(obj))
> ValueError: not supported type: <type 'numpy.int64'>
> {code}
> But if we cast to int (not numpy types) first, it's OK:
> {code}
> myNativeValues = map(lambda x: MyType(int(x.x)), myValues)
> data = sqlContext.createDataFrame(myNativeValues)  # OK
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
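The `int(x.x)` workaround quoted above can be generalized: every numpy scalar type (numpy.int64, numpy.float64, numpy.bool_, ...) subclasses `numpy.generic` and exposes an `.item()` method that returns the equivalent builtin Python value. The sketch below shows such a converter; `to_native` is a hypothetical helper name, not part of PySpark, and this is only an illustration of the pattern, not the UDT-based fix discussed in the comment:

```python
import numpy

def to_native(value):
    """Convert a numpy scalar to the equivalent builtin Python type.

    numpy scalars all subclass numpy.generic and expose .item(), which
    returns the corresponding native int/float/bool/etc. Plain Python
    values are passed through unchanged.
    """
    if isinstance(value, numpy.generic):
        return value.item()
    return value

print(type(to_native(numpy.int64(7))))      # <class 'int'>
print(type(to_native(numpy.float64(1.5))))  # <class 'float'>
print(to_native(3))                         # 3 (unchanged)
```

Records whose fields are passed through such a converter should then survive `sqlContext.createDataFrame`'s schema inference without hitting the "not supported type" ValueError, since only builtin types reach `_infer_type`.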