[
https://issues.apache.org/jira/browse/SPARK-16170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15346420#comment-15346420
]
Federico Ponzi commented on SPARK-16170:
----------------------------------------
Hi, and thanks for the response.
I've set this as a bug instead of an improvement because if I do:
{code}
from pyspark.sql.types import StructType, StructField, LongType, StringType
# Same schema and sqlContext as in the report quoted below.
sch = StructType([StructField("id", LongType(), True),
                  StructField("rol", StringType(), True)])
i = [(1, "rol"), (2.4, "str")]
sqlContext.createDataFrame(i, schema=sch)
{code}
Running this, I get the following traceback:
{code}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/spark/python/pyspark/sql/context.py", line 438, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/usr/local/spark/python/pyspark/sql/context.py", line 306, in _createFromLocal
    data = list(data)
  File "/usr/local/spark/python/pyspark/sql/context.py", line 423, in prepare
    _verify_type(obj, schema)
  File "/usr/local/spark/python/pyspark/sql/types.py", line 1311, in _verify_type
    _verify_type(v, f.dataType, f.nullable)
  File "/usr/local/spark/python/pyspark/sql/types.py", line 1283, in _verify_type
    raise TypeError("%s can not accept object %r in type %s" % (dataType, obj, type(obj)))
TypeError: LongType can not accept object 2.4 in type <type 'float'>
{code}
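In the meantime, the only way I see to detect the records that read.json silently nulls out is to filter for all-null rows after the fact. A minimal sketch (my own helper, assuming a DataFrame built with the schema from the report below, and assuming no legitimate record has every column null):
{code}
from pyspark.sql.functions import col

def silently_dropped(df):
    """Return the rows where every column is null, i.e. the records the
    JSON reader rejected against the schema (assumption: a fully-null
    row only arises from a rejected record)."""
    cond = None
    for name in df.columns:
        c = col(name).isNull()
        cond = c if cond is None else (cond & c)
    return df.filter(cond)

# Usage: a non-zero count means some input records did not match the schema.
# silently_dropped(df).count()
{code}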
> Throw error when row is not schema-compatible
> ---------------------------------------------
>
> Key: SPARK-16170
> URL: https://issues.apache.org/jira/browse/SPARK-16170
> Project: Spark
> Issue Type: Improvement
> Reporter: Federico Ponzi
> Priority: Minor
>
> We are using Spark to import some data from MySQL.
> We recently found that many of our imports were useless because our
> import function was wrongly applying a LongType schema to a float column.
> Consider this example:
> {code}
> from pyspark.sql import SQLContext
> from pyspark.sql.types import *
> sqlContext = SQLContext(sc)
> sch = StructType([StructField("id", LongType(), True),
>                   StructField("rol", StringType(), True)])
> i = ['{"id": 1, "rol": "str"}', '{"id": 2.4, "rol": "str"}']
> rdd = sc.parallelize(i)
> df = sqlContext.read.json(rdd, schema=sch)
> print df.collect()
> {code}
> The output is:
> {code}
> [Row(id=1, rol=u'str'), Row(id=None, rol=None)]
> {code}
> Every column in the second row is null, not only id (which has the wrong
> datatype), and no error is raised.
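> A possible workaround (a sketch, and an assumption on my side: I have not
> verified that the JSON reader's mode option treats a type mismatch the same
> way as a malformed record) is to make the reader fail instead of nulling the row:
> {code}
> # Assumption: in FAILFAST mode the JSON reader raises on records it
> # cannot convert to the given schema, instead of returning null rows.
> df = sqlContext.read.option("mode", "FAILFAST").json(rdd, schema=sch)
> {code}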