Joel Croteau created SPARK-24357:
------------------------------------

             Summary: createDataFrame in Python infers large integers as long type and then fails silently when converting them
                 Key: SPARK-24357
                 URL: https://issues.apache.org/jira/browse/SPARK-24357
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Joel Croteau
When inferring the schema of an RDD passed to createDataFrame, PySpark SQL infers any integral value as LongType, a signed 64-bit integer, without checking whether the value actually fits in a 64-bit slot. If a value is larger than 64 bits, then when it is pickled in Python and unpickled on the JVM side, Unpickler converts it to a BigInteger. When applySchemaToPythonRDD is then called, it does not handle the BigInteger type and returns null, so any large integer in the resulting DataFrame is silently converted to None.

This can produce very surprising and difficult-to-debug behavior, particularly if you are not aware of the limitation. There should either be a runtime error at some point in this conversion chain, or _infer_type should infer oversized integers as DecimalType with appropriate precision, or as BinaryType. The former would be less convenient for users, while the latter may be difficult to implement in practice. In any case, we should stop silently converting large integers to None.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
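For illustration, here is a minimal sketch (not PySpark's actual _infer_type code; the helper name is hypothetical) of the bounds check that is missing at inference time, showing why a value such as 2**64 cannot survive the round trip through LongType:

```python
import struct

# Signed 64-bit range, i.e. the range of Spark's LongType.
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1

def fits_in_long(value: int) -> bool:
    """Hypothetical check: does value fit in a signed 64-bit slot?"""
    return LONG_MIN <= value <= LONG_MAX

# 2**64 is outside the LongType range, so inferring LongType for it is wrong.
assert not fits_in_long(2**64)
assert fits_in_long(2**63 - 1)

# struct confirms the same thing: "<q" is a little-endian signed 64-bit int,
# and packing an out-of-range value raises struct.error rather than truncating.
try:
    struct.pack("<q", 2**64)
    packed = True
except struct.error:
    packed = False
print(packed)  # False: the value does not fit in 64 bits
```

A check along these lines in _infer_type could either raise immediately or fall back to a wider type such as DecimalType, instead of letting the value reach the JVM as a BigInteger and come back as None.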