[
https://issues.apache.org/jira/browse/SPARK-24357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-24357.
----------------------------------
Resolution: Incomplete
> createDataFrame in Python infers large integers as long type and then fails
> silently when converting them
> ---------------------------------------------------------------------------------------------------------
>
> Key: SPARK-24357
> URL: https://issues.apache.org/jira/browse/SPARK-24357
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Joel Croteau
> Priority: Major
> Labels: bulk-closed
>
> When inferring the schema of an RDD passed to createDataFrame, PySpark SQL
> infers any integral value as LongType, a 64-bit signed integer, without
> checking whether the value actually fits in 64 bits. If a value is too
> large, then when it is pickled and unpickled on the Java side, the
> Unpickler converts it to a BigInteger. When applySchemaToPythonRDD is
> called, it does not recognize the BigInteger type and returns null. As a
> result, any large integer in the resulting DataFrame is silently converted
> to None. This can produce very surprising and difficult-to-debug behavior,
> particularly if you are not aware of the limitation. Either there should
> be a runtime error at some point in this conversion chain, or _infer_type
> should infer larger integers as DecimalType with appropriate precision, or
> as BinaryType. Raising an error would be less convenient for users, and
> inferring a wider type may be problematic to implement in practice, but in
> any case we should stop silently converting large integers to None.
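> A minimal reproduction, plus a workaround that supplies an explicit
> DecimalType schema instead of relying on inference (a sketch against the
> public PySpark API; exact output may differ by version):
> {code:python}
> from decimal import Decimal
> from pyspark.sql import SparkSession
> from pyspark.sql.types import DecimalType, StructField, StructType
>
> spark = SparkSession.builder.getOrCreate()
>
> # 2**64 does not fit in a signed 64-bit long, but the schema is still
> # inferred as LongType, and the value silently comes back as None.
> df = spark.createDataFrame([(2 ** 64,)], ["x"])
> df.printSchema()  # x: long (nullable = true)
> df.show()         # x is null on affected versions (e.g. 2.3.0)
>
> # Workaround: declare the column as DecimalType with enough precision
> # (values must be decimal.Decimal) so the integer survives intact.
> schema = StructType([StructField("x", DecimalType(20, 0))])
> df2 = spark.createDataFrame([(Decimal(2 ** 64),)], schema)
> df2.show()        # 18446744073709551616
> {code}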