Joel Croteau created SPARK-24357:
------------------------------------

             Summary: createDataFrame in Python infers large integers as long type and then fails silently when converting them
                 Key: SPARK-24357
                 URL: https://issues.apache.org/jira/browse/SPARK-24357
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Joel Croteau


When inferring the schema of an RDD passed to createDataFrame, PySpark SQL 
infers any integral value as a LongType, i.e. a signed 64-bit integer, without 
checking whether the value actually fits in 64 bits. If a value is wider than 
64 bits, then when it is pickled in Python and unpickled on the Java side, 
Unpickler converts it to a BigInteger. When applySchemaToPythonRDD is then 
applied, it does not recognize the BigInteger type and returns null, so any 
large integer in the resulting DataFrame is silently replaced with None (see 
the sketch below). This produces very surprising and difficult-to-debug 
behavior, particularly if you are not aware of the limitation. There should 
either be a runtime error at some point in this conversion chain, or else 
_infer_type should infer wider integers as DecimalType with appropriate 
precision, or as BinaryType. The former would be less convenient, and the 
latter may be problematic to implement in practice. In any case, we should 
stop silently converting large integers to None.
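
A minimal sketch of the failure, assuming a local SparkSession (the session 
setup here is illustrative; the silent null is the behavior described above, 
observed on 2.3.0):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# 2**64 does not fit in a signed 64-bit long, but the schema is still
# inferred as LongType.
df = spark.createDataFrame([(2**64,)], ["value"])
df.printSchema()  # value: long (nullable = true)

# The oversized value comes back as null, with no error raised anywhere.
df.show()
{code}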
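For comparison, a possible workaround along the lines of the DecimalType 
suggestion: pass an explicit schema instead of relying on inference. The 
precision of 38 below is only an illustration, and the values must be 
supplied as decimal.Decimal:

{code:python}
from decimal import Decimal

from pyspark.sql.types import DecimalType, StructField, StructType

schema = StructType([StructField("value", DecimalType(38, 0))])
df = spark.createDataFrame([(Decimal(2**64),)], schema)

# The full value survives the round trip: 18446744073709551616
df.show(truncate=False)
{code}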


