Don Drake created SPARK-5722:
--------------------------------
Summary: Infer_schema_type incorrect for Integers in pyspark
Key: SPARK-5722
URL: https://issues.apache.org/jira/browse/SPARK-5722
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 1.2.0
Reporter: Don Drake
The integer datatype in Python does not match what a Scala/Java Integer is
defined as. This causes inference of data types and schemas to fail when a
value is larger than 2^31 - 1, since it is incorrectly inferred as an Integer.
Because the range of valid Python integers is wider than that of Java
Integers, this causes problems when inferring Integer vs. Long datatypes,
and in turn when attempting to save a SchemaRDD as Parquet or JSON.
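To make the range mismatch concrete, here is a small sketch (the variable
names are illustrative, not part of PySpark) comparing the example value
against the fixed Java ranges:

```python
# Python ints are arbitrary precision, so the single Python int type spans
# both Java's 32-bit Integer and 64-bit Long ranges (and beyond).
java_int_max = 2**31 - 1   # Java Integer.MAX_VALUE
java_long_max = 2**63 - 1  # Java Long.MAX_VALUE

v = 100000000000000  # the f2 value from the repro below
print(v > java_int_max)    # True: does not fit a Java Integer
print(v <= java_long_max)  # True: fits a Java Long
```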
Here's an example:
>>> sqlCtx = SQLContext(sc)
>>> from pyspark.sql import Row
>>> rdd = sc.parallelize([Row(f1='a', f2=100000000000000)])
>>> srdd = sqlCtx.inferSchema(rdd)
>>> srdd.schema()
StructType(List(StructField(f1,StringType,true),StructField(f2,IntegerType,true)))
That number is a LongType in Java, but an Integer in Python. We need to check
the value to see if it should really be a LongType when an IntegerType is
initially inferred.
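A minimal sketch of the value-based check being suggested (the helper name
infer_int_type is hypothetical, not the actual _infer_type implementation):

```python
# Classify a Python int by value range, not by Python type: only values
# that fit in a signed 32-bit integer should become IntegerType.
JAVA_INT_MIN, JAVA_INT_MAX = -(2**31), 2**31 - 1

def infer_int_type(value):
    """Return 'IntegerType' only if value fits in a signed 32-bit int."""
    if JAVA_INT_MIN <= value <= JAVA_INT_MAX:
        return 'IntegerType'
    return 'LongType'

print(infer_int_type(2**31 - 1))        # IntegerType
print(infer_int_type(100000000000000))  # LongType: f2 from the repro above
```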
More tests:
>>> from pyspark.sql import _infer_type
# OK
>>> print _infer_type(1)
IntegerType
# OK
>>> print _infer_type(2**31-1)
IntegerType
# WRONG
>>> print _infer_type(2**31)
IntegerType
# WRONG
>>> print _infer_type(2**61)
IntegerType
# OK
>>> print _infer_type(2**71)
LongType
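The pattern above is consistent with inference keyed off the Python 2
int/long type split, whose boundary on a 64-bit machine is the machine word
(2**63 - 1), not Java's 32-bit Integer boundary. A simplified model of that
behavior (an assumption about _infer_type, not its actual code):

```python
# Simulate type-based inference: treat anything that would be a Python 2
# `int` on a 64-bit build as IntegerType, and `long` values as LongType.
def type_based_infer(value, maxint=2**63 - 1):
    return 'IntegerType' if -maxint - 1 <= value <= maxint else 'LongType'

print(type_based_infer(2**31))  # IntegerType (WRONG: overflows Java int)
print(type_based_infer(2**61))  # IntegerType (WRONG)
print(type_based_infer(2**71))  # LongType    (OK, only by accident)
```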
Java Primitive Types defined:
http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
Python Built-in Types:
https://docs.python.org/2/library/stdtypes.html#typesnumeric
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)