[ 
https://issues.apache.org/jira/browse/SPARK-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316940#comment-14316940
 ] 

Don Drake commented on SPARK-5722:
----------------------------------

Hi, I've submitted 2 pull requests for branch-1.2 and branch-1.3.

Please approve.

> Infer_schema_type incorrect for Integers in pyspark
> ---------------------------------------------------
>
>                 Key: SPARK-5722
>                 URL: https://issues.apache.org/jira/browse/SPARK-5722
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.0
>            Reporter: Don Drake
>
> The integer datatype in Python does not match the definition of a Scala/Java 
> integer.  Data type and schema inference fails when a value is larger than 
> 2^31 - 1 (the largest Java int) and is still inferred as an Integer.
> Since the range of valid Python integers is wider than that of Java Integers, 
> this causes problems when inferring Integer vs. Long datatypes, and again 
> when attempting to save a SchemaRDD as Parquet or JSON.
> Here's an example:
> {code}
> >>> sqlCtx = SQLContext(sc)
> >>> from pyspark.sql import Row
> >>> rdd = sc.parallelize([Row(f1='a', f2=100000000000000)])
> >>> srdd = sqlCtx.inferSchema(rdd)
> >>> srdd.schema()
> StructType(List(StructField(f1,StringType,true),StructField(f2,IntegerType,true)))
> {code}
> That number is a LongType in Java, but an Integer in Python.  We need to 
> check the value to see if it should really be a LongType when an IntegerType 
> is initially inferred.
> More tests:
> {code}
> >>> from pyspark.sql import _infer_type
> # OK
> >>> print _infer_type(1)
> IntegerType
> # OK
> >>> print _infer_type(2**31-1)
> IntegerType
> # WRONG
> >>> print _infer_type(2**31)
> IntegerType
> # WRONG
> >>> print _infer_type(2**61)
> IntegerType
> # OK
> >>> print _infer_type(2**71)
> LongType
> {code}
> Java Primitive Types defined:
> http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
> Python Built-in Types:
> https://docs.python.org/2/library/stdtypes.html#typesnumeric
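
The value check described in the issue could look roughly like the sketch below. It is a standalone illustration only, not the code in the pull requests and not pyspark's `_infer_type`: the helper name `infer_integral_type` and the string return values are hypothetical, and the bounds are the Java `int` and `long` ranges from the Oracle page linked above.

```python
# Hypothetical sketch: pick a Spark SQL integral type name from the value's
# magnitude instead of from the Python type alone.

JAVA_INT_MIN, JAVA_INT_MAX = -2**31, 2**31 - 1      # 32-bit signed Java int
JAVA_LONG_MIN, JAVA_LONG_MAX = -2**63, 2**63 - 1    # 64-bit signed Java long

try:
    INTEGRAL_TYPES = (int, long)   # Python 2 has separate int and long types
except NameError:
    INTEGRAL_TYPES = (int,)        # Python 3 has a single unbounded int


def infer_integral_type(value):
    """Return 'IntegerType' or 'LongType' depending on which Java range holds value."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, INTEGRAL_TYPES):
        raise TypeError("expected an integer, got %r" % type(value))
    if JAVA_INT_MIN <= value <= JAVA_INT_MAX:
        return "IntegerType"
    if JAVA_LONG_MIN <= value <= JAVA_LONG_MAX:
        return "LongType"
    # Assumption of this sketch: refuse values that no Java primitive can hold.
    raise ValueError("%d does not fit in a 64-bit Java long" % value)
```

Under this sketch, `2**31`, `2**61`, and `100000000000000` all map to `LongType`, while `2**71` is rejected because it exceeds the Java `long` range. Whether such a value should raise or fall back to another type is a design choice this sketch does not settle.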



