Jason Blahovec created SPARK-26693:
--------------------------------------

             Summary: Large Numbers Truncated 
                 Key: SPARK-26693
                 URL: https://issues.apache.org/jira/browse/SPARK-26693
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.0
         Environment: Code was run in Zeppelin using Spark 2.4.
            Reporter: Jason Blahovec


We have a process that takes a file dumped from an external API and formats it 
for use in other processes.  These API dumps are brought into Spark with all 
fields read in as strings.  One of the fields is a 19-digit visitor ID.  Since 
upgrading to Spark 2.4 a few weeks ago, we have noticed that dataframes read 
the 19 digits correctly, but any function in SQL appears to truncate the last 
two digits and replace them with "00".

Our process is set up to convert these numbers to bigint, which worked before 
Spark 2.4.  We looked into data types, including the possibility of changing to 
a "long" type, with no luck.  At that point we tried bringing in the string 
value as is, with the same result.  I've added code that should replicate the 
issue with a few 19-digit test cases and demonstrate the type conversions I 
tried.

Results for the code below are shown here:

dfTestExpanded.show:

+-------------------+-------------------+-------------------+
|         idAsString|         idAsBigint|           idAsLong|
+-------------------+-------------------+-------------------+
|4065453307562594031|4065453307562594031|4065453307562594031|
|7659957277770523059|7659957277770523059|7659957277770523059|
|1614560078712787995|1614560078712787995|1614560078712787995|
+-------------------+-------------------+-------------------+

Run this query in a paragraph:

%sql

select * from global_temp.testTable

and see these results (all 3 columns):

4065453307562594000

7659957277770523000

1614560078712788000
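The trailing-zero pattern above looks like what happens when a 19-digit integer is round-tripped through a double-precision float somewhere in the display path (for example by a JSON serializer) - this is a guess on my part, not a confirmed root cause. A quick pure-Python check using the test IDs from this report:

```python
# 19-digit IDs exceed the 53-bit mantissa of an IEEE-754 double
# (exact integers are only guaranteed up to 2**53 = 9007199254740992),
# so any code path that converts them to a double loses the low digits.
ids = [4065453307562594031, 7659957277770523059, 1614560078712787995]
for v in ids:
    round_tripped = int(float(v))
    print(v, round_tripped, v == round_tripped)  # last column is False
```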

 

Another notable observation was that this issue does not appear to affect joins 
on the affected fields - we are seeing issues only when the fields are used in 
where clauses or as part of a select list.
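That would also fit a display-side float conversion: the underlying bigint values stay intact (so exact joins still match), and only the converted copy loses precision. At this magnitude, representable doubles are 512 apart, so two distinct IDs can even collide - a sketch of that, again assuming the float hypothesis:

```python
# Nearest double-representable value to one of the test IDs.
base = int(float(4065453307562594031))
# Doubles near 4e18 are spaced 512 apart, so base and base + 1
# map to the same double even though they are distinct bigints.
print(base != base + 1)                 # True: distinct as integers
print(float(base) == float(base + 1))   # True: equal as doubles
```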

 

 
{code:python}
%pyspark

# StructField, StructType, and StringType live in pyspark.sql.types,
# not pyspark.sql.functions.
from pyspark.sql.types import StructField, StructType, StringType

sfTestValue = StructField("testValue", StringType(), True)
schemaTest = StructType([sfTestValue])

listTestValues = []
listTestValues.append(("4065453307562594031",))
listTestValues.append(("7659957277770523059",))
listTestValues.append(("1614560078712787995",))

dfTest = spark.createDataFrame(listTestValues, schemaTest)

dfTestExpanded = dfTest.selectExpr(
    "testValue as idAsString",
    "cast(testValue as bigint) as idAsBigint",
    "cast(testValue as long) as idAsLong")

dfTestExpanded.show()  # This shows all three columns of data correctly.

# When this view is queried in a %sql paragraph, the truncated values appear.
dfTestExpanded.createOrReplaceGlobalTempView('testTable')
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
