[
https://issues.apache.org/jira/browse/SPARK-26693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769856#comment-16769856
]
Jason Ferrell commented on SPARK-26693:
---------------------------------------
Perhaps this is an issue with Zeppelin's %sql interpreter. Taking the example
below and adding these lines to the PySpark paragraph:
from pyspark.sql.types import *
sqlContext.sql('select * from global_temp.testTable').show(3)
Result:
+-------------------+-------------------+-------------------+
|         idAsString|         idAsBigint|           idAsLong|
+-------------------+-------------------+-------------------+
|4065453307562594031|4065453307562594031|4065453307562594031|
|7659957277770523059|7659957277770523059|7659957277770523059|
|1614560078712787995|1614560078712787995|1614560078712787995|
+-------------------+-------------------+-------------------+
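A possible hint (my assumption, not confirmed against the Spark or Zeppelin
code): the values shown by the %sql paragraph are exactly what you get by
round-tripping the 19-digit IDs through an IEEE-754 double, which is what a
JavaScript front end would do when it renders the result set. A minimal
Python sketch of that round trip:

```python
from decimal import Decimal

ids = ["4065453307562594031", "7659957277770523059", "1614560078712787995"]

# float(s) snaps each ID to the nearest IEEE-754 double (53-bit mantissa);
# repr() then gives the shortest decimal that round-trips, which is what a
# JavaScript Number would display.
shown = {s: int(Decimal(repr(float(s)))) for s in ids}
for s, v in shown.items():
    print(s, "->", v)
```

All three results match the %sql output reported below, which suggests the
data itself is intact and only the display path loses precision: integers
above 2^53 cannot be represented exactly as doubles.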
> Large Numbers Truncated
> ------------------------
>
> Key: SPARK-26693
> URL: https://issues.apache.org/jira/browse/SPARK-26693
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Environment: Code was run in Zeppelin using Spark 2.4.
> Reporter: Jason Blahovec
> Priority: Major
>
> We have a process that takes a file dumped from an external API and formats
> it for use in other processes. These API dumps are brought into Spark with
> all fields read in as strings. One of the fields is a 19 digit visitor ID.
> Since implementing Spark 2.4 a few weeks ago, we have noticed that dataframes
> read the 19 digits correctly but any function in SQL appears to truncate the
> last two digits and replace them with "00".
> Our process is set up to convert these numbers to bigint, which worked before
> Spark 2.4. We looked into data types, and the possibility of changing to a
> "long" type with no luck. At that point we tried bringing in the string
> value as is, with the same result. I've added code that should replicate the
> issue with a few 19 digit test cases and demonstrating the type conversions I
> tried.
> Results for the code below are shown here:
> dfTestExpanded.show:
> +-------------------+-------------------+-------------------+
> |         idAsString|         idAsBigint|           idAsLong|
> +-------------------+-------------------+-------------------+
> |4065453307562594031|4065453307562594031|4065453307562594031|
> |7659957277770523059|7659957277770523059|7659957277770523059|
> |1614560078712787995|1614560078712787995|1614560078712787995|
> +-------------------+-------------------+-------------------+
> Run this query in a paragraph:
> %sql
> select * from global_temp.testTable
> and see these results (all 3 columns):
> 4065453307562594000
> 7659957277770523000
> 1614560078712788000
>
> Another notable observation is that this issue does not appear to affect
> joins on the affected fields - we see the issue only when the fields are
> used in where clauses or as part of a select list.
>
>
> {code:python}
> %pyspark
> from pyspark.sql.functions import *
> from pyspark.sql.types import *  # needed for StructField, StructType, StringType
> sfTestValue = StructField("testValue", StringType(), True)
> schemaTest = StructType([sfTestValue])
> listTestValues = []
> listTestValues.append(("4065453307562594031",))
> listTestValues.append(("7659957277770523059",))
> listTestValues.append(("1614560078712787995",))
> dfTest = spark.createDataFrame(listTestValues, schemaTest)
> dfTestExpanded = dfTest.selectExpr(
>     "testValue as idAsString",
>     "cast(testValue as bigint) as idAsBigint",
>     "cast(testValue as long) as idAsLong")
> dfTestExpanded.show()  # Shows all three columns correctly.
> # Querying this view from a %sql paragraph shows the truncated values.
> dfTestExpanded.createOrReplaceGlobalTempView('testTable'){code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]