[
https://issues.apache.org/jira/browse/SPARK-33327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Punit Shah updated SPARK-33327:
-------------------------------
Description:
The attached csv file has two columns, namely "User" and "FromDate". By
default, the import infers the "FromDate" column as a timestamp.
* outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True)
* outDF.createOrReplaceTempView("table02")
In this default case the following sql generates *incorrect* results:
"select count(`User`) as cnt, first(`FromDate`) as
`FromDate_First`, last(`FromDate`) as `FromDate_Last`,
count(distinct(`FromDate`)) as cntdist from table02 group by `User`"
However, if we read the dataframe like so (where the "FromDate" column is
read in as a Date), the above sql query *also* generates *incorrect* results:
* outDF = spark_session.read.csv("users.csv", inferSchema=True,
header=True).selectExpr("`User`", "cast(`FromDate` as date)")
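For comparison, here is a minimal pure-Python sketch of what the query above computes per user, assuming "first" and "last" mean the first and last value in row order. The rows are hypothetical stand-ins for users.csv (the attachment's contents are not shown in this message), and the helper name aggregate is made up for illustration. Spark's documentation describes first and last as non-deterministic because their results depend on row order, which may change after a shuffle; the sketch only illustrates that dependence and does not establish whether it explains the results reported here.

```python
from collections import OrderedDict

# Hypothetical rows standing in for users.csv (real attachment not shown).
rows = [
    ("alice", "2020-01-03"),
    ("bob", "2020-02-01"),
    ("alice", "2020-01-01"),
    ("alice", "2020-01-01"),
    ("bob", "2020-02-05"),
]


def aggregate(rows):
    """Per-user count, first, last and distinct count, in arrival order."""
    groups = OrderedDict()
    for user, from_date in rows:
        groups.setdefault(user, []).append(from_date)
    return {
        user: {
            "cnt": len(dates),
            "FromDate_First": dates[0],    # depends on row order
            "FromDate_Last": dates[-1],    # depends on row order
            "cntdist": len(set(dates)),    # independent of row order
        }
        for user, dates in groups.items()
    }


in_file_order = aggregate(rows)
reordered = aggregate(list(reversed(rows)))

for user in in_file_order:
    print(user, in_file_order[user], reordered[user])
```

In this sketch cnt and cntdist are the same for both orderings, while FromDate_First and FromDate_Last swap when the rows arrive in reverse. Order-independent aggregates such as min and max would be one way to check whether row order is involved.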
> grouped by first and last against date column returns incorrect results
> -----------------------------------------------------------------------
>
> Key: SPARK-33327
> URL: https://issues.apache.org/jira/browse/SPARK-33327
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.6, 2.4.7
> Reporter: Punit Shah
> Priority: Major
> Attachments: users.csv
>