Florentino Sainz created SPARK-21613:
----------------------------------------
Summary: Wrong unix_timestamp when parsing Dates
Key: SPARK-21613
URL: https://issues.apache.org/jira/browse/SPARK-21613
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.1.0, 1.6.1
Reporter: Florentino Sainz
Hi,
I'm trying to perform some SQL queries on my database and I'm getting very
weird behaviour:
{code}
sqlContext.sql('SELECT date, time, to_utc_timestamp(from_unixtime(unix_timestamp(CONCAT(date, " ", time), "dd-MMM-YYYY HH:mm:ss")), "Europe/Madrid") FROM MYTABLE WHERE YEAR=2017 AND MONTH=01 LIMIT 10').show()
{code}
No matter the month, this query always returns the same TIMESTAMP
(corresponding to 01/01/2017), as if the month and day were not being
processed at all. I've tested it on 1.6 and 2.1 (at my company's cluster
and on Databricks).
I've isolated the problem with the following snippet (not using Spark SQL
directly). I also tested with numeric months (first part of the snippet),
just in case, but it made no difference:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4270191918950108/1394884673460602/1201575579760496/latest.html
{code}
from pyspark.sql.functions import (concat, col, lit, from_utc_timestamp,
                                   to_utc_timestamp, unix_timestamp,
                                   from_unixtime)

# Numeric months, pattern "dd-MM-YYYY"
df = sqlContext.createDataFrame(
    [("11-08-2016", "20:11:18.222"), ("20-11-2016", "20:11:18.222")],
    ('Date', 'Time'))
df.select('Date',
          unix_timestamp(timestamp=col("Date"), format="dd-MM-YYYY")).show()
df.select('Date',
          to_utc_timestamp(from_unixtime(unix_timestamp(
              concat(col("Date"), lit(" "), col("Time")),
              "dd-MM-YYYY HH:mm:ss")), "Europe/Madrid")).show()

# Abbreviated month names, pattern "dd-MMM-YYYY"
df = sqlContext.createDataFrame(
    [("20-SEP-2017", "20:00:18.111"), ("20-OCT-2017", "20:11:18.222"),
     ("11-AUG-2016", "20:11:18.222"), ("20-OCT-2016", "20:11:18.222")],
    ('Date', 'Time'))
df.select('Date',
          unix_timestamp(timestamp=col("Date"), format="dd-MMM-YYYY")).show()
df.select('Date',
          to_utc_timestamp(from_unixtime(unix_timestamp(
              concat(col("Date"), lit(" "), col("Time")),
              "dd-MMM-YYYY HH:mm:ss")), "Europe/Madrid")).show()
{code}
Apart from the bug report, can anyone suggest a workaround? (I'm trying to
stick to DataFrames because my data size is quite big and I want to avoid
serializing my rows back into Python at this stage.)
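One thing I noticed while writing this up: if unix_timestamp follows Java's
SimpleDateFormat pattern letters (which I believe it does in these versions),
then uppercase "YYYY" means *week-based* year, not calendar year, so lowercase
"dd-MMM-yyyy HH:mm:ss" might be the workaround. Python has no "YYYY" pattern
letter, but isocalendar() exposes the same week-year concept; a minimal sketch
of how it diverges from the calendar year near a year boundary:

```python
from datetime import date

# Java SimpleDateFormat treats 'YYYY' as the week-based year and 'yyyy' as the
# calendar year. A parser given only day, month, and week-year cannot resolve
# the intended date, which would explain every row collapsing to the same
# timestamp. isocalendar() illustrates the week-year vs. calendar-year split:
d = date(2016, 1, 1)
iso_year, iso_week, iso_weekday = d.isocalendar()
print(d.year)    # 2016 -- calendar year
print(iso_year)  # 2015 -- week-based year (01/01/2016 falls in week 53 of 2015)
```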
Thanks,
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]