Florentino Sainz created SPARK-21613:
----------------------------------------

             Summary: Wrong unix_timestamp when parsing Dates
                 Key: SPARK-21613
                 URL: https://issues.apache.org/jira/browse/SPARK-21613
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.1.0, 1.6.1
            Reporter: Florentino Sainz


Hi,

I'm trying to run some SQL queries on my data and I'm getting very weird 
behaviour:

sqlContext.sql('SELECT date, time, 
to_utc_timestamp(from_unixtime(unix_timestamp(CONCAT(date, " ", 
time), "dd-MMM-YYYY HH:mm:ss")), "Europe/Madrid") FROM MYTABLE WHERE YEAR=2017 
AND MONTH=01 LIMIT 10').show();

No matter the month or day, this query always returns the same timestamp for a 
given year (here, the one corresponding to 01/01/2017), as if month and day 
were not being processed at all. I have reproduced it on 1.6 and 2.1, both on 
my company's cluster and on Databricks.

I've isolated the problem in the following snippet (not using Spark SQL 
directly). I also tested with numeric months (first part of the snippet), just 
in case, but it made no difference:

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4270191918950108/1394884673460602/1201575579760496/latest.html
{code}
from pyspark.sql.functions import concat, col, lit, from_utc_timestamp, to_utc_timestamp, unix_timestamp, from_unixtime

# Numeric months (pattern dd-MM-YYYY): same wrong result.
df = sqlContext.createDataFrame([("11-08-2016", "20:11:18.222"),
                                 ("20-11-2016", "20:11:18.222")], ('Date', 'Time'))
df.select('Date', unix_timestamp(timestamp=col("Date"), format="dd-MM-YYYY")).show()
df.select('Date', to_utc_timestamp(from_unixtime(unix_timestamp(
    concat(col("Date"), lit(" "), col("Time")), "dd-MM-YYYY HH:mm:ss")), "Europe/Madrid")).show()

# Textual months (pattern dd-MMM-YYYY): same wrong result.
df = sqlContext.createDataFrame([("20-SEP-2017", "20:00:18.111"),
                                 ("20-OCT-2017", "20:11:18.222"),
                                 ("11-AUG-2016", "20:11:18.222"),
                                 ("20-OCT-2016", "20:11:18.222")], ('Date', 'Time'))
df.select('Date', unix_timestamp(timestamp=col("Date"), format="dd-MMM-YYYY")).show()
df.select('Date', to_utc_timestamp(from_unixtime(unix_timestamp(
    concat(col("Date"), lit(" "), col("Time")), "dd-MMM-YYYY HH:mm:ss")), "Europe/Madrid")).show()
{code}
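If I had to guess at the cause (an assumption on my part, based on Java's 
SimpleDateFormat, which unix_timestamp delegates to in these Spark versions): 
uppercase YYYY denotes the week year, while the calendar year is lowercase 
yyyy, so with a pattern like dd-MM-YYYY the day and month do not pin down the 
date. Plain Python draws the same distinction (%Y is the calendar year, %G the 
ISO week year):

```python
from datetime import datetime

# Calendar year: day, month and year parse as expected.
ok = datetime.strptime("20-10-2016", "%d-%m-%Y")
print(ok)  # 2016-10-20 00:00:00

# Week year (%G, the analogue of SimpleDateFormat's YYYY) cannot stand alone:
# it only makes sense together with a week number (%V) and a weekday (%u).
try:
    datetime.strptime("20-10-2016", "%d-%m-%G")
except ValueError as e:
    print("rejected:", e)
```

If that is indeed the cause, the fix would simply be writing yyyy instead of 
YYYY in the patterns above.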

Apart from the bug report, can anyone suggest a workaround? (I'm trying to 
stay with DataFrames, since my data size is quite big and I want to avoid 
serializing the rows back into Python at this stage.)
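
In case it helps anyone verifying a workaround, here is a stdlib-only sketch 
(my assumption being that the intended pattern is dd-MM-yyyy HH:mm:ss, with a 
lowercase calendar year) of the epoch value the corrected unix_timestamp call 
should produce; the wall clock is treated as UTC purely for illustration, with 
the Europe/Madrid shift left to to_utc_timestamp on the Spark side:

```python
import calendar
from datetime import datetime, timezone

# Hypothetical expected value for the corrected pattern "dd-MM-yyyy HH:mm:ss"
# applied to one of the sample rows above.
dt = datetime.strptime("20-10-2016 20:11:18", "%d-%m-%Y %H:%M:%S")
epoch = calendar.timegm(dt.timetuple())  # seconds since the epoch, as UTC

# Round-trips back to the same wall-clock instant.
print(datetime.fromtimestamp(epoch, tz=timezone.utc))
```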

Thanks,




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
