[ https://issues.apache.org/jira/browse/SPARK-21613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Florentino Sainz updated SPARK-21613:
-------------------------------------
    Description: 
Hi,

I'm trying to perform some SQL queries on my database and I'm getting very 
weird behaviour:

{code}
sqlContext.sql('SELECT date, time, '
               'to_utc_timestamp(from_unixtime(unix_timestamp(CONCAT(date, " ", time), '
               '"dd-MMM-YYYY HH:mm:ss")), "Europe/Madrid") '
               'FROM MYTABLE WHERE YEAR=2017 AND MONTH=01 LIMIT 10').show()
{code}

With this query, no matter the month, we always get the same TIMESTAMP 
(corresponding to 01/01/2017), as if month and day were not being processed at 
all. I've tested it on 1.6 and 2.1 (at my company's cluster and at Databricks). 
I also tested with an invalid month (e.g. JJA) and that gives NULL, so the 
field is being processed at some point.
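
For completeness, here is a self-contained way to exercise the same Spark SQL 
expression without the real table (the sample rows and the MYTABLE registration 
below are just illustrative stand-ins, not the actual schema):

{code}
# Hypothetical stand-in for MYTABLE, only so the SQL expression above can be run.
sample = sqlContext.createDataFrame(
    [("20-JAN-2017", "20:11:18.222", 2017, 1),
     ("20-FEB-2017", "20:11:18.222", 2017, 2)],
    ('date', 'time', 'year', 'month'))
sample.registerTempTable("MYTABLE")

sqlContext.sql('SELECT date, time, '
               'to_utc_timestamp(from_unixtime(unix_timestamp(CONCAT(date, " ", time), '
               '"dd-MMM-YYYY HH:mm:ss")), "Europe/Madrid") '
               'FROM MYTABLE WHERE year = 2017 LIMIT 10').show()
{code}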

I've isolated the problem with the following snippet (not using Spark SQL 
directly). I also tested using numeric months (first part of the snippet), just 
in case, but it makes no difference:

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4270191918950108/1394884673460602/1201575579760496/latest.html
{code}
from pyspark.sql.functions import (concat, col, lit, from_utc_timestamp,
                                   to_utc_timestamp, unix_timestamp, from_unixtime)

# Numeric months, pattern "dd-MM-YYYY"
df = sqlContext.createDataFrame([("11-08-2016", "20:11:18.222"),
                                 ("20-11-2016", "20:11:18.222")], ('Date', 'Time'))
df.select('Date', unix_timestamp(timestamp=col("Date"), format="dd-MM-YYYY")).show()
df.select('Date', to_utc_timestamp(from_unixtime(unix_timestamp(
    concat(col("Date"), lit(" "), col("Time")), "dd-MM-YYYY HH:mm:ss")), "Europe/Madrid")).show()

# Abbreviated month names, pattern "dd-MMM-YYYY"
df = sqlContext.createDataFrame([("20-SEP-2017", "20:00:18.111"),
                                 ("20-OCT-2017", "20:11:18.222"),
                                 ("11-AUG-2016", "20:11:18.222"),
                                 ("20-OCT-2016", "20:11:18.222")], ('Date', 'Time'))
df.select('Date', unix_timestamp(timestamp=col("Date"), format="dd-MMM-YYYY")).show()
df.select('Date', to_utc_timestamp(from_unixtime(unix_timestamp(
    concat(col("Date"), lit(" "), col("Time")), "dd-MMM-YYYY HH:mm:ss")), "Europe/Madrid")).show()
{code}

Apart from the bug report, can anyone suggest a workaround? (I'm trying to stay 
with DataFrames because my data size is quite big and I want to avoid 
serializing my rows back into Python at this stage.)
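
One guess, untested against the real data: if unix_timestamp follows 
java.text.SimpleDateFormat pattern letters, then upper-case "Y" is the 
week-based year while lower-case "y" is the calendar year, which would explain 
every row collapsing to the same date while an invalid month still fails to 
parse. A minimal sketch with the pattern switched to lower-case "y", reusing 
the last DataFrame from the snippet above:

{code}
# Assumption: the pattern letters behave as in java.text.SimpleDateFormat,
# where "Y" = week-based year and "y" = calendar year.
df.select(
    'Date',
    to_utc_timestamp(
        from_unixtime(unix_timestamp(
            concat(col("Date"), lit(" "), col("Time")), "dd-MMM-yyyy HH:mm:ss")),
        "Europe/Madrid")).show()
{code}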

Thanks,



> Wrong unix_timestamp when parsing Dates
> ---------------------------------------
>
>                 Key: SPARK-21613
>                 URL: https://issues.apache.org/jira/browse/SPARK-21613
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.1, 2.1.0
>            Reporter: Florentino Sainz
>


