[ https://issues.apache.org/jira/browse/SPARK-21613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Florentino Sainz updated SPARK-21613:
-------------------------------------

Description:

Hi,

I'm trying to perform some SQL queries on my database and I'm getting very weird behaviour:

{code}
sqlContext.sql('SELECT date, time, to_utc_timestamp(from_unixtime(unix_timestamp(CONCAT(date, " ", time), "dd-MMM-YYYY HH:mm:ss")), "Europe/Madrid") FROM MYTABLE WHERE YEAR=2017 AND MONTH=01 LIMIT 10').show()
{code}

No matter the month, this query always returns the same TIMESTAMP (corresponding to 01/01/2017), as if the month and day were not being processed at all. I've tested it on 1.6 and 2.1 (on my company's cluster and on Databricks). I also tested with an invalid month (e.g. "JJA") and it returns NULL, so the month is being read at some point.

I've isolated the problem with the following snippet (not using Spark SQL directly). I also tested with numeric months (the first part of the snippet), just in case, but there was no difference:

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4270191918950108/1394884673460602/1201575579760496/latest.html

{code}
from pyspark.sql.functions import concat, col, lit, from_utc_timestamp, to_utc_timestamp, unix_timestamp, from_unixtime

# Numeric months: same wrong result.
df = sqlContext.createDataFrame(
    [("11-08-2016", "20:11:18.222"), ("20-11-2016", "20:11:18.222")],
    ('Date', 'Time'))
df.select('Date', unix_timestamp(timestamp=col("Date"), format="dd-MM-YYYY")).show()
df.select('Date', to_utc_timestamp(
    from_unixtime(unix_timestamp(
        concat(col("Date"), lit(" "), col("Time")), "dd-MM-YYYY HH:mm:ss")),
    "Europe/Madrid")).show()

# Abbreviated month names: every row collapses to the start of the year.
df = sqlContext.createDataFrame(
    [("20-SEP-2017", "20:00:18.111"), ("20-OCT-2017", "20:11:18.222"),
     ("11-AUG-2016", "20:11:18.222"), ("20-OCT-2016", "20:11:18.222")],
    ('Date', 'Time'))
df.select('Date', unix_timestamp(timestamp=col("Date"), format="dd-MMM-YYYY")).show()
df.select('Date', to_utc_timestamp(
    from_unixtime(unix_timestamp(
        concat(col("Date"), lit(" "), col("Time")), "dd-MMM-YYYY HH:mm:ss")),
    "Europe/Madrid")).show()
{code}

Apart from the bug report, can anyone suggest a workaround? (I want to stay in DataFrames because my data size is quite big and I want to avoid serializing rows back into Python at this stage.)

Thanks,
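A likely explanation, sketched below: these patterns are handed to Java's SimpleDateFormat, where uppercase "YYYY" means week-year, not calendar year. With no week-of-year field in the pattern, the parsed date collapses to the start of the (week) year, which matches the constant 01/01 timestamps above; an invalid month name still fails to match, which matches the NULL for "JJA". Switching to lowercase "yyyy" (calendar year) should therefore work as a workaround. A minimal sketch against the same sample data, assuming the sqlContext from the snippet above:

{code}
from pyspark.sql.functions import concat, col, lit, to_utc_timestamp, unix_timestamp, from_unixtime

df = sqlContext.createDataFrame(
    [("20-SEP-2017", "20:00:18.111"), ("11-AUG-2016", "20:11:18.222")],
    ('Date', 'Time'))

# Same expression as in the reproduction, but with calendar-year "yyyy"
# instead of week-year "YYYY"; each row should now keep its own
# day, month, and year.
df.select('Date', to_utc_timestamp(
    from_unixtime(unix_timestamp(
        concat(col("Date"), lit(" "), col("Time")), "dd-MMM-yyyy HH:mm:ss")),
    "Europe/Madrid")).show()
{code}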
> Wrong unix_timestamp when parsing Dates
> ---------------------------------------
>
>                 Key: SPARK-21613
>                 URL: https://issues.apache.org/jira/browse/SPARK-21613
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.1, 2.1.0
>            Reporter: Florentino Sainz