Pawan created SPARK-25873:
-----------------------------
Summary: Date corruption when Spark and Hive both are on different
timezones
Key: SPARK-25873
URL: https://issues.apache.org/jira/browse/SPARK-25873
Project: Spark
Issue Type: Bug
Components: Spark Core, Spark Shell, Spark Submit
Affects Versions: 2.2.1
Reporter: Pawan
There is date alteration when loading date from one table to another in hive
through spark. This happens when Hive is on a remote machine with timezone
different than the one on which Spark is running. This happens only when the
Source table format is
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
Below are the steps to produce the issue:
1. Create two tables as below in hive which has a timezone, say in, EST
-----------SOURCE TABLE----------------
CREATE TABLE t_src(
name varchar(10),
dob timestamp
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
INSERT INTO t_src VALUES ('p1', '0001-01-01 00:00:00.0'),('p2', '0002-01-01
00:00:00.0'), ('p3', '0003-01-01 00:00:00.0'),('p4', '0004-01-01 00:00:00.0');
-----------TARGET TABLE----------------
CREATE TABLE t_tgt(
name varchar(10),
dob timestamp
);
2. Copy hive-site.xml to spark-2.2.1-bin-hadoop2.7/conf folder, so that when
you create sqlContext for hive it connects to your remote hive server.
3. Start your spark-shell on some other machine whose timezone is different
than that of Hive, say, PDT
3. Execute below code:
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val q0 = "TRUNCATE table t_tgt"
val q1 = "SELECT CAST(alias.name AS String) as a0, alias.dob as a1 FROM t_src
alias"
val q2 = "INSERT OVERWRITE TABLE t_tgt SELECT tbl0.a0 as c0, tbl0.a1 as c1
FROM tbl0"
sqlContext.sql(q0)
sqlContext.sql(q1).select("a0","a1").createOrReplaceTempView("tbl0")
sqlContext.sql(q2)
4. Now navigate to hive and check the contents of the TARGET table (t_tgt). The
dob field will have incorrect values.
Is this a known issue? Is there any work around on this? Can it be fixed?
Thanks & regards,
Pawan Lawale
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]