GitHub user ueshin opened a pull request:
https://github.com/apache/spark/pull/16750
[SPARK-18937][SQL] Timezone support in CSV/JSON parsing
## What changes were proposed in this pull request?
This is a follow-up PR to #16308.
This PR enables timezone support in CSV/JSON parsing.
It introduces a `timeZone` option for the CSV/JSON datasources (the default
value of the option is the session local timezone).
The datasources use the `timeZone` option to format timestamp values on
write and to parse them on read.
Note that on read, if the `timestampFormat` includes timezone info, the
`timeZone` option is not used, because the timezone carried in the values
takes precedence.
For example, given the timestamp `"2016-01-01 00:00:00"` in `GMT`, the
values written with the default `timeZone` option (`"GMT"` here, because the
session local timezone is `"GMT"`) are:
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "GMT")
scala> val df = Seq(new java.sql.Timestamp(1451606400000L)).toDF("ts")
df: org.apache.spark.sql.DataFrame = [ts: timestamp]
scala> df.show()
+-------------------+
|                 ts|
+-------------------+
|2016-01-01 00:00:00|
+-------------------+
scala> df.write.json("/path/to/gmtjson")
```
```sh
$ cat /path/to/gmtjson/part-*
{"ts":"2016-01-01T00:00:00.000Z"}
```
whereas with the option set to `"PST"`, they are:
```scala
scala> df.write.option("timeZone", "PST").json("/path/to/pstjson")
```
```sh
$ cat /path/to/pstjson/part-*
{"ts":"2015-12-31T16:00:00.000-08:00"}
```
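The two outputs above are the same instant rendered in two zones. As a rough illustration outside Spark, the sketch below reproduces both strings with plain `java.time`. The helper `render` and its pattern string are assumptions made for this example; they are not the code path this PR changes.

```scala
import java.time.Instant
import java.time.format.DateTimeFormatter
import java.util.TimeZone

// Hypothetical helper (not part of Spark): render an epoch-millis value
// as an ISO-8601 string in the given timezone.
def render(epochMillis: Long, zoneId: String): String = {
  // TimeZone.getTimeZone accepts short IDs such as "PST", unlike ZoneId.of.
  val zone = TimeZone.getTimeZone(zoneId).toZoneId
  val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
  Instant.ofEpochMilli(epochMillis).atZone(zone).format(formatter)
}

val gmt = render(1451606400000L, "GMT") // "2016-01-01T00:00:00.000Z"
val pst = render(1451606400000L, "PST") // "2015-12-31T16:00:00.000-08:00"
```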
These files are read correctly even if the `timeZone` option does not match,
because the timestamp values carry timezone info:
```scala
scala> val schema = new StructType().add("ts", TimestampType)
schema: org.apache.spark.sql.types.StructType =
StructType(StructField(ts,TimestampType,true))
scala> spark.read.schema(schema).json("/path/to/gmtjson").show()
+-------------------+
|                 ts|
+-------------------+
|2016-01-01 00:00:00|
+-------------------+
scala> spark.read.schema(schema).option("timeZone", "PST").json("/path/to/gmtjson").show()
+-------------------+
|                 ts|
+-------------------+
|2016-01-01 00:00:00|
+-------------------+
```
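The precedence rule shown above (an offset in the value wins over the configured zone) can be sketched outside Spark as follows. This is a minimal illustration using `java.time`; the helper `toEpochMillis` and its `fallbackZoneId` parameter are hypothetical names for this example and do not reflect how the datasources implement it.

```scala
import java.time.{LocalDateTime, OffsetDateTime}
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoField
import java.util.TimeZone

// Hypothetical helper (not part of Spark): parse an ISO-8601 timestamp.
// If the text carries an offset, use it; otherwise fall back to the given zone.
def toEpochMillis(text: String, fallbackZoneId: String): Long = {
  val parsed = DateTimeFormatter.ISO_DATE_TIME.parse(text)
  if (parsed.isSupported(ChronoField.OFFSET_SECONDS)) {
    // The value names its own offset, so the fallback zone is ignored.
    OffsetDateTime.from(parsed).toInstant.toEpochMilli
  } else {
    val zone = TimeZone.getTimeZone(fallbackZoneId).toZoneId
    LocalDateTime.from(parsed).atZone(zone).toInstant.toEpochMilli
  }
}

// All three evaluate to 1451606400000L, whatever the fallback zone is.
val gmtValueGmtOption = toEpochMillis("2016-01-01T00:00:00.000Z", "GMT")
val gmtValuePstOption = toEpochMillis("2016-01-01T00:00:00.000Z", "PST")
val pstValueGmtOption = toEpochMillis("2015-12-31T16:00:00.000-08:00", "GMT")
```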
And even if `timestampFormat` doesn't contain timezone info, the values are
read correctly as long as the correct `timeZone` option is set:
```scala
scala> df.write.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("timeZone", "JST").json("/path/to/jstjson")
```
```sh
$ cat /path/to/jstjson/part-*
{"ts":"2016-01-01T09:00:00"}
```
```scala
// wrong result: no timeZone option, so the session local timezone (GMT) is used
scala> spark.read.schema(schema).option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").json("/path/to/jstjson").show()
+-------------------+
|                 ts|
+-------------------+
|2016-01-01 09:00:00|
+-------------------+
// correct result: the timeZone option matches the zone used when writing
scala> spark.read.schema(schema).option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("timeZone", "JST").json("/path/to/jstjson").show()
+-------------------+
|                 ts|
+-------------------+
|2016-01-01 00:00:00|
+-------------------+
```
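The difference between the two reads above comes down to which zone the zone-less string is interpreted in. A minimal sketch with `java.time`, outside Spark (the helper `parseIn` is a hypothetical name for this example, not something this PR adds):

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.util.TimeZone

// Hypothetical helper (not part of Spark): interpret a zone-less timestamp
// string in the given timezone and return the resulting epoch millis.
def parseIn(text: String, zoneId: String): Long = {
  val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss")
  val zone = TimeZone.getTimeZone(zoneId).toZoneId
  LocalDateTime.parse(text, formatter).atZone(zone).toInstant.toEpochMilli
}

// Interpreted in GMT, the value is nine hours later than what was written.
val asGmt = parseIn("2016-01-01T09:00:00", "GMT") // 1451638800000L
// Interpreted in JST, it recovers the original instant.
val asJst = parseIn("2016-01-01T09:00:00", "JST") // 1451606400000L
```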
This PR also makes `JsonToStruct` and `StructToJson`
`TimeZoneAwareExpression`s so that they can evaluate values with the
`timeZone` option.
## How was this patch tested?
Existing tests, plus newly added tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ueshin/apache-spark issues/SPARK-18937
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16750.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16750
----
commit aa052f4d11929192b749752f4b73772664d0460c
Author: Takuya UESHIN <[email protected]>
Date: 2017-01-05T09:29:42Z
Add timeZone option to JSONOptions.
commit 890879e24b3f63509a000585e18b288961a4e5cf
Author: Takuya UESHIN <[email protected]>
Date: 2017-01-06T05:11:41Z
Apply timeZone option to JSON datasources.
commit f08b78c16ac444550e7ea0857d0275b9a91b7561
Author: Takuya UESHIN <[email protected]>
Date: 2017-01-06T06:03:34Z
Apply timeZone option to CSV datasources.
commit 551cff99785927be3ef68c4393dca4dabb3c2ba0
Author: Takuya UESHIN <[email protected]>
Date: 2017-01-06T08:39:26Z
Modify python files.
----