[
https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202322#comment-16202322
]
Alexandre Dupriez edited comment on SPARK-18350 at 10/12/17 5:30 PM:
---------------------------------------------------------------------
Hello all,
I have a use case where a {{Dataset}} contains a column of type
{{java.sql.Timestamp}} (let's call it {{_time}}) which I am using to derive new
columns with the year, month, day and hour specified by the {{_time}} column,
with something like:
{code:java}
session.read.schema(mySchema)
.json(path)
.withColumn("year", year($"_time"))
.withColumn("month", month($"_time"))
.withColumn("day", dayofmonth($"_time"))
.withColumn("hour", hour($"_time"))
{code}
using the standard {{year}}, {{month}}, {{dayofmonth}} and {{hour}} functions
defined in {{org.apache.spark.sql.functions}}.
Now let's assume the time zone is row-dependent - and let's call {{_tz}} the
column which contains it. Because the time zone varies per row, I cannot simply
configure the {{DataFrameWriter}} with a {{timeZone}} option.
I wondered if something like this would be advisable:
{code:java}
session.read.schema(mySchema)
.json(path)
.withColumn("year", year($"_time"))
.withColumn("month", month($"_time"))
.withColumn("day", dayofmonth($"_time"))
.withColumn("hour", hour($"_time", $"_tz"))
{code}
Looking at the definition of the {{hour}} function, it delegates to an {{Hour}}
expression which can be constructed with an optional {{timeZoneId}}.
I have been trying to create an {{Hour}} expression directly, but it is a
Spark-internal construct and the public API does not allow using it.
I suspect that providing a function {{hour(t: Column, tz: Column)}} alongside the
existing {{hour(t: Column)}} would not be a satisfying design.
Do you think an elegant solution exists for this use case? Or is my approach
flawed - i.e. should I avoid deriving the hour from a timestamp column when it
depends on a non-predefined, row-dependent time zone like this?
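For reference, the workaround I am using in the meantime is a plain UDF that takes both columns ({{hourInTz}} is just a hypothetical name here, not an existing Spark function, and it assumes {{_tz}} holds valid zone IDs such as {{"Europe/Paris"}} or {{"UTC"}}):
{code:java}
import java.time.ZoneId
import org.apache.spark.sql.functions.udf

// Hypothetical helper: hour of a timestamp interpreted in a per-row time zone.
// Throws at runtime if the _tz value is not a valid java.time zone ID.
val hourInTz = udf { (ts: java.sql.Timestamp, tz: String) =>
  ts.toInstant.atZone(ZoneId.of(tz)).getHour
}

// Usage:
// session.read.schema(mySchema).json(path)
//   .withColumn("hour", hourInTz($"_time", $"_tz"))
{code}
This works, but being an opaque UDF it cannot benefit from Catalyst optimizations the way a built-in expression would.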
> Support session local timezone
> ------------------------------
>
> Key: SPARK-18350
> URL: https://issues.apache.org/jira/browse/SPARK-18350
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Takuya Ueshin
> Labels: releasenotes
> Fix For: 2.2.0
>
> Attachments: sample.csv
>
>
> As of Spark 2.1, Spark SQL assumes the machine timezone for datetime
> manipulation, which is bad if users are not in the same timezones as the
> machines, or if different users have different timezones.
> We should introduce a session local timezone setting that is used for
> execution.
> An explicit non-goal is locale handling.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)