[jira] [Comment Edited] (SPARK-16216) CSV data source does not write date and timestamp correctly

Barry Becker (JIRA) Fri, 21 Oct 2016 09:42:33 -0700

    [ https://issues.apache.org/jira/browse/SPARK-16216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595619#comment-15595619 ]


Barry Becker edited comment on SPARK-16216 at 10/21/16 4:41 PM:
----------------------------------------------------------------

If no time zone is specified, the date should be interpreted as being in "local 
time". Adding a time zone when none was specified is not the right thing to do, 
since it makes an assumption that is not necessarily true. I think JSON is doing 
the right thing above by leaving off the time zone. I just updated to 2.0.1 and 
see that one of my tests broke because of this.
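
As a sketch of the distinction (plain JVM code, not Spark; it assumes Java 8's 
java.time is available, and the object name is only for illustration): a string 
with no zone parses as a wall-clock value, while an explicit Z or offset pins 
the same digits to a specific instant.
{code}
import java.time.{Duration, Instant, LocalDateTime, OffsetDateTime}

object ZonelessVsZoned {
  // No zone in the text: a wall-clock value that is not tied to any instant
  val local: LocalDateTime = LocalDateTime.parse("2012-01-03T09:12:00")

  // Explicit Z: the same wall-clock digits, pinned to UTC
  val utc: Instant = Instant.parse("2012-01-03T09:12:00Z")

  // Explicit -08:00 offset: same digits again, but a different instant
  val pacific: Instant =
    OffsetDateTime.parse("2012-01-03T09:12:00.000-08:00").toInstant

  def main(args: Array[String]): Unit = {
    println(s"local   = $local")
    println(s"utc     = $utc")
    println(s"pacific = $pacific")
    // The two zoned readings of the same digits are 8 hours apart
    println(Duration.between(utc, pacific).toHours)
  }
}
{code}
Appending an offset therefore changes what the value means, which is why it 
matters whether the writer adds one on its own.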
Here is my test case. I create a DataFrame containing this data:
{code}
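import java.sql.Timestamp
import org.joda.time.format.DateTimeFormat

// The pattern has no zone, so Joda parses in the JVM's default time zone
val ISO_DATE_FORMAT = DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss")
val columnData = List(
  new Timestamp(ISO_DATE_FORMAT.parseDateTime("2012-01-03T09:12:00").getMillis),
  null,
  new Timestamp(ISO_DATE_FORMAT.parseDateTime("2015-02-23T18:00:00").getMillis))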
{code}
then write it to a file using
{code}
// NULL_VALUE is a constant from my test code (the output below shows it as "?")
dataframe.write.format("csv")
        .option("delimiter", "\t")
        .option("header", "false")
        .option("nullValue", NULL_VALUE)
        .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss")
        .option("escape", "\\")
        .save(tempFileName)
{code}
Note that I specifically do not want a time zone when I write my date-times to 
the file. They are in local time, not UTC or GMT. I do not want a time zone added.
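
To make the difference concrete outside Spark, here is a minimal sketch that 
formats one timestamp with a zone-less pattern and with a pattern that appends 
milliseconds and an offset. The US Pacific zone is an assumption (suggested by a 
-08:00 offset), and the object and helper names are hypothetical.
{code}
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.TimeZone

object LocalVsOffsetFormat {
  // Assumption: the JVM default zone is US Pacific (UTC-08:00 in winter)
  val zone: TimeZone = TimeZone.getTimeZone("America/Los_Angeles")

  // Format a timestamp with the given pattern in the assumed zone
  def format(ts: Timestamp, pattern: String): String = {
    val fmt = new SimpleDateFormat(pattern)
    fmt.setTimeZone(zone)
    fmt.format(ts)
  }

  // Build the test value from its zone-less text form
  val ts: Timestamp = {
    val parser = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss")
    parser.setTimeZone(zone)
    new Timestamp(parser.parse("2012-01-03T09:12:00").getTime)
  }

  def main(args: Array[String]): Unit = {
    // Zone-less pattern: prints 2012-01-03T09:12:00
    println(format(ts, "yyyy-MM-dd'T'HH:mm:ss"))
    // Pattern with millis and offset: prints 2012-01-03T09:12:00.000-08:00
    println(format(ts, "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"))
  }
}
{code}
The first form is what I expect in the file; the second is what a pattern with 
an offset component produces from the same value.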

The dataFile used to contain
{code}
2012-01-03T09:12:00
?
2015-02-23T18:00:00
{code}
That is correct, and it is what Spark 1.6.2 produced. Now, with 2.0.1, it contains
{code}
2012-01-03T09:12:00.000-08:00
?
2015-02-23T18:00:00.000-08:00
{code}
That is not correct. I think the previous behavior was correct. Can we reopen? 
If I actually wanted the time to be treated as UTC, I could add an 
explicit Z at the end.
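
A possible workaround, offered as an untested sketch: as far as I can tell, 
since the fix for this issue the CSV writer formats TimestampType columns using 
a separate timestampFormat option, whose default includes milliseconds and an 
offset, while dateFormat applies to DateType only. Assuming that is right, 
setting timestampFormat to the zone-less pattern should restore the old output. 
The object name, the output path and the "?" null marker are placeholders.
{code}
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

object CsvTimestampWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("csv-timestamp-workaround")
      .getOrCreate()
    import spark.implicits._

    // Same values as the test case, wrapped in Tuple1 so toDF can derive a schema
    val df = Seq(
      Tuple1(Timestamp.valueOf("2012-01-03 09:12:00")),
      Tuple1[Timestamp](null),
      Tuple1(Timestamp.valueOf("2015-02-23 18:00:00"))
    ).toDF("ts")

    // Hypothetical output directory; pass a real one as the first argument
    val out = args.headOption.getOrElse("/tmp/spark-16216-out")

    df.write.format("csv")
      .option("delimiter", "\t")
      .option("header", "false")
      .option("nullValue", "?")
      // Assumption: timestamps are governed by timestampFormat, not dateFormat
      .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
      .mode("overwrite")
      .save(out)

    spark.stop()
  }
}
{code}
Even if this works, the default still adds an offset that was never specified, 
which is the behavior I am questioning.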




> CSV data source does not write date and timestamp correctly
> -----------------------------------------------------------
>
>                 Key: SPARK-16216
>                 URL: https://issues.apache.org/jira/browse/SPARK-16216
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Blocker
>              Labels: releasenotes
>             Fix For: 2.0.1, 2.1.0
>
>
> Currently, the CSV data source writes {{DateType}} and {{TimestampType}} as below:
> {code}
> +----------------+
> |            date|
> +----------------+
> |1440637200000000|
> |1414459800000000|
> |1454040000000000|
> +----------------+
> {code}
> It would be nicer if it wrote dates and timestamps as formatted strings, just 
> like the JSON data source.
> Also, the CSV data source currently supports the {{dateFormat}} option to read 
> dates and timestamps in a custom format. It might be better if this option 
> could be applied when writing as well.


