[ https://issues.apache.org/jira/browse/SPARK-17066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Barry Becker updated SPARK-17066:
---------------------------------
    Description: 
I noticed this when running tests after pulling and building @lw-lin 's PR 
(https://github.com/apache/spark/pull/14118). I don't think anything is wrong 
with his PR; the problem is that the fix made to spark-csv for this issue was 
never ported to Spark 2.x when Databricks' spark-csv was merged into Spark 2 
back in January. https://github.com/databricks/spark-csv/issues/308 was fixed 
in spark-csv after that merge.

The problem is that if I try to write a DataFrame that contains a date column 
out to a CSV file using something like this:

repartitionDf.write.format("csv") //.format(DATABRICKS_CSV)
        .option("delimiter", "\t")
        .option("header", "false")
        .option("nullValue", "?")
        .option("dateFormat", "yyyy-MM-dd'T'HH:mm:ss")
        .option("escape", "\\")       
        .save(tempFileName)

Then my unit test (which passed under Spark 1.6.2) fails using the Spark 2.1.0 
snapshot build that I made today. The DataFrame contained 3 values in a date 
column.

Expected "[2012-01-03T09:12:00
?
2015-02-23T18:00:]00", 
but got 
"[1325610720000000
?
14247432000000]00"

This means that while the null value is being correctly exported, the specified 
dateFormat is not being used to format the dates. Instead, what gets written 
looks like Spark's internal timestamp representation: microseconds since the 
Unix epoch.
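The raw values do decode cleanly as microseconds since the epoch. A minimal sketch that checks this (plain Java with java.time for a self-contained check; the UTC-8 zone is an assumption about the machine the test ran on, since the report does not state its timezone):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class EpochMicrosCheck {
    public static void main(String[] args) {
        // The two non-null values written to the CSV, read as microseconds since epoch
        long[] rawMicros = {1325610720000000L, 1424743200000000L};
        // Same pattern passed via the dateFormat option; UTC-8 is an assumed zone
        DateTimeFormatter fmt = DateTimeFormatter
                .ofPattern("yyyy-MM-dd'T'HH:mm:ss")
                .withZone(ZoneOffset.ofHours(-8));
        for (long micros : rawMicros) {
            // Convert microseconds to an Instant, then apply the format
            System.out.println(fmt.format(Instant.ofEpochSecond(micros / 1_000_000L)));
        }
        // Prints 2012-01-03T09:12:00 and 2015-02-23T18:00:00
    }
}
```

Under that timezone assumption the decoded strings match the expected test output exactly, which suggests the writer is emitting the raw internal long rather than applying dateFormat.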



> dateFormat should be used when writing dataframes as csv files
> --------------------------------------------------------------
>
>                 Key: SPARK-17066
>                 URL: https://issues.apache.org/jira/browse/SPARK-17066
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.0.0
>            Reporter: Barry Becker



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
