[ 
https://issues.apache.org/jira/browse/SPARK-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16242109#comment-16242109
 ] 

Saniya Tech commented on SPARK-22460:
-------------------------------------

No, it is not a spark-avro issue. spark-avro serializes and deserializes the
long value (the time in milliseconds since the epoch) correctly.

It is the Spark encoder, which converts the Dataset to the case class, that
incorrectly interprets the long value as the time in seconds since the epoch.

I have broken the code into steps to show the results after each step:

{code:java}
import com.databricks.spark.avro._
import spark.implicits._

// De-serialize: rawOutput is read back via the spark-avro connector
val rawOutput = spark.read.avro(path)
// output is encoded as the case class via Spark's `as`
val output = rawOutput.as[TestRecord]
{code}

Print-out of results for each step:
{code:java}
scala> data.head
res3: TestRecord = TestRecord(One,2017-11-07 14:19:42.427)

scala> data.head.modified.getTime
res4: Long = 1510064382427

scala> rawOutput.collect().head
res5: org.apache.spark.sql.Row = [One,1510064382427]

scala> output.collect().head
res6: TestRecord = TestRecord(One,49822-01-14 00:27:07.0)
{code}
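
As a possible workaround (not part of the original report; it assumes the raw
column is named `modified` and holds milliseconds since the epoch), the long
column can be converted to a timestamp explicitly before applying the encoder,
so the seconds-based cast is never hit:

{code:java}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.TimestampType

// Divide the millisecond value by 1000 to get (fractional) seconds, then cast:
// Spark interprets a numeric cast to TimestampType as seconds since the epoch.
val fixed = rawOutput
  .withColumn("modified", (col("modified") / 1000).cast(TimestampType))
  .as[TestRecord]
{code}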

This is the relevant code in Spark, where the long value is assumed to be in
seconds:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L285
{code:java}
  // converting seconds to us
  private[this] def longToTimestamp(t: Long): Long = t * 1000000L
{code}
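
For reference, a quick back-of-the-envelope check (my own arithmetic, not from
the Spark source) showing that this factor-of-1000 inflation is exactly what
produces the year ~49822 printed above:

{code:java}
// The stored value is milliseconds since the epoch, but longToTimestamp treats
// it as seconds, so the resulting timestamp lands ~1000x too far in the future.
val storedMillis = 1510064382427L                  // actual ms since epoch
val secondsPerYear = 365.2425 * 24 * 3600          // ~31,556,952 s
val yearsIfSeconds = storedMillis / secondsPerYear // ~47,852 years
// 1970 + ~47,852 ≈ 49822, matching the bogus TestRecord output above
{code}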

The Java API specifies that Timestamp.getTime() returns a long value in
milliseconds since the epoch:
https://docs.oracle.com/javase/8/docs/api/java/sql/Timestamp.html#getTime--



> Spark De-serialization of Timestamp field is Incorrect
> ------------------------------------------------------
>
>                 Key: SPARK-22460
>                 URL: https://issues.apache.org/jira/browse/SPARK-22460
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.1.1
>            Reporter: Saniya Tech
>
> We are trying to serialize Timestamp fields to Avro using the spark-avro 
> connector. I can see the Timestamp fields are getting correctly serialized as 
> long (milliseconds since the epoch). I verified that the data is correctly read 
> back from the Avro files. It is when we encode the Dataset as a case class 
> that the timestamp field is incorrectly converted, with the long value treated 
> as seconds since the epoch. As can be seen below, this shifts the timestamp 
> many years into the future.
> Code used to reproduce the issue:
> {code:java}
> import java.sql.Timestamp
> import com.databricks.spark.avro._
> import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession}
> case class TestRecord(name: String, modified: Timestamp)
> import spark.implicits._
> val data = Seq(
>   TestRecord("One", new Timestamp(System.currentTimeMillis()))
> )
> // Serialize:
> val parameters = Map("recordName" -> "TestRecord", "recordNamespace" -> "com.example.domain")
> val path = s"s3a://some-bucket/output/"
> val ds = spark.createDataset(data)
> ds.write
>   .options(parameters)
>   .mode(SaveMode.Overwrite)
>   .avro(path)
> //
> // De-serialize
> val output = spark.read.avro(path).as[TestRecord]
> {code}
> Output from the test:
> {code:java}
> scala> data.head
> res4: TestRecord = TestRecord(One,2017-11-06 20:06:19.419)
> scala> output.collect().head
> res5: TestRecord = TestRecord(One,49819-12-16 17:23:39.0)
> {code}


