[jira] [Commented] (HIVE-6394) Implement Timestmap in ParquetSerde

Cheng Lian (JIRA) Tue, 25 Aug 2015 03:20:02 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-6394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711046#comment-14711046
 ]


Cheng Lian commented on HIVE-6394:
----------------------------------

While testing Spark SQL 1.5-SNAPSHOT for Parquet/Hive compatibility, we hit 
SPARK-10177.  In short, Spark SQL and Hive both have their own Julian date 
conversion code, and their results don't match.  Currently, we've fixed this 
issue by making Spark SQL behave the same as Hive so that we can interoperate 
(see [Spark PR #8400|https://github.com/apache/spark/pull/8400]).  However, 
Hive's behavior looks a little bit weird to me: when converting a calendar 
timestamp to a Julian timestamp, Hive always gives a result 12 hours later than 
the expected result.

This behavior can be verified by the following Spark 1.5-SNAPSHOT shell snippet 
(I'm using Spark 1.5-SNAPSHOT shell since it comes with Hive 1.2.1 
dependencies):
{code}
import java.sql._
import java.util._

import org.apache.hadoop.hive.ql.io.parquet.timestamp._
import org.apache.spark.sql.catalyst.util._

TimeZone.setDefault(TimeZone.getTimeZone("GMT"))
val timestamp = Timestamp.valueOf("1970-00-00 00:00:00")

val hiveNanoTime = NanoTimeUtils.getNanoTime(timestamp, false)
val hiveJulianDay = hiveNanoTime.getJulianDay
val hiveTimeOfDayNanos = hiveNanoTime.getTimeOfDayNanos

println(
  s"""Hive converts "$timestamp" to Julian timestamp:
     |(julianDay=$hiveJulianDay, timeOfDayNanos=$hiveTimeOfDayNanos)
   """.stripMargin)
{code}
The result is:
{noformat}
Hive converts "1970-01-01 00:00:00.0" to Julian timestamp:
(julianDay=2440588, timeOfDayNanos=0)
{noformat}
According to definition on [this 
page|http://aa.usno.navy.mil/data/docs/JulianDate.php], Julian dates count from 
noon.  Namely "00:00:00" of any calendar date must map to a Julian timestamp 
with a fraction of 0.5, i.e. an integral date plus 12 hours.  And the correct 
Julian timestamp given by the converter in the aforementioned page is 
"2440587.500000", which is equivalent to:
{noformat}
(julianDay=2440587, timeOfDayNanos=43200000000000)
{noformat}
This means, INT96 timestamp values stored in Parquet files written by Hive all 
have a 12 hr offset.  (I haven't tried to verify this issue against Impala.)

This shouldn't a big problem though, as long as the read path always correctly 
decode written timestamp values.  Just curious, is this 12 hr offset 
intentional?


> Implement Timestmap in ParquetSerde
> -----------------------------------
>
>                 Key: HIVE-6394
>                 URL: https://issues.apache.org/jira/browse/HIVE-6394
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Serializers/Deserializers
>            Reporter: Jarek Jarcec Cecho
>            Assignee: Szehon Ho
>              Labels: Parquet
>             Fix For: 0.14.0
>
>         Attachments: HIVE-6394.2.patch, HIVE-6394.3.patch, HIVE-6394.4.patch, 
> HIVE-6394.5.patch, HIVE-6394.6.patch, HIVE-6394.6.patch, HIVE-6394.7.patch, 
> HIVE-6394.patch
>
>
> This JIRA is to implement timestamp support in Parquet SerDe.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HIVE-6394) Implement Timestmap in ParquetSerde

Reply via email to