[
https://issues.apache.org/jira/browse/HUDI-83?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17058635#comment-17058635
]
Cosmin Iordache edited comment on HUDI-83 at 3/13/20, 11:11 AM:
----------------------------------------------------------------
I was looking at how Hudi saves data with Spark 2.4.4, and things have changed, probably because of https://github.com/apache/incubator-hudi/pull/1005.
Decimal types are now saved correctly, and so are timestamps.
Example of an inferred timestamp column being read back after being saved with Hudi:
{code:java}
scala> val q3 =
spark.read.format("org.apache.hudi").load("hdfs://namenode:8020/data/lake/d3325f10-4a91-4b19-872b-5be019c4836a/converted/*/*")
5651992 [main] WARN org.apache.hudi.DefaultSource - Snapshot view not
supported yet via data source, for MERGE_ON_READ tables. Please query the Hive
table registered using Spark SQL.
q3: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string,
_hoodie_commit_seqno: string ... 13 more fields]

scala> q3.show()
scala> q3.select("other_date","timestamp_1").show
+-------------------+-------------------+
| other_date| timestamp_1|
+-------------------+-------------------+
|2017-09-17 00:00:00|2017-01-01 00:00:00|
|2017-09-16 00:00:00|2017-01-01 00:00:00|
+-------------------+-------------------+
scala> q3.select("other_date","timestamp_1").dtypes
res6: Array[(String, String)] = Array((other_date,TimestampType),
(timestamp_1,TimestampType))
{code}
The Avro schema that was sent was:
{code:java}
...{
  "name" : "other_date",
  "type" : [ {
    "type" : "long",
    "logicalType" : "timestamp-micros"
  }, "null" ]
}, {
  "name" : "timestamp_1",
  "type" : [ {
    "type" : "long",
    "logicalType" : "timestamp-micros"
  }, "null" ]
}...
{code}
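As a side note (this is not Hudi code, just a minimal sketch of what the {{timestamp-micros}} logical type means): the Avro {{long}} carries microseconds since the Unix epoch, which is what Spark then surfaces as {{TimestampType}}. Class and method names below are hypothetical, for illustration only:
{code:java}
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class TimestampMicrosDemo {
    // Encode an instant the way Avro timestamp-micros stores it:
    // a long counting microseconds since the Unix epoch.
    static long toMicros(Instant ts) {
        return ChronoUnit.MICROS.between(Instant.EPOCH, ts);
    }

    // Decode the long back into an instant.
    static Instant fromMicros(long micros) {
        return Instant.EPOCH.plus(micros, ChronoUnit.MICROS);
    }

    public static void main(String[] args) {
        // Same value as the timestamp_1 column shown above.
        Instant ts = Instant.parse("2017-01-01T00:00:00Z");
        long micros = toMicros(ts);
        System.out.println(micros);             // 1483228800000000
        System.out.println(fromMicros(micros)); // 2017-01-01T00:00:00Z
    }
}
{code}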
And for Decimal:
{code:java}
scala> val q2 =
spark.read.format("org.apache.hudi").load("hdfs://namenode:8020/data/lake/5a3d9896-b331-4b5d-8638-5d72e02edd34/converted/*/*")
6221463 [main] WARN org.apache.hudi.DefaultSource - Snapshot view not
supported yet via data source, for MERGE_ON_READ tables. Please query the Hive
table registered using Spark SQL.
q2: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string,
_hoodie_commit_seqno: string ... 40 more fields]
scala> q2.select("LIMIT_BAL").show()
+---------+
|LIMIT_BAL|
+---------+
| 260000|
| 110000|
| 50000|
....
scala> q2.select("LIMIT_BAL").dtypes
res10: Array[(String, String)] = Array((LIMIT_BAL,DecimalType(6,0)))
{code}
And the schema that was sent was:
{code:java}
{
  "name" : "LIMIT_BAL",
  "type" : [ {
    "type" : "fixed",
    "name" : "fixed",
    "namespace" : "hoodie.doi8nhn.doi8nhn_record.LIMIT_BAL",
    "size" : 3,
    "logicalType" : "decimal",
    "precision" : 6,
    "scale" : 0
  }, "null" ]
}
{code}
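The {{"size" : 3}} here follows from the precision: an Avro decimal backed by {{fixed}} stores the unscaled value as a two's-complement big-endian integer, so n bytes can hold values up to 2^(8n-1) - 1, and 3 bytes (up to 8,388,607) are enough for 6 decimal digits. A small sketch of that calculation (hypothetical helper, not Hudi or Avro API):
{code:java}
import java.math.BigInteger;

public class DecimalFixedSizeDemo {
    // Minimal fixed size in bytes for an Avro decimal of the given precision.
    // The unscaled value is a two's-complement integer, so we need enough bits
    // for the largest magnitude (10^precision - 1) plus a sign bit.
    static int minFixedSize(int precision) {
        BigInteger max = BigInteger.TEN.pow(precision).subtract(BigInteger.ONE);
        int bits = max.bitLength() + 1; // +1 for the sign bit
        return (bits + 7) / 8;          // round up to whole bytes
    }

    public static void main(String[] args) {
        System.out.println(minFixedSize(6)); // 3, matching "size" : 3 above
    }
}
{code}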
This introduces a backwards-compatibility issue, though.
> Support for timestamp datatype in Hudi
> --------------------------------------
>
> Key: HUDI-83
> URL: https://issues.apache.org/jira/browse/HUDI-83
> Project: Apache Hudi (incubating)
> Issue Type: Bug
> Components: Usability
> Reporter: Vinoth Chandar
> Priority: Major
> Fix For: 0.6.0
>
>
> [https://github.com/apache/incubator-hudi/issues/543] & related issues
--
This message was sent by Atlassian Jira
(v8.3.4#803005)