[
https://issues.apache.org/jira/browse/HUDI-83?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17058635#comment-17058635
]
Cosmin Iordache edited comment on HUDI-83 at 3/13/20, 11:11 AM:
----------------------------------------------------------------
I was looking at how Hudi saves data with Spark 2.4.4, and things have changed, probably because of https://github.com/apache/incubator-hudi/pull/1005.
Decimal types are now saved correctly, and so are timestamps.
Example of an inferred timestamp column being read back after being saved with Hudi:
{code:java}
scala> val q3 =
spark.read.format("org.apache.hudi").load("hdfs://namenode:8020/data/lake/d3325f10-4a91-4b19-872b-5be019c4836a/converted/*/*")
5651992 [main] WARN org.apache.hudi.DefaultSource - Snapshot view not
supported yet via data source, for MERGE_ON_READ tables. Please query the Hive
table registered using Spark SQL.
q3: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string,
_hoodie_commit_seqno: string ... 13 more fields]

scala> q3.show()
scala> q3.select("other_date","timestamp_1").show
+-------------------+-------------------+
| other_date| timestamp_1|
+-------------------+-------------------+
|2017-09-17 00:00:00|2017-01-01 00:00:00|
|2017-09-16 00:00:00|2017-01-01 00:00:00|
+-------------------+-------------------+
scala> q3.select("other_date","timestamp_1").dtypes
res6: Array[(String, String)] = Array((other_date,TimestampType),
(timestamp_1,TimestampType))
{code}
The Avro schema that was sent was:
{code:java}
...{
  "name" : "other_date",
  "type" : [ {
    "type" : "long",
    "logicalType" : "timestamp-micros"
  }, "null" ]
}, {
  "name" : "timestamp_1",
  "type" : [ {
    "type" : "long",
    "logicalType" : "timestamp-micros"
  }, "null" ]
}...
{code}
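As a side note (this is not Hudi code, just a minimal sketch of what the {{timestamp-micros}} logical type means): the Avro {{long}} carries microseconds since the Unix epoch, which is what Spark then surfaces as {{TimestampType}}. Class and method names below are hypothetical, for illustration only:
{code:java}
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class TimestampMicrosDemo {
    // Encode an instant the way Avro timestamp-micros stores it:
    // a long counting microseconds since the Unix epoch.
    static long toMicros(Instant ts) {
        return ChronoUnit.MICROS.between(Instant.EPOCH, ts);
    }

    // Decode the long back into an instant.
    static Instant fromMicros(long micros) {
        return Instant.EPOCH.plus(micros, ChronoUnit.MICROS);
    }

    public static void main(String[] args) {
        // Same value as the timestamp_1 column shown above.
        Instant ts = Instant.parse("2017-01-01T00:00:00Z");
        long micros = toMicros(ts);
        System.out.println(micros);             // 1483228800000000
        System.out.println(fromMicros(micros)); // 2017-01-01T00:00:00Z
    }
}
{code}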
And for Decimal:
{code:java}
scala> val q2 =
spark.read.format("org.apache.hudi").load("hdfs://namenode:8020/data/lake/5a3d9896-b331-4b5d-8638-5d72e02edd34/converted/*/*")
6221463 [main] WARN org.apache.hudi.DefaultSource - Snapshot view not
supported yet via data source, for MERGE_ON_READ tables. Please query the Hive
table registered using Spark SQL.
q2: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string,
_hoodie_commit_seqno: string ... 40 more fields]
scala> q2.select("LIMIT_BAL").show()
+---------+
|LIMIT_BAL|
+---------+
| 260000|
| 110000|
| 50000|
....
scala> q2.select("LIMIT_BAL").dtypes
res10: Array[(String, String)] = Array((LIMIT_BAL,DecimalType(6,0)))
{code}
And the schema that was sent was:
{code:java}
{
  "name" : "LIMIT_BAL",
  "type" : [ {
    "type" : "fixed",
    "name" : "fixed",
    "namespace" : "hoodie.doi8nhn.doi8nhn_record.LIMIT_BAL",
    "size" : 3,
    "logicalType" : "decimal",
    "precision" : 6,
    "scale" : 0
  }, "null" ]
}
{code}
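The {{"size" : 3}} here follows from the precision: an Avro decimal backed by {{fixed}} stores the unscaled value as a two's-complement big-endian integer, so n bytes can hold values up to 2^(8n-1) - 1, and 3 bytes (up to 8,388,607) are enough for 6 decimal digits. A small sketch of that calculation (hypothetical helper, not Hudi or Avro API):
{code:java}
import java.math.BigInteger;

public class DecimalFixedSizeDemo {
    // Minimal fixed size in bytes for an Avro decimal of the given precision.
    // The unscaled value is a two's-complement integer, so we need enough bits
    // for the largest magnitude (10^precision - 1) plus a sign bit.
    static int minFixedSize(int precision) {
        BigInteger max = BigInteger.TEN.pow(precision).subtract(BigInteger.ONE);
        int bits = max.bitLength() + 1; // +1 for the sign bit
        return (bits + 7) / 8;          // round up to whole bytes
    }

    public static void main(String[] args) {
        System.out.println(minFixedSize(6)); // 3, matching "size" : 3 above
    }
}
{code}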
This introduces a backwards-compatibility issue, though.
> Support for timestamp datatype in Hudi
> --------------------------------------
>
> Key: HUDI-83
> URL: https://issues.apache.org/jira/browse/HUDI-83
> Project: Apache Hudi (incubating)
> Issue Type: Bug
> Components: Usability
> Reporter: Vinoth Chandar
> Priority: Major
> Fix For: 0.6.0
>
>
> [https://github.com/apache/incubator-hudi/issues/543] & related issues
--
This message was sent by Atlassian Jira
(v8.3.4#803005)