[ https://issues.apache.org/jira/browse/HUDI-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinoth Chandar updated HUDI-1894:
---------------------------------
Labels: schema sev:critical triaged (was: schema sev:high triaged)
> NULL values in timestamp column defaulted
> ------------------------------------------
>
> Key: HUDI-1894
> URL: https://issues.apache.org/jira/browse/HUDI-1894
> Project: Apache Hudi
> Issue Type: Bug
> Components: Spark Integration
> Reporter: Eldhose Paul
> Assignee: sivabalan narayanan
> Priority: Critical
> Labels: schema, sev:critical, triaged
>
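> Reading a timestamp column through Hudi and reading it directly from the
> underlying Parquet files in Spark give different results. For the record
> shown below, the Parquet files hold null in resolutiondate and archiveddate,
> while the Hudi read returns non-null timestamps.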
> *hudi properties:*
> {code:java}
> hdfs dfs -cat /user/hive/warehouse/jira_expl.db/jiraissue_events/.hoodie/hoodie.properties
> #Properties saved on Tue May 11 17:17:22 EDT 2021
> #Tue May 11 17:17:22 EDT 2021
> hoodie.compaction.payload.class=org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
> hoodie.table.name=jiraissue
> hoodie.archivelog.folder=archived
> hoodie.table.type=MERGE_ON_READ
> hoodie.table.version=1
> hoodie.timeline.layout.version=1
> {code}
>
> *Reading directly from parquet using Spark:*
> {code:java}
> scala> val ji = spark.read.format("parquet").load("/user/hive/warehouse/jira_expl.db/jiraissue_events/*.parquet")
> ji: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 49 more fields]
>
> scala> ji.filter($"id" === 1237858).withColumn("inputfile", input_file_name()).select(
>          $"_hoodie_commit_time", $"_hoodie_commit_seqno", $"_hoodie_record_key",
>          $"_hoodie_partition_path", $"_hoodie_file_name", $"resolutiondate",
>          $"archiveddate", $"inputfile").show(false)
> +-------------------+----------------------+------------------+----------------------+-----------------------------------------------------------------------+--------------+------------+------------------------------------------------------------------------------------------------------------------------------------------------+
> |_hoodie_commit_time|_hoodie_commit_seqno  |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                      |resolutiondate|archiveddate|inputfile                                                                                                                                       |
> +-------------------+----------------------+------------------+----------------------+-----------------------------------------------------------------------+--------------+------------+------------------------------------------------------------------------------------------------------------------------------------------------+
> |20210511171722     |20210511171722_7_13718|1237858.0         |                      |832cf07f-637b-4a4c-ab08-6929554f003a-0_7-98-5106_20210511171722.parquet|null          |null        |hdfs://nameservice1/user/hive/warehouse/jira_expl.db/jiraissue_events/832cf07f-637b-4a4c-ab08-6929554f003a-0_7-98-5106_20210511171722.parquet   |
> |20210511171722     |20210511171722_7_13718|1237858.0         |                      |832cf07f-637b-4a4c-ab08-6929554f003a-0_7-98-5106_20210511171722.parquet|null          |null        |hdfs://nameservice1/user/hive/warehouse/jira_expl.db/jiraissue_events/832cf07f-637b-4a4c-ab08-6929554f003a-0_8-1610-78711_20210511173615.parquet|
> +-------------------+----------------------+------------------+----------------------+-----------------------------------------------------------------------+--------------+------------+------------------------------------------------------------------------------------------------------------------------------------------------+
> {code}
> *Reading `hudi` using Spark:*
> {code:java}
> scala> val jih = spark.read.format("org.apache.hudi").load("/user/hive/warehouse/jira_expl.db/jiraissue_events")
> jih: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 49 more fields]
>
> scala> jih.filter($"id" === 1237858).select(
>          $"_hoodie_commit_time", $"_hoodie_commit_seqno", $"_hoodie_record_key",
>          $"_hoodie_partition_path", $"_hoodie_file_name", $"resolutiondate",
>          $"archiveddate").show(false)
> +-------------------+----------------------+------------------+----------------------+-----------------------------------------------------------------------+-------------------+-------------------+
> |_hoodie_commit_time|_hoodie_commit_seqno  |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                      |resolutiondate     |archiveddate       |
> +-------------------+----------------------+------------------+----------------------+-----------------------------------------------------------------------+-------------------+-------------------+
> |20210511171722     |20210511171722_7_13718|1237858.0         |                      |832cf07f-637b-4a4c-ab08-6929554f003a-0_7-98-5106_20210511171722.parquet|2018-07-30 14:58:52|1969-12-31 19:00:00|
> +-------------------+----------------------+------------------+----------------------+-----------------------------------------------------------------------+-------------------+-------------------+
> {code}
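
One possible reading of the archiveddate value above: 1969-12-31 19:00:00 is how epoch 0 renders in US Eastern time, which would be consistent with a NULL being replaced by a default of 0 somewhere on the Hudi read path. The sketch below only checks that arithmetic with java.time. The time zone is an assumption (inferred from the "EDT" in the hoodie.properties header), and the object and method names are hypothetical, not part of Hudi or Spark.

```scala
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

object EpochZeroCheck {
  // Assumption: the Spark session displayed timestamps in US Eastern time.
  val zone: ZoneId = ZoneId.of("America/New_York")

  // Same layout Spark uses when printing a timestamp column with show().
  val fmt: DateTimeFormatter =
    DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(zone)

  // Render an epoch-millisecond value as a local timestamp string.
  def render(epochMillis: Long): String =
    fmt.format(Instant.ofEpochMilli(epochMillis))

  def main(args: Array[String]): Unit = {
    // Epoch 0 is 1970-01-01 00:00:00 UTC, i.e. five hours earlier in EST.
    println(render(0L)) // prints 1969-12-31 19:00:00
  }
}
```

This only shows that the displayed archiveddate matches a zero-valued timestamp under that time-zone assumption; it does not identify where in the read path the default is introduced, and it does not explain the non-null resolutiondate.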