[jira] [Commented] (HUDI-1894) NULL values in timestamp column defaulted

Eldhose Paul (Jira) Thu, 13 May 2021 07:07:40 -0700


    [ 
https://issues.apache.org/jira/browse/HUDI-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343883#comment-17343883
 ]


Eldhose Paul commented on HUDI-1894:
------------------------------------

[~shivnarayan] I am not sure I can give you all the steps to reproduce. At a 
high level, here is what we are doing:
 # We capture CDC from a relational database using Debezium.
 # We stream the data to Kafka.
 # We use Structured Streaming to write the data into Hudi tables (on HDFS). We 
use MOR tables with inline compaction, triggered after every 10 delta commits.
 # We read the data from the Hudi tables with Spark for further processing.

I will try with COW tables and let you know the results. As an interim 
workaround in prod, we trigger compaction at every commit. That is not 
optimal, though: batch times have increased. :(
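
For reference, a minimal sketch of Hudi write options that would match the setup described above. It is an illustration under stated assumptions, not the actual job: the record key and precombine columns (`id`, `ts_ms`) are hypothetical, while the table name and table type come from the `hoodie.properties` quoted below. The `maxDeltaCommits` parameter covers both the original setting (10) and the every-commit workaround (1).

```scala
// Sketch only: column names marked "assumed" are placeholders, not taken from the job.
object HudiMorWriteOptions {
  // Build write options for a MERGE_ON_READ table with inline compaction.
  def options(maxDeltaCommits: Int): Map[String, String] = Map(
    "hoodie.table.name" -> "jiraissue",
    "hoodie.datasource.write.table.type" -> "MERGE_ON_READ",
    "hoodie.datasource.write.operation" -> "upsert",
    "hoodie.datasource.write.recordkey.field" -> "id",        // assumed key column
    "hoodie.datasource.write.precombine.field" -> "ts_ms",    // assumed ordering column
    "hoodie.compact.inline" -> "true",
    "hoodie.compact.inline.max.delta.commits" -> maxDeltaCommits.toString
  )

  // Setting described in the comment: compact after every 10 delta commits.
  val reported: Map[String, String] = options(10)

  // Interim workaround: compact at every commit.
  val workaround: Map[String, String] = options(1)
}
```

Such a map would typically be handed to the streaming writer through `.options(...)` on a `format("hudi")` sink.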

 

> NULL values in timestamp column defaulted 
> ------------------------------------------
>
>                 Key: HUDI-1894
>                 URL: https://issues.apache.org/jira/browse/HUDI-1894
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Spark Integration
>            Reporter: Eldhose Paul
>            Assignee: sivabalan narayanan
>            Priority: Major
>              Labels: sev:critical
>
> Reading a timestamp column through Hudi and directly from the underlying 
> Parquet files in Spark gives different results. 
> *hudi properties:*
> {code:java}
>  hdfs dfs -cat /user/hive/warehouse/jira_expl.db/jiraissue_events/.hoodie/hoodie.properties
> #Properties saved on Tue May 11 17:17:22 EDT 2021
> #Tue May 11 17:17:22 EDT 2021
> hoodie.compaction.payload.class=org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
> hoodie.table.name=jiraissue
> hoodie.archivelog.folder=archived
> hoodie.table.type=MERGE_ON_READ
> hoodie.table.version=1
> hoodie.timeline.layout.version=1
> {code}
>  
> *Reading directly from parquet using Spark:*
> {code:java}
> scala> val ji = spark.read.format("parquet").load("/user/hive/warehouse/jira_expl.db/jiraissue_events/*.parquet")
> ji: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 49 more fields]
>
> scala> ji.filter($"id" === 1237858).withColumn("inputfile", input_file_name()).select($"_hoodie_commit_time", $"_hoodie_commit_seqno", $"_hoodie_record_key", $"_hoodie_partition_path", $"_hoodie_file_name", $"resolutiondate", $"archiveddate", $"inputfile").show(false)
> +-------------------+----------------------+------------------+----------------------+-----------------------------------------------------------------------+--------------+------------+------------------------------------------------------------------------------------------------------------------------------------------------+
> |_hoodie_commit_time|_hoodie_commit_seqno  |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                      |resolutiondate|archiveddate|inputfile                                                                                                                                       |
> +-------------------+----------------------+------------------+----------------------+-----------------------------------------------------------------------+--------------+------------+------------------------------------------------------------------------------------------------------------------------------------------------+
> |20210511171722     |20210511171722_7_13718|1237858.0         |                      |832cf07f-637b-4a4c-ab08-6929554f003a-0_7-98-5106_20210511171722.parquet|null          |null        |hdfs://nameservice1/user/hive/warehouse/jira_expl.db/jiraissue_events/832cf07f-637b-4a4c-ab08-6929554f003a-0_7-98-5106_20210511171722.parquet   |
> |20210511171722     |20210511171722_7_13718|1237858.0         |                      |832cf07f-637b-4a4c-ab08-6929554f003a-0_7-98-5106_20210511171722.parquet|null          |null        |hdfs://nameservice1/user/hive/warehouse/jira_expl.db/jiraissue_events/832cf07f-637b-4a4c-ab08-6929554f003a-0_8-1610-78711_20210511173615.parquet|
> +-------------------+----------------------+------------------+----------------------+-----------------------------------------------------------------------+--------------+------------+------------------------------------------------------------------------------------------------------------------------------------------------+
> {code}
> *Reading `hudi` using Spark:*
> {code:java}
> scala> val jih = spark.read.format("org.apache.hudi").load("/user/hive/warehouse/jira_expl.db/jiraissue_events")
> jih: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 49 more fields]
>
> scala> jih.filter($"id" === 1237858).select($"_hoodie_commit_time", $"_hoodie_commit_seqno", $"_hoodie_record_key", $"_hoodie_partition_path", $"_hoodie_file_name", $"resolutiondate", $"archiveddate").show(false)
> +-------------------+----------------------+------------------+----------------------+-----------------------------------------------------------------------+-------------------+-------------------+
> |_hoodie_commit_time|_hoodie_commit_seqno  |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                      |resolutiondate     |archiveddate       |
> +-------------------+----------------------+------------------+----------------------+-----------------------------------------------------------------------+-------------------+-------------------+
> |20210511171722     |20210511171722_7_13718|1237858.0         |                      |832cf07f-637b-4a4c-ab08-6929554f003a-0_7-98-5106_20210511171722.parquet|2018-07-30 14:58:52|1969-12-31 19:00:00|
> +-------------------+----------------------+------------------+----------------------+-----------------------------------------------------------------------+-------------------+-------------------+
> {code}
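
One observation on the Hudi read above: `1969-12-31 19:00:00` is how Unix epoch 0 renders in US Eastern time, which suggests the NULL `archiveddate` may be coming back as a zero-valued timestamp rather than as null. The sketch below only checks that arithmetic with `java.time`; the `America/New_York` zone is an assumption inferred from the `EDT` stamp in `hoodie.properties`, and the object name is hypothetical.

```scala
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

object EpochZeroCheck {
  // Assumption: the Spark session time zone is US Eastern.
  val zone: ZoneId = ZoneId.of("America/New_York")

  // Same layout Spark uses when show() prints a timestamp column.
  val fmt: DateTimeFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Epoch 0 (1970-01-01T00:00:00Z) rendered in the assumed zone.
  val rendered: String = Instant.EPOCH.atZone(zone).format(fmt)

  def main(args: Array[String]): Unit =
    println(rendered) // 1969-12-31 19:00:00
}
```

If the session runs in a different zone, the rendered value shifts accordingly, so the match is suggestive rather than conclusive.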



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
