[ 
https://issues.apache.org/jira/browse/HUDI-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17344141#comment-17344141
 ] 

Eldhose Paul commented on HUDI-1894:
------------------------------------

[~shivnarayan] I tried querying this table again later for a couple more tickets.

Log Files: 

!image-2021-05-13-17-16-20-609.png!

 

Result: looks good

!image-2021-05-13-17-17-33-181.png!

 

Log file: the tickets I am querying are not present in the log file.

!image-2021-05-13-17-18-43-830.png!

 

Result: nulls are replaced by some default values for records whose file 
group has two log file instances.

e.g.: 

.031a359b-f8f0-417a-888b-45f2a0b3a26f-0_20210513170024.log.1_5-36696-1755439
.031a359b-f8f0-417a-888b-45f2a0b3a26f-0_20210513171040.log.1_5-37626-1799835

Results from file group *031a359b-f8f0-417a-888b-45f2a0b3a26f* are incorrect.

Log files .*dbf01b8b-98c0-4768-a8d3-f562c5d17a6b.*, 
.e54bd2c3-6b9c-4e4b-95d7-0a7b81b83dd0.* and 
.3105cc7d-82a5-4d3c-b2fd-76579476abda-0.** have only one entry each, and 
*.2d6621ac-7da9-46d7-be43-c4f45f785e11** doesn't have an entry. Records from 
these files look good.

!image-2021-05-13-17-24-43-464.png!
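
To make the pattern above easier to check across a whole table, here is a minimal sketch that groups log file names by file ID and flags file groups with more than one log file. It assumes the names follow the shape seen above, `.<fileId>_<instant>.log.<version>_<writeToken>`; the class and method names (`LogFileGroups`, `fileId`, `groupByFileId`) are hypothetical helpers, not part of Hudi.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LogFileGroups {
    // Assumption: a log file name looks like ".<fileId>_<instant>.log.<version>_<writeToken>".
    // The file ID is everything between the leading dot and the first underscore.
    static String fileId(String logFileName) {
        String name = logFileName.startsWith(".") ? logFileName.substring(1) : logFileName;
        int end = name.indexOf('_');
        return end < 0 ? name : name.substring(0, end);
    }

    // Group log file names by file ID, keeping the input order within each group.
    static Map<String, List<String>> groupByFileId(List<String> names) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String n : names) {
            groups.computeIfAbsent(fileId(n), k -> new ArrayList<>()).add(n);
        }
        return groups;
    }

    public static void main(String[] args) {
        // The two log file names quoted in this comment.
        List<String> names = List.of(
            ".031a359b-f8f0-417a-888b-45f2a0b3a26f-0_20210513170024.log.1_5-36696-1755439",
            ".031a359b-f8f0-417a-888b-45f2a0b3a26f-0_20210513171040.log.1_5-37626-1799835");
        // Print only the file groups that have more than one log file.
        groupByFileId(names).forEach((id, files) -> {
            if (files.size() > 1) {
                System.out.println(id + " has " + files.size() + " log files");
            }
        });
    }
}
```

Feeding it the output of a directory listing would give the list of file groups whose records are worth comparing against the base parquet files.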

> NULL values in timestamp column defaulted 
> ------------------------------------------
>
>                 Key: HUDI-1894
>                 URL: https://issues.apache.org/jira/browse/HUDI-1894
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Spark Integration
>            Reporter: Eldhose Paul
>            Assignee: sivabalan narayanan
>            Priority: Major
>              Labels: sev:critical
>         Attachments: image-2021-05-13-17-16-20-609.png, 
> image-2021-05-13-17-17-33-181.png, image-2021-05-13-17-18-43-830.png, 
> image-2021-05-13-17-24-43-464.png
>
>
> Reading timestamp column from hudi and underlying parquet files in spark 
> gives different results. 
> *hudi properties:*
> {code:java}
>  hdfs dfs -cat 
> /user/hive/warehouse/jira_expl.db/jiraissue_events/.hoodie/hoodie.properties
> #Properties saved on Tue May 11 17:17:22 EDT 2021
> #Tue May 11 17:17:22 EDT 2021
> hoodie.compaction.payload.class=org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
> hoodie.table.name=jiraissue
> hoodie.archivelog.folder=archived
> hoodie.table.type=MERGE_ON_READ
> hoodie.table.version=1
> hoodie.timeline.layout.version=1
> {code}
>  
> *Reading directly from parquet using Spark:*
> {code:java}
> scala> val ji = 
> spark.read.format("parquet").load("/user/hive/warehouse/jira_expl.db/jiraissue_events/*.parquet")
> ji: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, 
> _hoodie_commit_seqno: string ... 49 more fields]
> scala> ji.filter($"id" === 
> 1237858).withColumn("inputfile", 
> input_file_name()).select($"_hoodie_commit_time", $"_hoodie_commit_seqno", 
> $"_hoodie_record_key", $"_hoodie_partition_path", 
> $"_hoodie_file_name",$"resolutiondate", $"archiveddate", 
> $"inputfile").show(false)
> +-------------------+----------------------+------------------+----------------------+-----------------------------------------------------------------------+--------------+------------+------------------------------------------------------------------------------------------------------------------------------------------------+
> |_hoodie_commit_time|_hoodie_commit_seqno  
> |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                  
>                                     |resolutiondate|archiveddate|inputfile    
>                                                                               
>                                                      |
> +-------------------+----------------------+------------------+----------------------+-----------------------------------------------------------------------+--------------+------------+------------------------------------------------------------------------------------------------------------------------------------------------+
> |20210511171722     |20210511171722_7_13718|1237858.0         |               
>        
> |832cf07f-637b-4a4c-ab08-6929554f003a-0_7-98-5106_20210511171722.parquet|null 
>          |null        
> |hdfs://nameservice1/user/hive/warehouse/jira_expl.db/jiraissue_events/832cf07f-637b-4a4c-ab08-6929554f003a-0_7-98-5106_20210511171722.parquet
>    |
> |20210511171722     |20210511171722_7_13718|1237858.0         |               
>        
> |832cf07f-637b-4a4c-ab08-6929554f003a-0_7-98-5106_20210511171722.parquet|null 
>          |null        
> |hdfs://nameservice1/user/hive/warehouse/jira_expl.db/jiraissue_events/832cf07f-637b-4a4c-ab08-6929554f003a-0_8-1610-78711_20210511173615.parquet|
> +-------------------+----------------------+------------------+----------------------+-----------------------------------------------------------------------+--------------+------------+------------------------------------------------------------------------------------------------------------------------------------------------+
> {code}
> *Reading `hudi` using Spark:*
> {code:java}
> scala> val jih = 
> spark.read.format("org.apache.hudi").load("/user/hive/warehouse/jira_expl.db/jiraissue_events")
> jih: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, 
> _hoodie_commit_seqno: string ... 49 more fields]
> scala> jih.filter($"id" === 
> 1237858).select($"_hoodie_commit_time", $"_hoodie_commit_seqno", 
> $"_hoodie_record_key", $"_hoodie_partition_path", 
> $"_hoodie_file_name",$"resolutiondate", $"archiveddate").show(false)
> +-------------------+----------------------+------------------+----------------------+-----------------------------------------------------------------------+-------------------+-------------------+
> |_hoodie_commit_time|_hoodie_commit_seqno  
> |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                  
>                                     |resolutiondate     |archiveddate       |
> +-------------------+----------------------+------------------+----------------------+-----------------------------------------------------------------------+-------------------+-------------------+
> |20210511171722     |20210511171722_7_13718|1237858.0         |               
>        
> |832cf07f-637b-4a4c-ab08-6929554f003a-0_7-98-5106_20210511171722.parquet|2018-07-30
>  14:58:52|1969-12-31 19:00:00|
> +-------------------+----------------------+------------------+----------------------+-----------------------------------------------------------------------+-------------------+-------------------+
> {code}
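
One observation on the `archiveddate` value in the Hudi read: `1969-12-31 19:00:00` is how Unix epoch 0 renders in US Eastern time, which would be consistent with a null being replaced by a zero default somewhere on the read path. This is only a hypothesis, and it does not explain the `resolutiondate` value. The sketch below checks the arithmetic; the `America/New_York` zone is an assumption inferred from the EDT timestamps in `hoodie.properties`, and `EpochZeroCheck` and `render` are hypothetical names.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class EpochZeroCheck {
    // Render an epoch-millisecond value as "yyyy-MM-dd HH:mm:ss" in the given zone.
    static String render(long epochMillis, String zone) {
        return DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
            .format(Instant.ofEpochMilli(epochMillis).atZone(ZoneId.of(zone)));
    }

    public static void main(String[] args) {
        // Assumption: the Spark session time zone is US Eastern.
        System.out.println(render(0L, "America/New_York")); // prints 1969-12-31 19:00:00
    }
}
```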



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
