[ 
https://issues.apache.org/jira/browse/HUDI-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190482#comment-17190482
 ] 

Ashish M G edited comment on HUDI-1267 at 9/4/20, 2:10 AM:
-----------------------------------------------------------

[~vinoth] Yes, it would be good to have the present timeline CLI metadata 
available as a table in Hudi. Maybe I can raise another Jira for the same, if 
that's possible for upcoming releases. The idea for this Jira adds more value 
for audit purposes ( Point 2 ), giving the user a view of all the transactions 
happening on the Data Lake. But if we can have Point 1 implemented in an 
immediate release, that would be great, as it's just the addition of the last 
operation performed on a row.
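For what it's worth, much of the Point (2) audit view could already be approximated by post-processing the commit metadata JSON files under a table's `.hoodie` directory. A minimal sketch in plain Python, assuming the `operationType` field present in recent commit metadata; the `source` entry in `extraMetadata` is a hypothetical key the writer would have to set itself, since Hudi does not record it by default:

```python
import glob
import json
import os

def audit_rows(hoodie_dir, target_table):
    """Build Source/Timestamp/Transaction Type/Target rows from Hudi
    commit metadata files.

    Assumptions: each *.commit file under ``hoodie_dir`` is JSON carrying
    an 'operationType' field, and the writer has stashed a (hypothetical)
    'source' entry in the 'extraMetadata' map. The file name prefix is
    the commit instant time.
    """
    rows = []
    for path in sorted(glob.glob(os.path.join(hoodie_dir, "*.commit"))):
        with open(path) as f:
            meta = json.load(f)
        # Commit files are named <instant-time>.commit
        commit_time = os.path.basename(path).split(".")[0]
        rows.append({
            "Source": meta.get("extraMetadata", {}).get("source", "unknown"),
            "Timestamp": commit_time,
            "Transaction Type": meta.get("operationType", "UNKNOWN"),
            "Target": target_table,
        })
    return rows
```

The resulting list of dicts could then be written out as Parquet and queried from Spark, which is essentially the separate common table Point (2) asks for.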



> Additional Metadata Details for Hudi Transactions
> -------------------------------------------------
>
>                 Key: HUDI-1267
>                 URL: https://issues.apache.org/jira/browse/HUDI-1267
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Usability
>            Reporter: Ashish M G
>            Priority: Major
>              Labels: features
>             Fix For: 0.7.0
>
>
> Whenever any of the following scenarios happens:
>  # Custom Datasource ( Kafka for instance ) -> Hudi Table
>  # Hudi -> Hudi Table
>  # s3 -> Hudi Table
> The following metadata needs to be captured:
>  # Table Level Metadata
>  ** Operation name ( record level ), e.g. Upsert, Insert, for the last 
> operation performed on the row
>  # Transaction Level Metadata ( this will be logged at the Hudi level, not 
> the table level )
>  ** Source ( Kafka topic name / S3 URL of the source data in case of S3, etc. )
>  ** Target Hudi table name
>  ** Last transaction time ( last commit time )
> Basically, point (1) collects details at the table level and point (2) 
> collects all the transactions that happened at the Hudi level.
> Point (1) would just be a column addition for the operation type.
> E.g. for Point (2): suppose we had an ingestion from Kafka topic 'A' to Hudi 
> table 'ingest_kafka' and another ingestion from an RDBMS table ( 'tableA' ) 
> through Sqoop to Hudi table 'RDBMSingest'; then the metadata captured would 
> be:
>  
> |Source|Timestamp|Transaction Type|Target|
> |Kafka - 'A'|XXXXXX|UPSERT|ingest_kafka|
> |RDBMS - 'tableA'|XXXXXX|INSERT|RDBMSingest|
>  
> The Transaction Details table in Point (2) should be available as a separate 
> common table which can be queried as a Hudi table, or stored as Parquet and 
> queried from Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
