bibhu107 commented on code in PR #9492: URL: https://github.com/apache/hudi/pull/9492#discussion_r1755289129
########## rfc/rfc-70/rfc-70.md: ##########

@@ -0,0 +1,177 @@
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->

# RFC-70: Hudi Reverse Streamer

## Proposers

* @pratyakshsharma

## Approvers

*

## Status

UNDER REVIEW

JIRA: https://issues.apache.org/jira/browse/HUDI-6425

> Please keep the status updated in `rfc/README.md`.

## Abstract

This RFC proposes the design and implementation of `HoodieReverseStreamer`, a tool whose behaviour is the exact opposite of the more popular `HoodieDeltaStreamer`. HoodieDeltaStreamer lets users stream data incrementally into Hudi tables from varied sources such as Kafka and DFS; HoodieReverseStreamer will let users stream data back from Hudi to various sinks such as Kafka, DFS, etc.

## Background

Apache Hudi currently supports multiple ways of writing data into Hudi tables and migrating existing data into Hudi format: the Spark datasource, DeltaStreamer, the Flink streamer, and the lower-level HoodieWriteClient. Together these cover pretty much all the common workloads, sources, and use cases. Hudi pioneered the "lakehouse" term, and a lakehouse is simply the combination of a data lake and a data warehouse.
Data warehouses are generally created to solve a particular business use case and are served out of a separate database. A typical pipeline for building a data warehouse involves extracting data from primary relational databases, transforming it, and then serving it from a specialized database. Essentially this involves some tool (for example, Debezium + Secor, or Apache Sqoop) to extract raw data from the relational database, followed by some kind of transformation to prepare the derived data. Hudi has been a popular choice for extracting and transforming the data, but it does not let users move the data back to some other database for serving to end users. With this RFC, we propose a new way of creating data pipelines in which HoodieDeltaStreamer + Transformer prepares the derived data and, at a later stage, HoodieReverseStreamer writes this data to a desired location for end-user consumption.

## Implementation

At a high level, the changes for this new tool fall into two categories: the writer and timeline metadata.

### Writer Flow

A new class `HoodieReverseStreamer` will be introduced in the hudi-utilities package. It is responsible for the following actions:

- Read the entire data from the Hudi table when no checkpoint is provided; otherwise, read from the last checkpoint. An option to manually override the checkpoint to resume from will be provided. Checkpoint information will be maintained in a new timeline action, `reverseCommit`. Code from HiveIncrementalPuller can probably be reused for this step. Also, since checkpoints are involved, this will only be supported via a spark-submit job and not the Spark datasource, in line with current DeltaStreamer support.
- Process the records to remove Hoodie-specific metadata columns.
- Convert the records into the sink-specific format and perform any serialization needed.
- Write to the sink.

This does not need any schema provider, since all the necessary checks were already performed when the data was written into the Hudi table.

```java
public abstract class HoodieReverseStreamer {

  abstract void readFromSource();

  abstract void writeToSink();
}
```

Review Comment:

**Optimizing the Hudi writer-reader workflow: inline reverse streamer**

Typically, after writing to a Hudi table, a reader performs an incremental or snapshot read before sending data downstream to another Hudi table, S3, or other storage. To streamline this, could we cache write commits and enable an inline ReverseStreamer? This would eliminate the need for a separate read job, allowing the upserted data (taking care of atomicity) to be sent directly to downstream sinks within the same job, reducing overhead.
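As an illustration of the record-processing step in the writer flow above (stripping Hoodie-specific metadata columns before a record is serialized for the sink), here is a minimal plain-Java sketch. The class and method names are hypothetical, not part of the proposed API; only the `_hoodie_*` column names are Hudi's actual record-level metadata fields.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch: the post-processing step of the proposed reverse
// streamer that drops Hudi metadata columns before writing to the sink.
public class ReverseStreamerSketch {

    // The five metadata columns Hudi adds to every record.
    private static final Set<String> HOODIE_META_COLUMNS = Set.of(
        "_hoodie_commit_time", "_hoodie_commit_seqno",
        "_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name");

    // Remove Hudi-specific metadata columns, keeping only business columns.
    public static Map<String, Object> stripMetaColumns(Map<String, Object> record) {
        return record.entrySet().stream()
            .filter(e -> !HOODIE_META_COLUMNS.contains(e.getKey()))
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static void main(String[] args) {
        Map<String, Object> record = new HashMap<>();
        record.put("_hoodie_commit_time", "20230801120000");
        record.put("_hoodie_record_key", "id:42");
        record.put("id", 42);
        record.put("name", "alice");

        // Only the business columns ("id", "name") survive.
        System.out.println(stripMetaColumns(record).keySet());
    }
}
```

In an actual Spark-based implementation this would operate on DataFrame columns rather than per-record maps, but the filtering logic is the same.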
