bibhu107 commented on code in PR #9492: URL: https://github.com/apache/hudi/pull/9492#discussion_r1755289129
########## rfc/rfc-70/rfc-70.md: ##########

@@ -0,0 +1,177 @@
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->

# RFC-70: Hudi Reverse Streamer

## Proposers

* @pratyakshsharma

## Approvers

*

## Status

UNDER REVIEW

JIRA: https://issues.apache.org/jira/browse/HUDI-6425

> Please keep the status updated in `rfc/README.md`.

## Abstract

This RFC proposes the design and implementation of `HoodieReverseStreamer`, a tool whose behaviour is the exact opposite of the more popular `HoodieDeltaStreamer`. HoodieDeltaStreamer lets users stream data incrementally into Hudi tables from varied sources such as Kafka and DFS; HoodieReverseStreamer will let users stream data back from Hudi to various sinks such as Kafka, DFS, etc.

## Background

Apache Hudi currently supports multiple ways of writing data into Hudi tables and migrating existing data into Hudi format: the Spark datasource, DeltaStreamer, the Flink streamer, and the lower-level HoodieWriteClient. Together these cover pretty much all the common workloads, sources, and use cases. Hudi pioneered the "lakehouse" term, and a lakehouse is simply the combination of a data lake and a data warehouse.
Data warehouses are generally created to solve a particular business use case and are served out of a separate database. A typical pipeline for building a data warehouse involves extracting data from primary relational databases, transforming it, and then serving it from a specialized database. Essentially this involves some tool (for example, Debezium + Secor, or Apache Sqoop) to extract raw data from the relational database, followed by some kind of transformation to prepare the derived data. Hudi has been a popular choice for extracting and transforming the data, but it does not let users move the data back to some other database for serving to end users. With this RFC, we propose a new way of creating data pipelines in which HoodieDeltaStreamer + Transformer prepares the derived data and, at a later stage, HoodieReverseStreamer writes this data to a desired location for end-user consumption.

## Implementation

At a high level, the changes for this new tool fall into two categories: the writer and timeline metadata.

### Writer Flow

A new class `HoodieReverseStreamer` will be introduced in the hudi-utilities package. It is responsible for the following actions:

- Read the entire data from the Hudi table when no checkpoint is provided; otherwise, read from the last checkpoint. An option to manually override the checkpoint to resume from will be provided. Checkpoint information will be maintained in a new timeline action, `reverseCommit`. Code from HiveIncrementalPuller can probably be reused for this step. Also, since checkpoints are involved, this will only be supported via a spark-submit job and not the Spark datasource, in line with current DeltaStreamer support.
- Process the records to remove Hoodie-specific metadata columns.
- Convert the records into the sink-specific format and perform any serialization needed.
- Write to the sink.

This does not need any schema provider, since all the necessary checks were already performed when the data was written into the Hudi table.

```java
public abstract class HoodieReverseStreamer {

  abstract void readFromSource();

  abstract void writeToSink();
}
```

Review Comment:

**Optimizing the Hudi writer-reader workflow: inline reverse streamer**

Typically, after writing to a Hudi table, a reader performs an incremental or snapshot read before sending data downstream to another Hudi table, S3, or other storage. To streamline this, could we cache write commits and enable an inline ReverseStreamer? This would eliminate the need for a separate read job, allowing the upserted data (taking care of atomicity) to be sent directly to downstream sinks within the same job, reducing overhead.
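As an illustration of the record-processing step in the writer flow above (stripping Hoodie-specific metadata columns before a record is serialized for the sink), here is a minimal plain-Java sketch. The class and method names are hypothetical, not part of the proposed API; only the `_hoodie_*` column names are Hudi's actual record-level metadata fields.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch: the post-processing step of the proposed reverse
// streamer that drops Hudi metadata columns before writing to the sink.
public class ReverseStreamerSketch {

    // The five metadata columns Hudi adds to every record.
    private static final Set<String> HOODIE_META_COLUMNS = Set.of(
        "_hoodie_commit_time", "_hoodie_commit_seqno",
        "_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name");

    // Remove Hudi-specific metadata columns, keeping only business columns.
    public static Map<String, Object> stripMetaColumns(Map<String, Object> record) {
        return record.entrySet().stream()
            .filter(e -> !HOODIE_META_COLUMNS.contains(e.getKey()))
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static void main(String[] args) {
        Map<String, Object> record = new HashMap<>();
        record.put("_hoodie_commit_time", "20230801120000");
        record.put("_hoodie_record_key", "id:42");
        record.put("id", 42);
        record.put("name", "alice");

        // Only the business columns ("id", "name") survive.
        System.out.println(stripMetaColumns(record).keySet());
    }
}
```

In an actual Spark-based implementation this would operate on DataFrame columns rather than per-record maps, but the filtering logic is the same.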
