[
https://issues.apache.org/jira/browse/HUDI-184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979838#comment-16979838
]
Taher Koitawala commented on HUDI-184:
--------------------------------------
[~yanghua] For streaming, I do not have a specific plan, only a rough idea of
what it should look like and what could work better. However, I feel we should
start by prototyping our own Hudi implementation in Flink.
State will be a huge part of the implementation for Hudi indexes. With that
prototype we will get more clarity on how to improve our approach further. I am
really sure that Flink state can handle the huge Spark cache of indexes we keep
in Hudi, because in Flink all caches, joins, and Flink SQL are powered by Flink
state, and the user can choose a memory, file, or RocksDB state backend.
Combining state with ProcessFunctions, which give us onTimer control and side
outputs, and with proper keyBys, I think we can make it work.
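
To make the keyed-state index idea a bit more concrete, here is a minimal
sketch in plain Java with no Flink dependency. Everything in it is an
assumption for illustration: the HashMap stands in for per-key Flink state
(which could be backed by RocksDB), the two lists stand in for a main output
and a side output, and the names KeyedIndexSketch, Rec, and the "file-N" ids
are hypothetical. It is not how Hudi's index is actually implemented.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Plain-Java stand-in for a keyed-state index. In a real Flink job the map
 * below would be keyed state inside a process function applied after a keyBy
 * on the record key; here it is an ordinary in-memory map.
 */
public class KeyedIndexSketch {

    /** Hypothetical incoming record: a record key plus a payload. */
    static final class Rec {
        final String key;
        final String payload;

        Rec(String key, String payload) {
            this.key = key;
            this.payload = payload;
        }
    }

    // Stand-in for keyed state: record key -> id of the file holding it.
    final Map<String, String> keyToFileId = new HashMap<>();
    // Stand-in for the main output: records whose key is already indexed.
    final List<String> updates = new ArrayList<>();
    // Stand-in for a side output: records seen for the first time.
    final List<String> inserts = new ArrayList<>();
    // Counter used to hand out hypothetical file ids.
    int nextFile = 0;

    /** Tags one record as an insert or an update using the index state. */
    void process(Rec r) {
        String fileId = keyToFileId.get(r.key);
        if (fileId == null) {
            // Unknown key: assign a new file id and remember it in state.
            fileId = "file-" + nextFile++;
            keyToFileId.put(r.key, fileId);
            inserts.add(r.key + "@" + fileId);
        } else {
            // Known key: route the record to the file that already holds it.
            updates.add(r.key + "@" + fileId);
        }
    }

    public static void main(String[] args) {
        KeyedIndexSketch idx = new KeyedIndexSketch();
        idx.process(new Rec("a", "v1"));
        idx.process(new Rec("b", "v1"));
        idx.process(new Rec("a", "v2"));
        System.out.println("inserts=" + idx.inserts);
        System.out.println("updates=" + idx.updates);
    }
}
```

The point of the sketch is only that the lookup and the update of the index
happen per key, next to the record, instead of through a separately cached
index data set.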
I am also confident because I have deployed Flink jobs that handled around
21TB of data per day in RocksDB state, and the pipelines were really
performant and fault tolerant. What I am not confident about is reading the
existing data files with the Flink streaming model, as I have seen a lot of
performance degradation with that.
Moreover, to avoid a lot of code rewriting, we can base our implementation
directly on Flink SQL. Since the community's contribution of the Blink planner
(a SQL planner for Flink), Flink SQL has really taken off and is stronger now.
Let me know what you think about this!
As for batch, we can still use state there too. However, I am a little
reluctant and wary about Flink batch, because in almost all the use cases we
tried, Flink batch could never beat Spark batch.
> Integrate Hudi with Apache Flink
> --------------------------------
>
> Key: HUDI-184
> URL: https://issues.apache.org/jira/browse/HUDI-184
> Project: Apache Hudi (incubating)
> Issue Type: New Feature
> Components: Write Client
> Reporter: vinoyang
> Assignee: vinoyang
> Priority: Major
>
> Apache Flink is a popular stream processing engine.
> Integrating Hudi with Flink would be valuable work.
> The discussion mailing thread is here:
> [https://lists.apache.org/api/source.lua/1533de2d4cd4243fa9e8f8bf057ffd02f2ac0bec7c7539d8f72166ea@%3Cdev.hudi.apache.org%3E]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)