[ 
https://issues.apache.org/jira/browse/HUDI-184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979838#comment-16979838
 ] 

Taher Koitawala commented on HUDI-184:
--------------------------------------

[~yanghua] For streaming I do not have a specific plan, only a rough idea of 
how it should look and what could work better. However, I feel we should start 
by mocking up our own Hudi implementation in Flink. 

State will be a huge part of the implementation for Hudi indexes. With that 
implementation we will get more clarity on how to improve our approach 
further, as I am quite sure that Flink state can handle the huge cache of 
indexes that we currently keep in Spark for Hudi. The reason is that in Flink 
all caches, joins, and Flink SQL are powered by Flink state, and the user can 
choose memory, file, or RocksDB state. Combining state with ProcessFunctions, 
which give us onTimer controls and side outputs, and with proper keyBys, I 
think we can make it work. 
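
As a rough illustration of the keyed-state idea (a sketch only, not Hudi or 
Flink code): the plain-Java class below uses a HashMap as a stand-in for Flink 
keyed state. It maps each record key to the file it was first assigned to, so 
later records with the same key are tagged as updates. The class name, the 
method name, and the "INSERT:"/"UPDATE:" tags are all hypothetical. In a real 
job the map would be replaced by keyed state behind a keyBy on the record key.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java stand-in for an index kept in keyed state. All names are hypothetical. */
class KeyedIndexSketch {
    // Stands in for per-key state; in Flink this would be keyed state
    // scoped by keyBy(recordKey) rather than one shared map.
    private final Map<String, String> recordKeyToFileId = new HashMap<>();

    /**
     * Tags a record by looking up its key in the index.
     * An unseen key is remembered with the candidate file and tagged INSERT;
     * a known key is tagged UPDATE against the file already recorded for it.
     */
    String tag(String recordKey, String candidateFileId) {
        String existing = recordKeyToFileId.putIfAbsent(recordKey, candidateFileId);
        if (existing == null) {
            return "INSERT:" + candidateFileId;
        }
        return "UPDATE:" + existing;
    }

    /** Number of distinct record keys currently held in the index. */
    int size() {
        return recordKeyToFileId.size();
    }
}
```

The point of the sketch is only that the index lookup becomes a per-key state 
read instead of a join against a cached data set. 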

Also, I am confident because I have deployed Flink jobs that handled around 
21 TB of data per day in RocksDB state, and the pipelines were really 
performant and fault tolerant. What I am not confident about is reading the 
existing data files with the Flink streaming model, as I have seen a lot of 
performance degradation with that.

Moreover, to avoid rewriting a lot of code, we can base our implementation 
directly on Flink SQL: after the community's contribution of the Blink planner 
(a SQL planner for Flink), Flink SQL has really taken off and is stronger now. 

Let me know what you think about this!

As for batch, we can still use state there too. However, I am a little 
reluctant and wary about Flink batch, because in almost all the use cases we 
tried, Flink batch could never beat Spark batch. 

> Integrate Hudi with Apache Flink
> --------------------------------
>
>                 Key: HUDI-184
>                 URL: https://issues.apache.org/jira/browse/HUDI-184
>             Project: Apache Hudi (incubating)
>          Issue Type: New Feature
>          Components: Write Client
>            Reporter: vinoyang
>            Assignee: vinoyang
>            Priority: Major
>
> Apache Flink is a popular stream processing engine.
> Integrating Hudi with Flink would be valuable work.
> The discussion mailing thread is here: 
> [https://lists.apache.org/api/source.lua/1533de2d4cd4243fa9e8f8bf057ffd02f2ac0bec7c7539d8f72166ea@%3Cdev.hudi.apache.org%3E]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
