[
https://issues.apache.org/jira/browse/HUDI-184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16981649#comment-16981649
]
vinoyang commented on HUDI-184:
-------------------------------
[~vinoth] Based on my limited observation so far, here are answers to your two questions:
bq. We need to decide when to commit a batch of record i.e pause streaming
across workers and publish to Hudi timeline. In a purely streaming model can
this be achieved?
Yes. As discussed before, we will use Flink's window mechanism as a bounded-stream
abstraction. A window is a batch; when the window fires, we can implement a
{{ProcessWindowFunction}}
(https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html#processwindowfunction)
which gives us access to all the elements cached in the window. We can implement
the commit business logic in this UDF.
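The idea can be sketched in plain Java. This is a hypothetical stand-in, not the real integration: records are buffered while a "window" is open, and the commit logic runs when the window fires. In the actual job, this buffering and firing would happen inside a Flink {{ProcessWindowFunction}}, whose {{process()}} method receives all elements cached in the window.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "window = batch, fire = commit".
// In the real job this logic would live inside a Flink ProcessWindowFunction.
public class WindowCommitSketch {
    private final List<String> buffer = new ArrayList<>();

    // Called for each incoming record while the window is open.
    public void onRecord(String record) {
        buffer.add(record);
    }

    // Called when the window fires: the buffered elements form one batch,
    // and this is where the write + publish to the Hudi timeline would go.
    public String fireWindow() {
        String commit = "commit(" + buffer.size() + " records)";
        buffer.clear();
        return commit;
    }

    public static void main(String[] args) {
        WindowCommitSketch w = new WindowCommitSketch();
        w.onRecord("r1");
        w.onRecord("r2");
        System.out.println(w.fireWindow()); // prints: commit(2 records)
    }
}
```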
bq. How do we run compaction? Can a physical Flink/YARN job for e.g run both
ingestion and compaction concurrently, as we can do with Spark/DeltaStreamer
continuous mode now?
Yes. It seems the ingestion and compaction steps are independent of each other,
and today they simply coexist in the same Spark job. If so, this is also not a
problem in Flink.
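To illustrate the point above with a minimal, hypothetical sketch (not the real Hudi or Flink API): since the two steps are independent, one job/process can schedule them concurrently, the way Spark/DeltaStreamer continuous mode does. Two plain tasks stand in for the ingestion and compaction pipelines here.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration: ingestion and compaction as two independent
// tasks running concurrently inside a single job (process).
public class ConcurrentIngestCompact {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CountDownLatch done = new CountDownLatch(2);

        pool.submit(() -> {          // stands in for the ingestion pipeline
            System.out.println("ingestion running");
            done.countDown();
        });
        pool.submit(() -> {          // stands in for the compaction pipeline
            System.out.println("compaction running");
            done.countDown();
        });

        done.await(10, TimeUnit.SECONDS);
        pool.shutdown();
        System.out.println("both tasks ran in one job");
    }
}
```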
> Integrate Hudi with Apache Flink
> --------------------------------
>
> Key: HUDI-184
> URL: https://issues.apache.org/jira/browse/HUDI-184
> Project: Apache Hudi (incubating)
> Issue Type: New Feature
> Components: Write Client
> Reporter: vinoyang
> Assignee: vinoyang
> Priority: Major
>
> Apache Flink is a popular streaming processing engine.
> Integrating Hudi with Flink is a valuable work.
> The discussion mailing thread is here:
> [https://lists.apache.org/api/source.lua/1533de2d4cd4243fa9e8f8bf057ffd02f2ac0bec7c7539d8f72166ea@%3Cdev.hudi.apache.org%3E]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)