Re: [D] Flink support [incubator-gluten]

via GitHub Tue, 04 Mar 2025 18:10:10 -0800


GitHub user zhanglistar edited a comment on the discussion: Flink support



**Some Ideas:**

1. **Data Structure Adaptation:** Convert RowData to Vector, supporting RowKind.
2. **Operator Support:** Extend Velox Operator to support state and rollback.
3. **State Management:** Integrate Flink's StateBackend into Velox.
4. **Distributed Execution:** Connect Flink's networking and scheduling logic 
to Velox.
5. **Execution Flow:** Replace Flink StreamTask with Velox Task.

1. **State Engine**

Velox does not have an internal state. We can introduce RocksDB as the state, 
which is not too difficult. However, this could become a potential performance 
bottleneck, so careful consideration is necessary. Since RocksDB involves disk 
operations, it is essential to orchestrate IO as asynchronous batch reads.

2. **Stream retract Functionality**

This is one of the fundamental differences between stream and batch processing. 
Streams are updated, while batches are not. For example, a table from MySQL 
naturally has updates and deletes. Flink introduces a stream retract mechanism 
to address this issue, which involves carrying metadata on the data to indicate 
whether it is an update, delete, or insert. Clearly, Velox does not have this 
capability.

Several modifications are needed: 
1. Operators need to support rollback, such as window, stream join, stream 
aggregation, etc. This means that almost all operators will need to add this 
interface, which is a significant amount of work.
2. The data stream needs to include rollback indicators in the vector to 
indicate whether the current data is rollback data.

Once both the operators and the data stream are modified to support rollback, 
the overall execution flow should be fine.

3. **Stateless Operators** such as Map, Filter, FlatMap, KeyBy, Project/Calc, 
Union, Split/Select, and some Sources and Sinks do not depend on state and only 
transform the current input. They are suitable for simple data processing and 
can be reused directly. This is similar to the approach taken by Ant Group, 
which offloads expression calculations to Velox using JNI. However, due to 
potential row-column conversion and low coverage, there may be performance 
issues, and the overall speedup may not be significant.


GitHub link: 
https://github.com/apache/incubator-gluten/discussions/8849#discussioncomment-12395428

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [D] Flink support [incubator-gluten]

Reply via email to