Hudi currently has a component that has not been widely used: the
Transformer. Before raw data lands in the data lake, a very common step
is data preprocessing and ETL; this is also one of the most common use
cases of computing engines such as Flink and Spark. Since Hudi already
builds on the power of these computing engines, it can naturally
leverage their data-preprocessing capabilities as well. We can refactor
the Transformer to make it more flexible. To summarize, we can refactor
it along the following aspects:

   - Decouple Transformer from Spark
   - Enrich the Transformer and provide built-in transformers
   - Support Transformer-chain

For the first point: the Transformer interface is tightly coupled to
Spark by design and carries a Spark-specific context. This makes it
impossible to leverage the transform capabilities of other engines
(such as Flink) once multiple engines are supported. We therefore need
to decouple the interface from Spark.
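
To illustrate, here is a minimal sketch of what an engine-agnostic
interface could look like. EngineContext and the type parameter T are
names invented for this example, not existing Hudi classes:

    import org.apache.hudi.common.config.TypedProperties;

    // Hypothetical engine-neutral context; each engine (Spark, Flink,
    // ...) would supply its own implementation.
    interface EngineContext {
    }

    // Engine-agnostic Transformer sketch: the type parameter T hides
    // the engine-specific dataset representation, e.g. Spark's
    // Dataset<Row> or Flink's DataStream<Row>.
    public interface Transformer<T> {

      T apply(EngineContext context, T input, TypedProperties properties);
    }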

For the second point: we can enhance the Transformer and provide some
out-of-the-box Transformers, such as a FilterTransformer, a
FlatMapTransformer, and so on.
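
As a concrete example, a FilterTransformer could evaluate a
configurable SQL expression against the current Spark-based interface.
This is only a sketch; the property key below is invented for
illustration, not an existing Hudi config:

    import org.apache.hudi.common.config.TypedProperties;
    import org.apache.hudi.utilities.transform.Transformer;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    // Sketch of a built-in FilterTransformer: keeps only the rows that
    // match a SQL expression taken from the properties, e.g.
    // hoodie.transformer.filter.condition = "age > 18".
    public class FilterTransformer implements Transformer {

      private static final String FILTER_CONDITION_PROP =
          "hoodie.transformer.filter.condition";

      @Override
      public Dataset<Row> apply(JavaSparkContext jsc,
          SparkSession sparkSession, Dataset<Row> rowDataset,
          TypedProperties properties) {
        // Dataset.filter accepts a SQL expression string, so the
        // condition is configurable without writing any code.
        return rowDataset.filter(properties.getString(FILTER_CONDITION_PROP));
      }
    }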

For the third point: the most common pattern for data processing is the
pipeline model, and a common implementation of a pipeline is the
chain-of-responsibility pattern (compare Apache Commons Chain [1]).
Combining multiple Transformers in a chain would make data processing
more flexible and extensible.
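
A Transformer-chain could then be just another Transformer that applies
its members in order, feeding the output of one into the next. A rough
sketch (the ChainedTransformer class does not exist yet; it is what
this proposal would add):

    import java.util.Arrays;
    import java.util.List;

    import org.apache.hudi.common.config.TypedProperties;
    import org.apache.hudi.utilities.transform.Transformer;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    // Sketch of a chain-of-responsibility style Transformer: each
    // element of the chain transforms the dataset produced by the
    // previous one.
    public class ChainedTransformer implements Transformer {

      private final List<Transformer> chain;

      public ChainedTransformer(Transformer... transformers) {
        this.chain = Arrays.asList(transformers);
      }

      @Override
      public Dataset<Row> apply(JavaSparkContext jsc,
          SparkSession sparkSession, Dataset<Row> rowDataset,
          TypedProperties properties) {
        Dataset<Row> current = rowDataset;
        for (Transformer transformer : chain) {
          current = transformer.apply(jsc, sparkSession, current, properties);
        }
        return current;
      }
    }

Usage would look like new ChainedTransformer(new FilterTransformer(),
new FlatMapTransformer()), i.e. transformers composed in the order they
should run.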

If we enhance the Transformer component along these lines, Hudi will
provide richer data-processing capabilities on top of the computing
engines.

What do you think?

Any opinions and feedback are welcome and appreciated.

Best,
Vino

[1]: https://commons.apache.org/proper/commons-chain/
