Hi guys,

I'm opening this thread to discuss whether we can separate the attributes
and behaviors of HoodieTable, and rethink the abstraction of the client.

Currently, the hudi-client-common module contains a HoodieTable class,
which holds a set of attributes and behaviors. It has a different
implementation for each engine; the existing classes include:

   - HoodieSparkTable;
   - HoodieFlinkTable;
   - HoodieJavaTable;

In addition, each of these classes is further split for the two table
types, COW and MOR. For example, HoodieSparkTable is split into:

   - HoodieSparkCopyOnWriteTable;
   - HoodieSparkMergeOnReadTable;

HoodieSparkTable itself is reduced to a factory that instantiates these classes.

This model looks clear, but it brings several problems.

First of all, HoodieTable is a mixture of attributes and behaviors. The
attributes are independent of the engines, but the behaviors vary from
engine to engine. Semantically speaking, HoodieTable should belong to
hudi-common, not be tied exclusively to hudi-client-common.

Second, HoodieTable contains behaviors such as:

   - upsert
   - insert
   - delete
   - insertOverwrite

These mirror the APIs provided by the client, yet they are not implemented
directly in HoodieTable. Instead, the implementation is handed over to a
set of actions (executors), such as:

   - commit
   - compact
   - clean
   - rollback

Moreover, these actions do not fully contain the implementation logic
either. Part of it is split out into helper classes under the same
package, such as:

   - SparkWriteHelper
   - SparkMergeHelper
   - SparkDeleteHelper

To sum up, in the name of abstraction the implementation is pushed back
layer by layer (ending up mostly in the executor and helper classes). As a
result, each client needs a large number of similarly patterned classes
just to implement the basic API, and the number of classes keeps growing.
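
To make this concrete, here is a heavily simplified, hypothetical sketch of
how a single upsert call travels through the layers today. The class names
are abbreviated stand-ins for the real chain (write client, table, action
executor, write helper), and the bodies are placeholders, not actual Hudi
code:

    // Hypothetical stand-ins; each layer mostly forwards to the next one.
    class WriteClient {                        // ~ the engine's write client
      Object upsert(Object records) {
        return Table.create().upsert(records); // 1. client -> table
      }
    }

    abstract class Table {                     // ~ HoodieSparkTable
      static Table create() {                  // merely a factory
        boolean cow = true;                    // assume COW for the sketch
        return cow ? new CopyOnWriteTable() : new MergeOnReadTable();
      }
      abstract Object upsert(Object records);
    }

    class CopyOnWriteTable extends Table {     // ~ HoodieSparkCopyOnWriteTable
      Object upsert(Object records) {
        return new UpsertExecutor().execute(records); // 2. table -> executor
      }
    }

    class MergeOnReadTable extends Table {     // ~ HoodieSparkMergeOnReadTable
      Object upsert(Object records) {
        return new UpsertExecutor().execute(records);
      }
    }

    class UpsertExecutor {                     // ~ a commit action executor
      Object execute(Object records) {
        return WriteHelper.write(records);     // 3. executor -> helper
      }
    }

    class WriteHelper {                        // ~ SparkWriteHelper
      static Object write(Object records) {
        return records;                        // dedupe, tag location, write
      }
    }

Four layers, and every engine and table type repeats the same skeleton.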

Let us reorganize it:

What a write client does is insert or upsert a batch of records into a
table with transaction semantics, and provide some additional operations
on the table. This involves three components:

   - Two objects: a table, a batch of records;
   - One type of operation: insert or upsert (focus on records)
   - One type of additional operation: compact / clean (focus on the table
   itself)

Therefore, the following improvements are proposed here:

   - The table object contains no behavior; it should be public and
   engine-independent;
   - The operation behaviors are classified and abstracted (see the sketch
   after this list):
      - TableInsertOperation(interface)
      - TableUpsertOperation(interface)
      - TableTransactionOperation
      - TableManageOperation(compact/clean…)
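
As a rough illustration, the interfaces could look like the following. All
signatures, type parameters, and the HoodieTableMeta placeholder are
assumptions for discussion, not final APIs:

    // Engine-agnostic operation interfaces; I and O are the engine's
    // input/output collection types (e.g., an RDD of records on Spark).
    interface TableInsertOperation<I, O> {
      O insert(HoodieTableMeta table, I records, String instantTime);
    }

    interface TableUpsertOperation<I, O> {
      O upsert(HoodieTableMeta table, I records, String instantTime);
    }

    interface TableTransactionOperation {
      void begin(String instantTime);
      void commit(String instantTime);
      void rollback(String instantTime);
    }

    interface TableManageOperation {
      void compact(String instantTime);
      void clean(String instantTime);
    }

    // The table itself is a plain, engine-independent value object that
    // could live in hudi-common; placeholder for metadata/schema/config.
    class HoodieTableMeta {
      String tableType() { return "COPY_ON_WRITE"; } // or MERGE_ON_READ
    }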

This kind of abstraction is more intuitive and focused, so that each
behavior has exactly one point of materialization. For example, for the
insert operation the Spark engine would produce the following concrete
implementation classes:

   - CoWTableSparkInsertOperation;
   - MoRTableSparkInsertOperation;

Optionally, we could also provide a factory class named
TableSparkInsertOperation to pick the right implementation, as sketched below.
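
Reusing the hypothetical interfaces above, the Spark-side materialization
might look like this (the record and result types are placeholders for the
real Spark types):

    class CoWTableSparkInsertOperation implements TableInsertOperation<Object, Object> {
      public Object insert(HoodieTableMeta table, Object records, String instantTime) {
        return records; // write new base files
      }
    }

    class MoRTableSparkInsertOperation implements TableInsertOperation<Object, Object> {
      public Object insert(HoodieTableMeta table, Object records, String instantTime) {
        return records; // append deltas to log files
      }
    }

    // Optional factory: the only place that switches on the table type.
    class TableSparkInsertOperation {
      static TableInsertOperation<Object, Object> of(HoodieTableMeta table) {
        return "COPY_ON_WRITE".equals(table.tableType())
            ? new CoWTableSparkInsertOperation()
            : new MoRTableSparkInsertOperation();
      }
    }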

Based on the new abstraction, a new engine only needs to reimplement the
interfaces of the above behaviors and then provide a new client that
instantiates them, for example:
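
For instance, a hypothetical new engine ("Foo") would only supply its
operation implementations plus a thin client that wires them together; the
names below are made up for illustration:

    class FooTableInsertOperation implements TableInsertOperation<Object, Object> {
      public Object insert(HoodieTableMeta table, Object records, String instantTime) {
        return records; // engine-specific write path goes here
      }
    }

    class FooTableManageOperation implements TableManageOperation {
      public void compact(String instantTime) { /* engine-specific compaction */ }
      public void clean(String instantTime)   { /* engine-specific cleaning */ }
    }

    // The client is just composition; no per-engine table subclasses,
    // executors, or helper classes are required.
    class FooWriteClient {
      private final TableInsertOperation<Object, Object> insertOp =
          new FooTableInsertOperation();
      private final TableManageOperation manageOp = new FooTableManageOperation();

      Object insert(HoodieTableMeta table, Object records, String instantTime) {
        return insertOp.insert(table, records, instantTime);
      }
      void compact(String instantTime) { manageOp.compact(instantTime); }
      void clean(String instantTime)   { manageOp.clean(instantTime); }
    }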

In order to stay focused here, I have deliberately ignored an important
object: the index. The index should also live in the hudi-common module;
its implementation may be engine-related, providing acceleration for both
writing and querying.

The above is just a preliminary idea; many details have not been
considered yet. I hope to hear your thoughts on this.

Any opinions and thoughts are appreciated and welcome.

Best,
Vino
