[DISCUSS] Compactor (Query vs MR) roadmap

Stamatis Zampetakis Mon, 31 Jan 2022 14:03:00 -0800

Hi all,

In the current master, there are two approaches for performing compactions
of ACID tables [1]:
* using hard-coded MapReduce jobs (aka. CompactorMR [2]);
* using HiveQL queries (aka. QueryCompactor [3]) and delegating the
  execution to the underlying engine (MR, Tez, other).

The motivation for introducing the query compactor was to make compaction
tasks engine independent, and potentially more efficient. In principle the
query based compaction should be able to completely replace the respective
MR jobs but it appears that it is not there yet.

At the moment of writing this email the two compactor modes are
complementary to each other. Compactions on insert-only tables (aka.
micro-managed tables) can only be done using the query compactor.
Moreover, query-based compactions on ACID tables work only when the
underlying engine is Tez (various bugs [4] seem to be blocking the use of
MR as an execution engine). The latter means that if someone is using MR as
the execution engine they cannot use the query based compactor. Certain
features (e.g., per-table selection of compaction queues [5]) exist for one
mode (and apparently are important for end users) but are not yet
implemented for the other.
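
To make the constraints above easier to check, here is a minimal sketch of
the compatibility rules as described in this email. It is not Hive code: the
function name, the table-type labels ("insert_only", "full_acid") and the mode
labels ("mr", "query") are hypothetical names chosen for illustration. The
email does not say whether the engine matters for insert-only tables, so the
sketch assumes it does not.

```python
from typing import Set


def usable_compactors(table_type: str, engine: str) -> Set[str]:
    """Return the compactor modes usable for a table type and engine.

    Encodes only the rules stated in this email; all labels are
    hypothetical and do not correspond to Hive identifiers.
    """
    engine = engine.lower()
    if table_type == "insert_only":
        # Insert-only tables can only be compacted by the query compactor.
        # Assumption: no engine restriction, since none is stated here.
        return {"query"}
    if table_type == "full_acid":
        # The MR compactor is available for full ACID tables.
        modes = {"mr"}
        # Query-based compaction on ACID tables works only on Tez.
        if engine == "tez":
            modes.add("query")
        return modes
    raise ValueError(f"unknown table type: {table_type!r}")


if __name__ == "__main__":
    for table_type in ("insert_only", "full_acid"):
        for engine in ("mr", "tez"):
            print(table_type, engine, sorted(usable_compactors(table_type, engine)))
```

If my understanding is wrong for any of these combinations, corrections are
welcome.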

Currently the query based compactor is not part of any Apache Hive release,
but it would be nice if someone could shed some light on the roadmap around
this feature. I tried to summarize very briefly the state of this work
based on my understanding but I am sure people who have worked on these
areas of the code can provide much better insights. Some quick questions
that come to mind are the following:
* Is there going to be support for the MR based compactor in the next
  releases of Hive?
* Is the query based compactor going to work with an engine other than Tez?
  Is someone working on this?
* Are there benefits in using the MR based compactor when the query based
  compactor is available?
* Are there major features that are not yet part of the query based
  compactor (and need to be)?

Finally, I don't see any documentation around the "new" query based
compaction mode in the wiki [6]. I think it would be good if someone can
update the respective part of the documentation before releasing the next
Hive version.

Best,
Stamatis

[1] HIVE-5317: Implement insert, update, and delete in Hive with full ACID
support
[2] HIVE-6319: Add compactor for ACID tables (Apr, 2014)
[3] HIVE-20699: Query based compactor for full CRUD Acid tables (Feb, 2019)
[4] HIVE-24015: Disable query-based compaction on MR execution engine
(Karen Coppage, reviewed by Laszlo Pinter)
[5] HIVE-20723: Allow per table specification of compaction yarn queue
[6] https://cwiki.apache.org/confluence/display/hive/hive+transactions#HiveTransactions-Compactor
