Hi all,

In the current master, there are two approaches for performing compactions of ACID tables [1]:
* using hard-coded MapReduce jobs (aka. CompactorMR [2]);
* using HiveQL queries (aka. QueryCompactor [3]) and delegating the execution to the underlying engine (MR, Tez, other).

The motivation for introducing the query compactor was to make compaction tasks engine independent, and potentially more efficient. In principle, query-based compaction should be able to completely replace the respective MR jobs, but it appears that it is not there yet.

At the time of writing, the two compactor modes are complementary to each other:
* Compactions on insert-only tables (aka. micromanaged tables) can only be done using the query compactor.
* Query-based compactions on ACID tables work only when the underlying engine is Tez (various bugs [4] seem to be blocking the use of MR as an execution engine). This means that anyone using MR as the execution engine cannot use the query-based compactor.
* Certain features (e.g., per-table selection of compaction queues [5]) exist for one mode (and apparently are important for end users) but are not yet implemented for the other.

Currently the query-based compactor is not part of any Apache Hive release, but it would be nice if someone could shed some light on the roadmap around this feature. I tried to summarize very briefly the state of this work based on my understanding, but I am sure people who have worked on these areas of the code can provide much better insights.

Some quick questions that come to mind are the following:
* Is there going to be support for the MR-based compactor in the next releases of Hive?
* Is the query-based compactor going to work with an engine other than Tez? Is someone working on this?
* Are there benefits in using the MR-based compactor when the query-based compactor is available?
* Are there major features that are not yet part of the query-based compactor (and need to be)?

Finally, I don't see any documentation around the "new" query-based compaction mode in the wiki [6]. I think it would be good if someone could update the respective part of the documentation before releasing the next Hive version.

Best,
Stamatis

[1] HIVE-5317: Implement insert, update, and delete in Hive with full ACID support
[2] HIVE-6319: Add compactor for ACID tables (Apr, 2014)
[3] HIVE-20699: Query based compactor for full CRUD Acid tables (Feb, 2019)
[4] HIVE-24015: Disable query-based compaction on MR execution engine (Karen Coppage, reviewed by Laszlo Pinter)
[5] HIVE-20723: Allow per table specification of compaction yarn queue
[6] https://cwiki.apache.org/confluence/display/hive/hive+transactions#HiveTransactions-Compactor