This is an automated email from the ASF dual-hosted git repository.
zabetak pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/hive-site.git
The following commit(s) were added to refs/heads/main by this push:
new 9ae3cda6 HIVE-29621: Duplicate Hive transactions page (#105)
9ae3cda6 is described below
commit 9ae3cda69fec2151a905a382c61778d670fef0b8
Author: Stamatis Zampetakis <[email protected]>
AuthorDate: Fri May 29 13:00:55 2026 +0200
HIVE-29621: Duplicate Hive transactions page (#105)
1. Unify content into hive-transactions.md page.
2. Drop Hive-Transactions-ACID.md page since there are no links to it.
---
content/docs/latest/user/Hive-Transactions-ACID.md | 279 ---------------------
content/docs/latest/user/hive-transactions.md | 69 +++--
2 files changed, 48 insertions(+), 300 deletions(-)
diff --git a/content/docs/latest/user/Hive-Transactions-ACID.md
b/content/docs/latest/user/Hive-Transactions-ACID.md
deleted file mode 100644
index 438b1b02..00000000
--- a/content/docs/latest/user/Hive-Transactions-ACID.md
+++ /dev/null
@@ -1,279 +0,0 @@
----
-title: "Apache Hive : Hive Transactions (Hive ACID)"
-date: 2024-12-12
----
-
-# Apache Hive : Hive Transactions (Hive ACID)
-
-## What is ACID and why should you use it?
-
-ACID stands for four traits of database transactions: Atomicity (an operation
either succeeds completely or fails, it does not leave partial data),
Consistency (once an application performs an operation the results of that
operation are visible to it in every subsequent operation),
[Isolation](https://en.wikipedia.org/wiki/Isolation_(database_systems)) (an
incomplete operation by one user does not cause unexpected side effects for
other users), and Durability (once an operation is compl [...]
-
-Transactions with ACID semantics have been added to Hive to address the
following use cases:
-
-1. Streaming ingest of data. Many users have tools such as [Apache
Flume](http://flume.apache.org/), [Apache
Storm](https://storm.incubator.apache.org/), or [Apache
Kafka](http://kafka.apache.org/) that they use to stream data into their Hadoop
cluster. While these tools can write data at rates of hundreds or more rows
per second, Hive can only add partitions every fifteen minutes to an hour.
Adding partitions more often leads quickly to an overwhelming number of
partitions in the tab [...]
-2. Slow changing dimensions. In a typical star schema data warehouse,
dimensions tables change slowly over time. For example, a retailer will open
new stores, which need to be added to the stores table, or an existing store
may change its square footage or some other tracked characteristic. These
changes lead to inserts of individual records or updates of records (depending
on the strategy chosen).
-3. Data restatement. Sometimes collected data is found to be incorrect and
needs correction. Or the first instance of the data may be an approximation
(90% of servers reporting) with the full data provided later. Or business
rules may require that certain transactions be restated due to subsequent
transactions (e.g., after making a purchase a customer may purchase a
membership and thus be entitled to discount prices, including on the previous
purchase). Or a user may be contractually [...]
-4. Bulk updates using [SQL
MERGE](/docs/latest/language/languagemanual-dml#merge) statement.
-
-## Limitations
-
-* *BEGIN*, *COMMIT*, and *ROLLBACK* are not yet supported. All language
operations are auto-commit.
-* Only [ORC file format]({{< ref "languagemanual-orc" >}}) is supported. The
feature has been built such that transactions can be used by any storage format
that can determine how updates or deletes apply to base records (basically,
that has an explicit or implicit row id), but so far the integration work has
only been done for ORC.
-* By default transactions are configured to be off. See the
[Configuration]({{< ref "#configuration" >}}) section below for a discussion of
which values need to be set to configure it.
-* Tables must be [bucketed]({{< ref "languagemanual-ddl-bucketedtables" >}})
to make use of these features. Tables in the same system not using
transactions and ACID do not need to be bucketed.
-* Reading/writing to an ACID table from a non-ACID session is not allowed. In
other words, the Hive transaction manager must be set to
org.apache.hadoop.hive.ql.lockmgr.DbTxnManager in order to work with ACID
tables.
-* At this time only snapshot level isolation is supported. When a given query
starts it will be provided with a consistent snapshot of the data. There is no
support for dirty read, read committed, repeatable read, or serializable. With
the introduction of BEGIN the intention is to support snapshot isolation for
the duration of transaction rather than just a single query. Other isolation
levels may be added depending on user requests.
-* The existing ZooKeeper and in-memory lock managers are not compatible with
transactions. There is no intention to address this issue. See [Basic
Design]({{< ref "#basic-design" >}}) below for a discussion of how locks are
stored for transactions.
-* Using Oracle as the Metastore DB and
"datanucleus.connectionPoolingType=BONECP" may generate intermittent "No such
lock.." and "No such transaction..." errors. Setting
"datanucleus.connectionPoolingType=DBCP" is recommended in this case.
-* [LOAD
DATA...](/docs/latest/language/languagemanual-dml#loading-files-into-tables)
statement is not supported with transactional tables. (This was not properly
enforced until [HIVE-16732](https://issues.apache.org/jira/browse/HIVE-16732))
-
-## Streaming APIs
-
-Hive offers APIs for streaming data ingest and streaming mutation:
-
-* [Hive HCatalog Streaming API]({{< ref "streaming-data-ingest" >}})
-* [Hive Streaming API](/docs/latest/user/streaming-data-ingest-v2) (Since Hive
3)
-* [HCatalog Streaming Mutation API (Copy)]({{< ref
"HCatalog-Streaming-Mutation-API" >}}) (available in Hive 2.0.0 and later)
-
-A comparison of these two APIs is available in the [Background]({{< ref
"#background" >}}) section of the Streaming Mutation document.
-
-## Grammar Changes
-
-*INSERT...VALUES, UPDATE*, and *DELETE* have been added to the SQL grammar,
starting in [Hive 0.14](https://issues.apache.org/jira/browse/HIVE-5317). See
[LanguageManual DML]({{< ref "languagemanual-dml" >}}) for details.
-
-Several new commands have been added to Hive's DDL in support of ACID and
transactions, plus some existing DDL has been modified.
-
-A new command *SHOW TRANSACTIONS* has been added, see [Show Transactions]({{<
ref "#show-transactions" >}}) for details.
-
-A new command *SHOW COMPACTIONS* has been added, see [Show Compactions]({{<
ref "#show-compactions" >}}) for details.
-
-The *SHOW LOCKS* command has been altered to provide information about the new
locks associated with transactions. If you are using the ZooKeeper or
in-memory lock managers you will notice no difference in the output of this
command. See [Show Locks]({{< ref "#show-locks" >}}) for details.
-
-A new option has been added to *ALTER TABLE* to request a compaction of a
table or partition. In general users do not need to request compactions, as
the system will detect the need for them and initiate the compaction. However,
if [compaction is turned off]({{< ref "#compaction-is-turned-off" >}}) for a
table or a user wants to compact the table at a time the system would not
choose to, *ALTER TABLE* can be used to initiate the compaction. See [Alter
Table/Partition Compact]({{< ref [...]
-
-A new command *ABORT TRANSACTIONS* has been added, see [Abort
Transactions](/docs/latest/language/languagemanual-ddl#abort-transactions) for
details.
-
-## Basic Design
-
-HDFS does not support in-place changes to files. It also does not offer read
consistency in the face of writers appending to files being read by a user. In
order to provide these features on top of HDFS we have followed the standard
approach used in other data warehousing tools. Data for the table or partition
is stored in a set of base files. New records, updates, and deletes are stored
in delta files. A new set of delta files is created for each transaction (or
in the case of stre [...]
-
-### Base and Delta Directories
-
-Previously all files for a partition (or a table if the table is not
partitioned) lived in a single directory. With these changes, any partitions
(or tables) written with an ACID aware writer will have a directory for the
base files and a directory for each set of delta files. Here is what this may
look like for an unpartitioned table "t":
-
-**Filesystem Layout for Table "t"**
-
-```
-hive> dfs -ls -R /user/hive/warehouse/t;
-drwxr-xr-x - ekoifman staff 0 2016-06-09 17:03
/user/hive/warehouse/t/base_0000022
--rw-r--r-- 1 ekoifman staff 602 2016-06-09 17:03
/user/hive/warehouse/t/base_0000022/bucket_00000
-drwxr-xr-x - ekoifman staff 0 2016-06-09 17:06
/user/hive/warehouse/t/delta_0000023_0000023_0000
--rw-r--r-- 1 ekoifman staff 611 2016-06-09 17:06
/user/hive/warehouse/t/delta_0000023_0000023_0000/bucket_00000
-drwxr-xr-x - ekoifman staff 0 2016-06-09 17:07
/user/hive/warehouse/t/delta_0000024_0000024_0000
--rw-r--r-- 1 ekoifman staff 610 2016-06-09 17:07
/user/hive/warehouse/t/delta_0000024_0000024_0000/bucket_00000
-```
-
-### Compactor
-
-The Compactor is a set of background processes running inside the Metastore to
support the ACID system. It consists of the Initiator, Worker, Cleaner,
AcidHouseKeeperService, and a few others.
-
-#### Delta File Compaction
-
-As operations modify the table, more and more delta files are created and need
to be compacted to maintain adequate performance. There are three types of
compaction: minor, major, and rebalance.
-
-* **Minor** compaction takes a set of existing delta files and rewrites them
to a single delta file per bucket.
-* **Major** compaction takes one or more delta files and the base file for the
bucket and rewrites them into a new base file per bucket. Major compaction is
more expensive but is more effective.
-* More information about **rebalance** compaction can be found here:
[Rebalance compaction]({{< ref "rebalance-compaction" >}})
-
-All compactions are done in the background. Minor and major compactions do not
prevent concurrent reads and writes of the data. Rebalance compaction uses an
exclusive write lock and therefore prevents concurrent writes. After a
compaction the system waits until all readers of the old files have finished
and then removes the old files.
-
-#### Initiator
-
-This module is responsible for discovering which tables or partitions are due
for compaction. This should be enabled in a Metastore using
[hive.compactor.initiator.on]({{< ref "#hive-compactor-initiator-on" >}}).
There are several properties of the form *.threshold in "New Configuration
Parameters for Transactions" table below that control when a compaction task is
created and which type of compaction is performed. Each compaction task
handles 1 partition (or whole table if the table [...]
-
-#### Worker
-
-Each Worker handles a single compaction task. A compaction is a MapReduce job
with name in the following form: <hostname>-compactor-<db>.<table>.<partition>.
Each worker submits the job to the cluster (via [hive.compactor.job.queue]({{<
ref "#hive-compactor-job-queue" >}}) if defined) and waits for the job to
finish. [hive.compactor.worker.threads]({{< ref
"#hive-compactor-worker-threads" >}}) determines the number of Workers in each
Metastore. The total number of Workers in the Hive [...]
-
-#### Cleaner
-
-This process deletes delta files after compaction, once it determines that
they are no longer needed.
-
-#### AcidHouseKeeperService
-
-This process looks for transactions that have not heartbeated in
[hive.txn.timeout]({{< ref "#hive-txn-timeout" >}}) time and aborts them. The
system assumes that a client that initiated a transaction and stopped heartbeating
has crashed, and that the resources it locked should be released.
-
-#### SHOW COMPACTIONS
-
-This command displays information about currently running compactions and the
recent history (configurable retention period) of compactions. This history
display is available since
[HIVE-12353](https://issues.apache.org/jira/browse/HIVE-12353).
-
-Also see [LanguageManual DDL#ShowCompactions]({{< ref
"#languagemanual-ddl#showcompactions" >}}) for more information on the output
of this command and [NewConfigurationParametersforTransactions]({{< ref
"#newconfigurationparametersfortransactions" >}})/Compaction History for
configuration properties affecting the output of this command. The system
retains the last N entries of each type: failed, succeeded, attempted (where N
is configurable for each type).
-
-
-
-### Transaction/Lock Manager
-
-A new logical entity called "transaction manager" was added which
incorporated previous notion of "database/table/partition lock manager"
(hive.lock.manager with default of
org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager). The
transaction manager is now additionally responsible for managing of
transactions locks. The default DummyTxnManager emulates behavior of old Hive
versions: has no transactions and uses hive.lock.manager property to create
lock manager for tabl [...]
-
-The length of time that the DbLockManager will continue to try to acquire locks
can be controlled via [hive.lock.numretires](http://Configuration
Properties#hive.lock.numretires) and
[hive.lock.sleep.between.retries](http://Configuration
Properties#hive.lock.sleep.between.retries). When the DbLockManager cannot
acquire a lock (due to existence of a competing lock), it will back off and try
again after a certain time period. In order to support short running queries
and not overwhelm the [...]
-
-More [details](/docs/latest/language/languagemanual-ddl#show-locks) on locks
used by this Lock Manager.
-
-Note that the lock manager used by DbTxnManager will acquire locks on all
tables, even those without the "transactional=true" property. By default, an Insert
operation into a non-transactional table will acquire an exclusive lock and
thus block other inserts and reads. While technically correct, this is a
departure from how Hive traditionally worked (i.e., without a lock manager). For
backwards compatibility, [hive.txn.strict.locking.mode](http://Configuration
Properties#hive.txn.strict.locking.mo [...]
-
-## Configuration
-
-Minimally, these configuration parameters must be set appropriately to turn on
transaction support in Hive:
-
-Client Side
-
-* [hive.support.concurrency]({{< ref "#hive-support-concurrency" >}}) – true
-* [hive.exec.dynamic.partition.mode]({{< ref
"#hive-exec-dynamic-partition-mode" >}}) – nonstrict
-* [hive.txn.manager]({{< ref "#hive-txn-manager" >}}) –
org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
-
-Server Side (Metastore)
-
-* [hive.compactor.initiator.on]({{< ref "#hive-compactor-initiator-on" >}}) –
true (See table below for more details)
-* [hive.compactor.cleaner.on]({{< ref "#hive-compactor-cleaner-on" >}}) – true
(See table below for more details)
-* [hive.compactor.worker.threads]({{< ref "#hive-compactor-worker-threads"
>}}) – a positive number on at least one instance of the Thrift metastore
service
-
-The following sections list all of the configuration parameters that affect
Hive transactions and compaction. Also see [Limitations]({{< ref
"#limitations" >}}) above and [Table Properties]({{< ref "#table-properties"
>}}) below.
-
-### New Configuration Parameters for Transactions
-
-A number of new configuration parameters have been added to the system to
support transactions.
-
-| **Configuration key** | **Values** | **Location** | **Notes** |
-| --- | --- | --- | --- |
-| [hive.txn.manager]({{< ref "#hive-txn-manager" >}}) | *Default:*
org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager*Value required for
transactions:* org.apache.hadoop.hive.ql.lockmgr.DbTxnManager |
Client/HiveServer2 | DummyTxnManager replicates pre Hive-0.13 behavior and
provides no transactions. |
-| [hive.txn.strict.locking.mode]({{< ref "#hive-txn-strict-locking-mode" >}})
| *Default:* true | Client/ HiveServer2 | In strict mode non-ACID resources use
standard R/W lock semantics, e.g. INSERT will acquire exclusive lock. In
non-strict mode, for non-ACID resources, INSERT will only acquire shared lock,
which allows two concurrent writes to the same partition but still lets lock
manager prevent DROP TABLE etc. when the table is being written to (as of [Hive
2.2.0](https://issues.apa [...]
-| [hive.txn.timeout]({{< ref "#hive-txn-timeout" >}}) deprecated. Use
metastore.txn.timeout instead | *Default:* 300 | Client/HiveServer2/Metastore
| Time after which transactions are declared aborted if the client has not sent
a heartbeat, in seconds. It's critical that this property has the same value
for all components/services.5 |
-|
[hive.txn.heartbeat.threadpool.size](/docs/latest/user/configuration-properties#hivetxnheartbeatthreadpoolsize)
deprecated - but still in use | *Default:* 5 | Client/HiveServer2 | The number
of threads to use for heartbeating (as of [Hive 1.3.0 and
2.0.0](https://issues.apache.org/jira/browse/HIVE-12366)). |
-| [hive.timedout.txn.reaper.start]({{< ref "#hive-timedout-txn-reaper-start"
>}}) deprecated | *Default:* 100s | Metastore | Time delay of first reaper (the
process which aborts timed-out transactions) run after the metastore starts (as
of [Hive 1.3.0](https://issues.apache.org/jira/browse/HIVE-11317)). Controls
AcidHouseKeeperService above. |
-| [hive.timedout.txn.reaper.interval]({{< ref
"#hive-timedout-txn-reaper-interval" >}}) deprecated | *Default:* 180s |
Metastore | Time interval describing how often the reaper (the process which
aborts timed-out transactions) runs (as of [Hive
1.3.0](https://issues.apache.org/jira/browse/HIVE-11317)). Controls
AcidHouseKeeperService above. |
-| [hive.txn.max.open.batch]({{< ref "#hive-txn-max-open-batch" >}})
deprecated. Use metastore.txn.max.open.batch instead | *Default:* 1000 | Client
| Maximum number of transactions that can be fetched in one call to
open_txns().1 |
-| [hive.max.open.txns]({{< ref "#hive-max-open-txns" >}}) deprecated. Use
metastore.max.open.txns instead. | *Default:* 100000 | HiveServer2/ Metastore |
Maximum number of open transactions. If current open transactions reach this
limit, future open transaction requests will be rejected, until the number goes
below the limit. (As of [Hive 1.3.0 and
2.1.0](https://issues.apache.org/jira/browse/HIVE-13249).) |
-| [hive.count.open.txns.interval]({{< ref "#hive-count-open-txns-interval"
>}}) deprecated. Use metastore.count.open.txns.interval instead. | *Default:*
1s | HiveServer2/ Metastore | Time in seconds between checks to count open
transactions (as of [Hive 1.3.0 and
2.1.0](https://issues.apache.org/jira/browse/HIVE-13249)). |
-| [hive.txn.retryable.sqlex.regex]({{< ref "#hive-txn-retryable-sqlex-regex"
>}}) deprecated. Use metastore.txn.retryable.sqlex.regex instead. | *Default:*
"" (empty string) | HiveServer2/ Metastore | Comma separated list of regular
expression patterns for SQL state, error code, and error message of retryable
SQLExceptions, that's suitable for the Hive metastore database (as of [Hive
1.3.0 and 2.1.0](https://issues.apache.org/jira/browse/HIVE-12637)). For an
example, see [Configuration Pr [...]
-| hive.compaction.merge.enabled | *Default:* false | HiveServer2 | Enables
merge-based compaction which is a compaction optimization when few ORC delta
files are present |
-| hive.compactor.initiator.duration.update.interval | *Default:* 60s |
HiveServer2 | Time in seconds that drives the update interval of
compaction_initiator_duration metric. A smaller value results in a finer-grained
metric update. This updater can be turned off by setting its value to less than or equal
to zero. In this case the above metric will be updated only after the initiator
has completed one cycle. The hive.compactor.initiator.on must be turned on (true)
in order to enable the Initiator, otherwise thi [...]
-| [hive.compactor.initiator.on]({{< ref "#hive-compactor-initiator-on" >}})
deprecated. Use metastore.compactor.initiator.on instead. | *Default:*
false*Value required for transactions:* true (for exactly one instance of the
Thrift metastore service) | Metastore | Whether to run the initiator thread on
this metastore instance. Prior to [Hive
1.3.0](https://issues.apache.org/jira/browse/HIVE-11388) it's critical that
this is enabled on exactly one standalone metastore service instance (no [...]
-| hive.compactor.cleaner.duration.update.interval | *Default:* 60s |
HiveServer2 | Time in seconds that drives the update interval of
compaction_cleaner_duration metric. A smaller value results in a finer-grained
metric update. This updater can be turned off by setting its value to less than or equal
to zero. In this case the above metric will be updated only after the cleaner
has completed one cycle. |
-| [hive.compactor.cleaner.on]({{< ref "#hive-compactor-cleaner-on" >}})
deprecated. Use metastore.compactor.cleaner.on instead. | *Default:*
false*Value required for transactions:* true (for exactly one instance of the
Thrift metastore service) | Metastore | Whether to run the cleaner thread on
this metastore instance. Before **Hive 4.0.0** Cleaner thread can be
started/stopped with config hive.compactor.initiator.on. This config helps to
enable/disable initiator/cleaner threads independently |
-| hive.compactor.cleaner.threads.num | *Default:* 1 | HiveServer2 | Enables
parallelization of the cleaning directories after compaction, that includes
many file related checks and may be expensive |
-| hive.compactor.compact.insert.only | *Default:* true | HiveServer2 | Whether
the compactor should compact insert-only tables. A safety switch. |
-| hive.compactor.crud.query.based | *Default*: false | HiveServer2 | Means
compaction on full CRUD tables is done via queries. Compactions on insert-only
tables will always run via queries regardless of the value of this
configuration. |
-| hive.compactor.gather.stats | *Default:* true | HiveServer2 | If set to true,
MAJOR compaction will gather stats if there are stats already associated with
the table/partition. Turn this off to save some resources if the stats are not
used anyway. This is a replacement for the HIVE_MR_COMPACTOR_GATHER_STATS
config, and works for both MR and query-based compaction. |
-| metastore.compactor.initiator.failed.retry.time | *Default:* 7d | Metastore
| Time after which the Initiator will ignore
metastore.compactor.initiator.failed.compacts.threshold and retry
compaction. This attempts to auto-heal tables with previously failed
compactions without manual intervention. Setting it to 0 or a negative value will
disable this feature. |
-| metastore.compactor.long.running.initiator.threshold.warning | *Default:* 6h
| Metastore | Initiator cycle duration after which a warning will be logged.
Default time unit is: hours |
-| metastore.compactor.long.running.initiator.threshold.error | *Default:* 12h
| Metastore | Initiator cycle duration after which an error will be logged.
Default time unit is: hours |
-| hive.compactor.worker.sleep.time | *Default:* 10800ms | HiveServer2 | Time in
milliseconds for which a worker thread goes to sleep before starting another
iteration in case of no launched job or error |
-| hive.compactor.worker.max.sleep.time | *Default:* 320000ms | HiveServer2 |
Max time in milliseconds for which a worker thread goes to sleep before
starting another iteration, used for backoff in case of no launched job or error
|
-| [hive.compactor.worker.threads]({{< ref "#hive-compactor-worker-threads"
>}}) deprecated. Use metastore.compactor.worker.threads instead. | *Default:*
0*Value required for transactions:* > 0 on at least one instance of the Thrift
metastore service | Metastore | How many compactor worker threads to run on
this metastore instance.2 |
-| [hive.compactor.worker.timeout]({{< ref "#hive-compactor-worker-timeout"
>}}) | *Default:* 86400s | Metastore | Time in seconds after which a compaction
job will be declared failed and the compaction re-queued. |
-| [hive.compactor.cleaner.run.interval]({{< ref
"#hive-compactor-cleaner-run-interval" >}}) | *Default*: 5000ms | Metastore |
Time in milliseconds between runs of the cleaner thread. ([Hive
0.14.0](https://issues.apache.org/jira/browse/HIVE-8258) and later.) |
-| [hive.compactor.check.interval]({{< ref "#hive-compactor-check-interval"
>}}) | *Default:* 300s | Metastore | Time in seconds between checks to see if
any tables or partitions need to be compacted.3 |
-| [hive.compactor.delta.num.threshold]({{< ref
"#hive-compactor-delta-num-threshold" >}}) | *Default:* 10 | Metastore | Number
of delta directories in a table or partition that will trigger a minor
compaction. |
-| [hive.compactor.delta.pct.threshold]({{< ref
"#hive-compactor-delta-pct-threshold" >}}) | *Default:* 0.1 | Metastore |
Percentage (fractional) size of the delta files relative to the base that will
trigger a major compaction. 1 = 100%, so the default 0.1 = 10%. |
-| [hive.compactor.abortedtxn.threshold]({{< ref
"#hive-compactor-abortedtxn-threshold" >}}) | *Default:* 1000 | Metastore |
Number of aborted transactions involving a given table or partition that will
trigger a major compaction. |
-| [hive.compactor.aborted.txn.time.threshold]({{< ref
"#hive-compactor-aborted-txn-time-threshold" >}}) | *Default*: 12h | Metastore
| Age of table/partition's oldest aborted transaction when compaction will be
triggered. Default time unit is: hours. Set to a negative number to disable. |
-| [hive.compactor.max.num.delta]({{< ref "#hive-compactor-max-num-delta" >}})
| Default: 500 | Metastore | Maximum number of delta files that the compactor
will attempt to handle in a single job (as of [Hive
1.3.0](https://issues.apache.org/jira/browse/HIVE-11540)).4 |
-| [hive.compactor.job.queue]({{< ref "#hive-compactor-job-queue" >}}) |
*Default*: "" (empty string) | Metastore | Used to specify name of Hadoop
queue to which Compaction jobs will be submitted. Set to empty string to let
Hadoop choose the queue (as of [Hive
1.3.0](https://issues.apache.org/jira/browse/HIVE-11997)). |
-| hive.compactor.request.queue | *Default:* 1 | HiveServer2 | Enables
parallelization of the checkForCompaction operation, that includes many file
metadata checks and may be expensive |
-| hive.split.grouping.mode | *Default:* query (Allowed values: query,
compactor) | HiveServer2 | This is set to compactor from within the query based
compactor. This enables the Tez SplitGrouper to group splits based on their
bucket number, so that all rows from different bucket files for the same
bucket number can end up in the same bucket file after the compaction. |
-| hive.txn.xlock.iow | Default: true | HiveServer2 | Ensures commands with
OVERWRITE (such as INSERT OVERWRITE) acquire Exclusive locks for transactional
tables. This ensures that inserts (w/o overwrite) running concurrently are not
hidden by the INSERT OVERWRITE. |
-| hive.txn.xlock.write | *Default*: true | HiveServer2 | Manages concurrency
levels for ACID resources. Provides better level of query parallelism by
enabling shared writes and write-write conflict resolution at the commit step.-
If true - exclusive writes are used: - INSERT OVERWRITE acquires EXCLUSIVE
locks - UPDATE/DELETE acquire EXCL_WRITE locks - INSERT acquires SHARED_READ
locks- If false - shared writes, transaction is aborted in case of conflicting
changes: - INSERT OVERWRITE [...]
-| metastore.acidmetrics.ext.on | *Default:* true | HiveServer2 | Whether to
collect additional acid related metrics outside of the acid metrics service.
(metastore.metrics.enabled and/or hive.server2.metrics.enabled are also
required to be set to true.) |
-| Compaction History | | | |
-| hive.compactor.history.retention.succeeded deprecated. Use
metastore.compactor.history.retention.succeeded instead | Default: 3 |
Metastore | Number of successful compaction entries to retain in history (per
partition). |
-| hive.compactor.history.retention.failed deprecated. Use
metastore.compactor.history.retention.failed instead. | Default: 3 | Metastore
| Number of failed compaction entries to retain in history (per partition). |
-| hive.compactor.history.retention.attempted deprecated. Use
metastore.compactor.history.retention.did.not.initiate instead. | Default: 2 |
Metastore | Number of attempted compaction entries to retain in history (per
partition). |
-| hive.compactor.initiator.failed.compacts.threshold deprecated. Use
metastore.compactor.initiator.failed.compacts.threshold instead. | Default: 2 |
Metastore | Number of consecutive failed compactions for a given partition
after which the Initiator will stop attempting to schedule compactions
automatically. It is still possible to use [ALTER
TABLE](/docs/latest/language/languagemanual-ddl#alter-tablepartition-compact)
to initiate compaction. Once a manually initiated compaction succe [...]
-| metastore.compactor.initiator.failed.compacts.threshold | *Default*: 2
(Allowed between 1 and 20) | Metastore | Number of consecutive compaction
failures (per table/partition) after which automatic compactions will not be
scheduled any more. Note that this must be less than
hive.compactor.history.retention.failed. |
-| hive.compactor.history.reaper.interval deprecated.
metastore.acid.housekeeper.interval handles it. | Default: 2m | Metastore |
Controls how often the process to purge historical record of compactions runs. |
-| ACID metrics | | | |
-| metastore.acidmetrics.check.interval | *Default*: 300s | Metastore | Time in
seconds between acid related metric collection runs. |
-| metastore.acidmetrics.thread.on | *Default:* true | Metastore | Whether to
run acid related metrics collection on this metastore instance. |
-| metastore.deltametrics.delta.num.threshold | *Default:* 100 | Metastore |
The minimum number of active delta files a table/partition must have in order
to be included in the ACID metrics report. |
-| metastore.deltametrics.delta.pct.threshold | *Default:* 0.01 | Metastore |
Percentage (fractional) size of the delta files relative to the base directory.
Deltas smaller than this threshold count as small deltas. Default 0.01 = 1%. |
-| metastore.deltametrics.max.cache.size | *Default:* 100 (Allowed between 0
and 500) | Metastore | Size of the ACID metrics cache, i.e. max number of
partitions and unpartitioned tables with the most deltas that will be included
in the lists of active, obsolete and small deltas. Allowed range is 0 to 500. |
-| metastore.deltametrics.obsolete.delta.num.threshold | *Default:* 100 |
Metastore | The minimum number of obsolete delta files a table/partition must
have in order to be included in the ACID metrics report. |
-
-1metastore.txn.max.open.batch controls how many transactions streaming agents
such as Flume or Storm open simultaneously. The streaming agent then writes
that number of entries into a single file (per Flume agent or Storm bolt).
Thus increasing this value decreases the number of delta files created by
streaming agents. But it also increases the number of open transactions that
Hive has to track at any given time, which may negatively affect read
performance.
-
- 2Worker threads spawn MapReduce jobs to do compactions. They do not do the
compactions themselves. Increasing the number of worker threads will decrease
the time it takes tables or partitions to be compacted once they are determined
to need compaction. It will also increase the background load on the Hadoop
cluster as more MapReduce jobs will be running in the background. Each
compaction can handle one partition at a time (or whole table if it's
unpartitioned).
-
-3Decreasing this value will reduce the time it takes for compaction to be
started for a table or partition that requires compaction. However, checking
if compaction is needed requires several calls to the NameNode for each table
or partition that has had a transaction done on it since the last major
compaction. So decreasing this value will increase the load on the NameNode.
-
-4If the compactor detects a very high number of delta files, it will first run
several partial minor compactions (currently sequentially) and then perform the
compaction actually requested.
-
-5If the value is not the same, active transactions may be determined to be
"timed out" and consequently Aborted. This will result in errors like "No such
transaction...", "No such lock ..."
-
-### Configuration Values to Set for Hive ACID (*INSERT, UPDATE, DELETE*)
-
-In addition to the new parameters listed above, some existing parameters need
to be set to support *INSERT ... VALUES*, *UPDATE*, and *DELETE*.
-
-| Configuration key | Must be set to |
-| --- | --- |
-| [hive.support.concurrency]({{< ref "#hive-support-concurrency" >}}) | true
(default is false) |
-| [hive.enforce.bucketing]({{< ref "#hive-enforce-bucketing" >}}) | true
(default is false) (Not required as of [Hive
2.0](https://issues.apache.org/jira/browse/HIVE-12331)) |
-| [hive.exec.dynamic.partition.mode]({{< ref
"#hive-exec-dynamic-partition-mode" >}}) | nonstrict (default is strict) |
-
-### Configuration Values to Set for Compaction
-
-If the data in your system is not owned by the Hive user (i.e., the user that
the Hive metastore runs as), then Hive will need permission to run as the user
who owns the data in order to perform compactions. If you have already set up
HiveServer2 to impersonate users, then the only additional work is to ensure
that Hive has the right to impersonate users from the host running the Hive
metastore. This is done by adding the hostname to
`hadoop.proxyuser.hive.hosts` in Hadoop's `core-s [...]
-
-### Compaction pooling
-
-More information on compaction pooling can be found here: [Compaction
pooling](/docs/latest/language/compaction-pooling)
-
-## Table Properties
-
-If a table is to be used in ACID writes (insert, update, delete) then the
table property "transactional=true" must be set on that table. Note, once a
table has been defined as an ACID table via TBLPROPERTIES
("transactional"="true"), it cannot be converted back to a non-ACID table,
i.e., changing TBLPROPERTIES ("transactional"="false") is not allowed. Also,
[hive.txn.manager]({{< ref "#hive-txn-manager" >}}) must be set to
org.apache.hadoop.hive.ql.lockmgr.DbTxnManager either in hive-sit [...]
-
-If a table owner does not wish the system to automatically determine when to
compact, then the table property "`NO_AUTO_COMPACTION`" can be set. This will
prevent all automatic compactions. Manual compactions can still be done with
[Alter Table/Partition Compact]({{< ref "#alter-table/partition-compact" >}})
statements.
-
-Table properties are set with the TBLPROPERTIES clause when a table is created
or altered, as described in the [Create Table]({{< ref "#create-table" >}}) and
[Alter Table Properties]({{< ref "#alter-table-properties" >}}) sections of
Hive Data Definition Language. The "`transactional`" and "`NO_AUTO_COMPACTION`"
table properties are case-insensitive.
-
-More compaction related options can be set via TBLPROPERTIES. They can be set
at both table-level via [CREATE
TABLE](/docs/latest/language/languagemanual-ddl#createdroptruncate-table), and
on request-level via [ALTER TABLE/PARTITION
COMPACT](/docs/latest/language/languagemanual-ddl#alter-tablepartition-compact).
These are used to override the Warehouse/table wide settings. For example,
to override an MR property to affect a compaction job, one can add
"compactor.<mr property name>=<val [...]
-
-**Example: Set compaction options in TBLPROPERTIES at table level**
-
-```
-CREATE TABLE table_name (
- id int,
- name string
-)
-CLUSTERED BY (id) INTO 2 BUCKETS STORED AS ORC
-TBLPROPERTIES ("transactional"="true",
- "compactor.mapreduce.map.memory.mb"="2048", -- specify compaction map job properties
- "compactorthreshold.hive.compactor.delta.num.threshold"="4", -- trigger minor compaction if there are more than 4 delta directories
- "compactorthreshold.hive.compactor.delta.pct.threshold"="0.5" -- trigger major compaction if the ratio of size of delta files to size of base files is greater than 50%
-);
-```
-
-**Example: Set compaction options in TBLPROPERTIES at request level**
-
-```
-ALTER TABLE table_name COMPACT 'minor'
- WITH OVERWRITE TBLPROPERTIES ("compactor.mapreduce.map.memory.mb"="3072"); -- specify compaction map job properties
-ALTER TABLE table_name COMPACT 'major'
- WITH OVERWRITE TBLPROPERTIES ("tblprops.orc.compress.size"="8192"); -- change any other Hive table properties
-```
-
-# Talks and Presentations
-
-[The Art of Compaction](https://youtu.be/h62Bhe78jW0?t=1953) by Kokila N at a
Cloudera meetup.
-
-Transactional Operations In Hive by Eugene Koifman at [Dataworks Summit 2017,
San Jose, CA, USA](https://dataworkssummit.com/san-jose-2017/agenda/)
-
-*
[Slides](https://www.slideshare.net/Hadoop_Summit/transactional-sql-in-apache-hive)
-* [Video](https://www.youtube.com/watch?v=Rk8irGDjpuI&feature=youtu.be)
-
-DataWorks Summit 2018, San Jose, CA, USA - Covers Hive 3 and ACID V2 features
-
-*
[Slides](https://www.slideshare.net/Hadoop_Summit/transactional-operations-in-apache-hive-present-and-future-102803358)
-* [Video](https://www.youtube.com/watch?v=GyzU9wG0cFQ&t=834s)
-
diff --git a/content/docs/latest/user/hive-transactions.md
b/content/docs/latest/user/hive-transactions.md
index f97aca10..30258c3e 100644
--- a/content/docs/latest/user/hive-transactions.md
+++ b/content/docs/latest/user/hive-transactions.md
@@ -156,34 +156,59 @@ A number of new configuration parameters have been added
to the system to suppor
| --- | --- | --- | --- |
| [hive.txn.manager]({{< ref "#hive-txn-manager" >}}) | *Default:*
org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager*Value required for
transactions:* org.apache.hadoop.hive.ql.lockmgr.DbTxnManager |
Client/HiveServer2 | DummyTxnManager replicates pre Hive-0.13 behavior and
provides no transactions. |
| [hive.txn.strict.locking.mode]({{< ref "#hive-txn-strict-locking-mode" >}})
| *Default:* true | Client/ HiveServer2 | In strict mode non-ACID resources use
standard R/W lock semantics, e.g. INSERT will acquire exclusive lock. In
non-strict mode, for non-ACID resources, INSERT will only acquire shared lock,
which allows two concurrent writes to the same partition but still lets lock
manager prevent DROP TABLE etc. when the table is being written to (as of [Hive
2.2.0](https://issues.apa [...]
-| [hive.txn.timeout]({{< ref "#hive-txn-timeout" >}}) | *Default:* 300 |
Client/HiveServer2/Metastore | Time after which transactions are declared
aborted if the client has not sent a heartbeat, in seconds. It's critical that
this property has the same value for all components/services.5 |
-|
[hive.txn.heartbeat.threadpool.size](/docs/latest/user/configuration-properties#hivetxnheartbeatthreadpoolsize)
| *Default:* 5 | Client/HiveServer2 | The number of threads to use for
heartbeating (as of [Hive 1.3.0 and
2.0.0](https://issues.apache.org/jira/browse/HIVE-12366)). |
-| [hive.timedout.txn.reaper.start]({{< ref "#hive-timedout-txn-reaper-start"
>}}) | *Default:* 100s | Metastore | Time delay of first reaper (the process
which aborts timed-out transactions) run after the metastore starts (as of
[Hive 1.3.0](https://issues.apache.org/jira/browse/HIVE-11317)). Controls
AcidHouseKeeperService above. |
-| [hive.timedout.txn.reaper.interval]({{< ref
"#hive-timedout-txn-reaper-interval" >}}) | *Default:* 180s | Metastore | Time
interval describing how often the reaper (the process which aborts timed-out
transactions) runs (as of [Hive
1.3.0](https://issues.apache.org/jira/browse/HIVE-11317)). Controls
AcidHouseKeeperService above. |
-| [hive.txn.max.open.batch]({{< ref "#hive-txn-max-open-batch" >}}) |
*Default:* 1000 | Client | Maximum number of transactions that can be fetched
in one call to open_txns().1 |
-| [hive.max.open.txns]({{< ref "#hive-max-open-txns" >}}) | *Default:* 100000
| HiveServer2/ Metastore | Maximum number of open transactions. If current open
transactions reach this limit, future open transaction requests will be
rejected, until the number goes below the limit. (As of [Hive 1.3.0 and
2.1.0](https://issues.apache.org/jira/browse/HIVE-13249).) |
-| [hive.count.open.txns.interval]({{< ref "#hive-count-open-txns-interval"
>}}) | *Default:* 1s | HiveServer2/ Metastore | Time in seconds between checks
to count open transactions (as of [Hive 1.3.0 and
2.1.0](https://issues.apache.org/jira/browse/HIVE-13249)). |
-| [hive.txn.retryable.sqlex.regex]({{< ref "#hive-txn-retryable-sqlex-regex"
>}}) | *Default:* "" (empty string) | HiveServer2/ Metastore | Comma separated
list of regular expression patterns for SQL state, error code, and error
message of retryable SQLExceptions, that's suitable for the Hive metastore
database (as of [Hive 1.3.0 and
2.1.0](https://issues.apache.org/jira/browse/HIVE-12637)). For an example, see
[Configuration Properties]({{< ref "#configuration-properties" >}}). |
-| [hive.compactor.initiator.on]({{< ref "#hive-compactor-initiator-on" >}}) |
*Default:* false*Value required for transactions:* true (for exactly one
instance of the Thrift metastore service) | Metastore | Whether to run the
initiator thread on this metastore instance. Prior to [Hive
1.3.0](https://issues.apache.org/jira/browse/HIVE-11388) it's critical that
this is enabled on exactly one standalone metastore service instance (not
enforced yet).As of [Hive 1.3.0](https://issues.apache.o [...]
-| [hive.compactor.cleaner.on]({{< ref "#hive-compactor-cleaner-on" >}}) |
*Default:* false*Value required for transactions:* true (for exactly one
instance of the Thrift metastore service) | Metastore | Whether to run the
cleaner thread on this metastore instance. Before **Hive 4.0.0** Cleaner thread
can be started/stopped with config hive.compactor.initiator.on. This config
helps to enable/disable initiator/cleaner threads independently |
-| [hive.compactor.worker.threads]({{< ref "#hive-compactor-worker-threads"
>}}) | *Default:* 0*Value required for transactions:* > 0 on at least one
instance of the Thrift metastore service | Metastore | How many compactor
worker threads to run on this metastore instance.2 |
-| [hive.compactor.worker.timeout]({{< ref "#hive-compactor-worker-timeout"
>}}) | *Default:* 86400 | Metastore | Time in seconds after which a compaction
job will be declared failed and the compaction re-queued. |
-| [hive.compactor.cleaner.run.interval]({{< ref
"#hive-compactor-cleaner-run-interval" >}}) | *Default*: 5000 | Metastore |
Time in milliseconds between runs of the cleaner thread. ([Hive
0.14.0](https://issues.apache.org/jira/browse/HIVE-8258) and later.) |
-| [hive.compactor.check.interval]({{< ref "#hive-compactor-check-interval"
>}}) | *Default:* 300 | Metastore | Time in seconds between checks to see if
any tables or partitions need to be compacted.3 |
+| [hive.txn.timeout]({{< ref "#hive-txn-timeout" >}}) deprecated. Use
metastore.txn.timeout instead | *Default:* 300 | Client/HiveServer2/Metastore
| Time after which transactions are declared aborted if the client has not sent
a heartbeat, in seconds. It's critical that this property has the same value
for all components/services.5 |
+|
[hive.txn.heartbeat.threadpool.size](/docs/latest/user/configuration-properties#hivetxnheartbeatthreadpoolsize)
deprecated - but still in use | *Default:* 5 | Client/HiveServer2 | The number
of threads to use for heartbeating (as of [Hive 1.3.0 and
2.0.0](https://issues.apache.org/jira/browse/HIVE-12366)). |
+| [hive.timedout.txn.reaper.start]({{< ref "#hive-timedout-txn-reaper-start"
>}}) deprecated | *Default:* 100s | Metastore | Time delay of first reaper (the
process which aborts timed-out transactions) run after the metastore starts (as
of [Hive 1.3.0](https://issues.apache.org/jira/browse/HIVE-11317)). Controls
AcidHouseKeeperService above. |
+| [hive.timedout.txn.reaper.interval]({{< ref
"#hive-timedout-txn-reaper-interval" >}}) deprecated | *Default:* 180s |
Metastore | Time interval describing how often the reaper (the process which
aborts timed-out transactions) runs (as of [Hive
1.3.0](https://issues.apache.org/jira/browse/HIVE-11317)). Controls
AcidHouseKeeperService above. |
+| [hive.txn.max.open.batch]({{< ref "#hive-txn-max-open-batch" >}})
deprecated. Use metastore.txn.max.open.batch instead | *Default:* 1000 | Client
| Maximum number of transactions that can be fetched in one call to
open_txns().1 |
+| [hive.max.open.txns]({{< ref "#hive-max-open-txns" >}}) deprecated. Use
metastore.max.open.txns instead. | *Default:* 100000 | HiveServer2/ Metastore |
Maximum number of open transactions. If current open transactions reach this
limit, future open transaction requests will be rejected, until the number goes
below the limit. (As of [Hive 1.3.0 and
2.1.0](https://issues.apache.org/jira/browse/HIVE-13249).) |
+| [hive.count.open.txns.interval]({{< ref "#hive-count-open-txns-interval"
>}}) deprecated. Use metastore.count.open.txns.interval instead. | *Default:*
1s | HiveServer2/ Metastore | Time in seconds between checks to count open
transactions (as of [Hive 1.3.0 and
2.1.0](https://issues.apache.org/jira/browse/HIVE-13249)). |
+| [hive.txn.retryable.sqlex.regex]({{< ref "#hive-txn-retryable-sqlex-regex"
>}}) deprecated. Use metastore.txn.retryable.sqlex.regex instead. | *Default:*
"" (empty string) | HiveServer2/ Metastore | Comma separated list of regular
expression patterns for SQL state, error code, and error message of retryable
SQLExceptions, that's suitable for the Hive metastore database (as of [Hive
1.3.0 and 2.1.0](https://issues.apache.org/jira/browse/HIVE-12637)). For an
example, see [Configuration Pr [...]
+| hive.compaction.merge.enabled | *Default:* false | HiveServer2 | Enables
merge-based compaction which is a compaction optimization when few ORC delta
files are present |
+| hive.compactor.initiator.duration.update.interval | *Default:* 60s |
HiveServer2 | Time in seconds that drives the update interval of the
compaction_initiator_duration metric. A smaller value results in a finer-grained
metric update. This updater can be turned off by setting its value to zero or
less. In this case the above metric will be updated only after the initiator
has completed one cycle. hive.compactor.initiator.on must be turned on (true)
in order to enable the Initiator, otherwise thi [...]
+| [hive.compactor.initiator.on]({{< ref "#hive-compactor-initiator-on" >}})
deprecated. Use metastore.compactor.initiator.on instead. | *Default:*
false*Value required for transactions:* true (for exactly one instance of the
Thrift metastore service) | Metastore | Whether to run the initiator thread on
this metastore instance. Prior to [Hive
1.3.0](https://issues.apache.org/jira/browse/HIVE-11388) it's critical that
this is enabled on exactly one standalone metastore service instance (no [...]
+| hive.compactor.cleaner.duration.update.interval | *Default:* 60s |
HiveServer2 | Time in seconds that drives the update interval of the
compaction_cleaner_duration metric. A smaller value results in a finer-grained
metric update. This updater can be turned off by setting its value to zero or
less. In this case the above metric will be updated only after the cleaner
has completed one cycle. |
+| [hive.compactor.cleaner.on]({{< ref "#hive-compactor-cleaner-on" >}})
deprecated. Use metastore.compactor.cleaner.on instead. | *Default:*
false*Value required for transactions:* true (for exactly one instance of the
Thrift metastore service) | Metastore | Whether to run the cleaner thread on
this metastore instance. Before **Hive 4.0.0** Cleaner thread can be
started/stopped with config hive.compactor.initiator.on. This config helps to
enable/disable initiator/cleaner threads independently |
+| hive.compactor.cleaner.threads.num | *Default:* 1 | HiveServer2 | Enables
parallelization of directory cleaning after compaction, which includes
many file-related checks and may be expensive. |
+| hive.compactor.compact.insert.only | *Default:* true | HiveServer2 | Whether
the compactor should compact insert-only tables. A safety switch. |
+| hive.compactor.crud.query.based | *Default*: false | HiveServer2 | Means
compaction on full CRUD tables is done via queries. Compactions on insert-only
tables will always run via queries regardless of the value of this
configuration. |
+| hive.compactor.gather.stats | *Default:* true | HiveServer2 | If set to true,
MAJOR compaction will gather stats if there are stats already associated with
the table/partition. Turn this off to save resources if the stats are not
used anyway. This is a replacement for the HIVE_MR_COMPACTOR_GATHER_STATS
config, and works for both MR and query-based compaction. |
+| metastore.compactor.initiator.failed.retry.time | *Default:* 7d | Metastore
| Time after which the Initiator will ignore
metastore.compactor.initiator.failed.compacts.threshold and retry
compaction. This will try to auto-heal tables with previously failed
compactions without manual intervention. Setting it to 0 or a negative value
will disable this feature. |
+| metastore.compactor.long.running.initiator.threshold.warning | *Default:* 6h
| Metastore | Initiator cycle duration after which a warning will be logged.
Default time unit is: hours |
+| metastore.compactor.long.running.initiator.threshold.error | *Default:* 12h
| Metastore | Initiator cycle duration after which an error will be logged.
Default time unit is: hours |
+| hive.compactor.worker.sleep.time | *Default:* 10800ms | HiveServer2 | Time in
milliseconds for which a worker thread sleeps before starting another
iteration when no job was launched or an error occurred. |
+| hive.compactor.worker.max.sleep.time | *Default:* 320000ms | HiveServer2 |
Maximum time in milliseconds for which a worker thread sleeps before
starting another iteration; used for backoff when no job was launched or an
error occurred. |
+| [hive.compactor.worker.threads]({{< ref "#hive-compactor-worker-threads"
>}}) deprecated. Use metastore.compactor.worker.threads instead. | *Default:*
0*Value required for transactions:* > 0 on at least one instance of the Thrift
metastore service | Metastore | How many compactor worker threads to run on
this metastore instance.2 |
+| [hive.compactor.worker.timeout]({{< ref "#hive-compactor-worker-timeout"
>}}) | *Default:* 86400s | Metastore | Time in seconds after which a compaction
job will be declared failed and the compaction re-queued. |
+| [hive.compactor.cleaner.run.interval]({{< ref
"#hive-compactor-cleaner-run-interval" >}}) | *Default*: 5000ms | Metastore |
Time in milliseconds between runs of the cleaner thread. ([Hive
0.14.0](https://issues.apache.org/jira/browse/HIVE-8258) and later.) |
+| [hive.compactor.check.interval]({{< ref "#hive-compactor-check-interval"
>}}) | *Default:* 300s | Metastore | Time in seconds between checks to see if
any tables or partitions need to be compacted.3 |
| [hive.compactor.delta.num.threshold]({{< ref
"#hive-compactor-delta-num-threshold" >}}) | *Default:* 10 | Metastore | Number
of delta directories in a table or partition that will trigger a minor
compaction. |
| [hive.compactor.delta.pct.threshold]({{< ref
"#hive-compactor-delta-pct-threshold" >}}) | *Default:* 0.1 | Metastore |
Percentage (fractional) size of the delta files relative to the base that will
trigger a major compaction. 1 = 100%, so the default 0.1 = 10%. |
| [hive.compactor.abortedtxn.threshold]({{< ref
"#hive-compactor-abortedtxn-threshold" >}}) | *Default:* 1000 | Metastore |
Number of aborted transactions involving a given table or partition that will
trigger a major compaction. |
| [hive.compactor.aborted.txn.time.threshold]({{< ref
"#hive-compactor-aborted-txn-time-threshold" >}}) | *Default*: 12h | Metastore
| Age of table/partition's oldest aborted transaction when compaction will be
triggered. Default time unit is: hours. Set to a negative number to disable. |
| [hive.compactor.max.num.delta]({{< ref "#hive-compactor-max-num-delta" >}})
| Default: 500 | Metastore | Maximum number of delta files that the compactor
will attempt to handle in a single job (as of [Hive
1.3.0](https://issues.apache.org/jira/browse/HIVE-11540)).4 |
| [hive.compactor.job.queue]({{< ref "#hive-compactor-job-queue" >}}) |
*Default*: "" (empty string) | Metastore | Used to specify name of Hadoop
queue to which Compaction jobs will be submitted. Set to empty string to let
Hadoop choose the queue (as of [Hive
1.3.0](https://issues.apache.org/jira/browse/HIVE-11997)). |
+| hive.compactor.request.queue | *Default:* 1 | HiveServer2 | Enables
parallelization of the checkForCompaction operation, which includes many file
metadata checks and may be expensive. |
+| hive.split.grouping.mode | *Default:* query (Allowed values: query,
compactor) | HiveServer2 | This is set to compactor from within the query based
compactor. This enables the Tez SplitGrouper to group splits based on their
bucket number, so that all rows from different bucket files for the same
bucket number can end up in the same bucket file after the compaction. |
+| hive.txn.xlock.iow | *Default:* true | HiveServer2 | Ensures commands with
OVERWRITE (such as INSERT OVERWRITE) acquire Exclusive locks for transactional
tables. This ensures that inserts (w/o overwrite) running concurrently are not
hidden by the INSERT OVERWRITE. |
+| hive.txn.xlock.write | *Default*: true | HiveServer2 | Manages concurrency
levels for ACID resources. Provides better level of query parallelism by
enabling shared writes and write-write conflict resolution at the commit step.-
If true - exclusive writes are used: - INSERT OVERWRITE acquires EXCLUSIVE
locks - UPDATE/DELETE acquire EXCL_WRITE locks - INSERT acquires SHARED_READ
locks- If false - shared writes, transaction is aborted in case of conflicting
changes: - INSERT OVERWRITE [...]
+| metastore.acidmetrics.ext.on | *Default:* true | HiveServer2 | Whether to
collect additional acid related metrics outside of the acid metrics service.
(metastore.metrics.enabled and/or hive.server2.metrics.enabled are also
required to be set to true.) |
| Compaction History |
-| hive.compactor.history.retention.succeeded | *Default: 3* | Metastore |
Number of successful compaction entries to retain in history (per partition). |
-| hive.compactor.history.retention.failed | *Default: 3* | Metastore | Number
of failed compaction entries to retain in history (per partition). |
-| hive.compactor.history.retention.attempted | *Default: 2* | Metastore |
Number of attempted compaction entries to retain in history (per partition). |
-| hive.compactor.initiator.failed.compacts.threshold | *Default: 2* |
Metastore | Number of consecutive failed compactions for a given partition
after which the Initiator will stop attempting to schedule compactions
automatically. It is still possible to use [ALTER
TABLE](/docs/latest/language/languagemanual-ddl#alter-tablepartition-compact)
to initiate compaction. Once a manually initiated compaction succeeds auto
initiated compactions will resume. Note that this must be less than hi [...]
-| hive.compactor.history.reaper.interval | *Default: 2m* | Metastore |
Controls how often the process to purge historical record of compactions runs. |
-
-1. hive.txn.max.open.batch controls how many transactions streaming agents
such as Flume or Storm open simultaneously. The streaming agent then writes
that number of entries into a single file (per Flume agent or Storm bolt).
Thus increasing this value decreases the number of delta files created by
streaming agents. But it also increases the number of open transactions that
Hive has to track at any given time, which may negatively affect read
performance.
+| hive.compactor.history.retention.succeeded deprecated. Use
metastore.compactor.history.retention.succeeded instead | Default: 3 |
Metastore | Number of successful compaction entries to retain in history (per
partition). |
+| hive.compactor.history.retention.failed deprecated. Use
metastore.compactor.history.retention.failed instead. | Default: 3 | Metastore
| Number of failed compaction entries to retain in history (per partition). |
+| hive.compactor.history.retention.attempted deprecated. Use
metastore.compactor.history.retention.did.not.initiate instead. | Default: 2 |
Metastore | Number of attempted compaction entries to retain in history (per
partition). |
+| hive.compactor.initiator.failed.compacts.threshold deprecated. Use
metastore.compactor.initiator.failed.compacts.threshold instead. | Default: 2 |
Metastore | Number of consecutive failed compactions for a given partition
after which the Initiator will stop attempting to schedule compactions
automatically. It is still possible to use [ALTER
TABLE](/docs/latest/language/languagemanual-ddl#alter-tablepartition-compact)
to initiate compaction. Once a manually initiated compaction succe [...]
+| metastore.compactor.initiator.failed.compacts.threshold | *Default*: 2
(Allowed between 1 and 20) | Metastore | Number of consecutive compaction
failures (per table/partition) after which automatic compactions will not be
scheduled any more. Note that this must be less than
hive.compactor.history.retention.failed. |
+| hive.compactor.history.reaper.interval deprecated.
metastore.acid.housekeeper.interval handles it. | Default: 2m | Metastore |
Controls how often the process to purge historical record of compactions runs. |
+| ACID metrics | | | |
+| metastore.acidmetrics.check.interval | *Default*: 300s | Metastore | Time in
seconds between acid related metric collection runs. |
+| metastore.acidmetrics.thread.on | *Default:* true | Metastore | Whether to
run acid related metrics collection on this metastore instance. |
+| metastore.deltametrics.delta.num.threshold | *Default:* 100 | Metastore |
The minimum number of active delta files a table/partition must have in order
to be included in the ACID metrics report. |
+| metastore.deltametrics.delta.pct.threshold | *Default:* 0.01 | Metastore |
Percentage (fractional) size of the delta files relative to the base directory.
Deltas smaller than this threshold count as small deltas. Default 0.01 = 1%. |
+| metastore.deltametrics.max.cache.size | *Default:* 100 (Allowed between 0
and 500) | Metastore | Size of the ACID metrics cache, i.e. max number of
partitions and unpartitioned tables with the most deltas that will be included
in the lists of active, obsolete and small deltas. Allowed range is 0 to 500. |
+| metastore.deltametrics.obsolete.delta.num.threshold | *Default:* 100 |
Metastore | The minimum number of obsolete delta files a table/partition must
have in order to be included in the ACID metrics report. |
+
+1. metastore.txn.max.open.batch controls how many transactions streaming
agents such as Flume or Storm open simultaneously. The streaming agent then
writes that number of entries into a single file (per Flume agent or Storm
bolt). Thus increasing this value decreases the number of delta files created
by streaming agents. But it also increases the number of open transactions
that Hive has to track at any given time, which may negatively affect read
performance.
2. Worker threads spawn MapReduce jobs to do compactions. They do not do the
compactions themselves. Increasing the number of worker threads will decrease
the time it takes tables or partitions to be compacted once they are determined
to need compaction. It will also increase the background load on the Hadoop
cluster as more MapReduce jobs will be running in the background. Each
compaction can handle one partition at a time (or whole table if it's
unpartitioned).
@@ -248,6 +273,8 @@ ALTER TABLE table_name COMPACT 'major'
# Talks and Presentations
+[The Art of Compaction](https://youtu.be/h62Bhe78jW0?t=1953) by Kokila N at a
Cloudera meetup.
+
Transactional Operations In Hive by Eugene Koifman at [Dataworks Summit 2017,
San Jose, CA, USA](https://dataworkssummit.com/san-jose-2017/agenda/)
*
[Slides](https://www.slideshare.net/Hadoop_Summit/transactional-sql-in-apache-hive)