This is an automated email from the ASF dual-hosted git repository.
zabetak pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/hive-site.git
The following commit(s) were added to refs/heads/main by this push:
new 971efd7d Fix some "Raw HTML omitted" warnings and formatting issues (part 2) (#99)
971efd7d is described below
commit 971efd7db933161a7b05aa5624ddf2a0d94dc579
Author: Thomas Rebele <[email protected]>
AuthorDate: Tue Jun 9 09:32:48 2026 +0200
Fix some "Raw HTML omitted" warnings and formatting issues (part 2) (#99)
---
.../hive-across-multiple-data-centers.md | 14 +++++-----
.../desingdocs/hive-metadata-caching-proposal.md | 31 +++++++---------------
.../desingdocs/hivereplicationv2development.md | 26 +++++++++---------
content/Development/desingdocs/indexdev.md | 2 +-
.../Development/desingdocs/subqueries-in-select.md | 2 +-
.../support-saml-2-0-authentication-mode.md | 12 +++++++++
.../desingdocs/type-qualifiers-in-hive.md | 6 ++---
content/Development/gettingstarted-latest.md | 2 +-
.../docs/latest/admin/adminmanual-configuration.md | 14 +++++-----
.../adminmanual-metastore-3-0-administration.md | 2 +-
.../admin/adminmanual-metastore-administration.md | 16 +++++------
.../latest/admin/hive-on-spark-getting-started.md | 6 ++---
.../docs/latest/admin/setting-up-hiveserver2.md | 4 +--
13 files changed, 67 insertions(+), 70 deletions(-)
diff --git a/content/Development/desingdocs/hive-across-multiple-data-centers.md b/content/Development/desingdocs/hive-across-multiple-data-centers.md
index 47ba7f8f..9354660e 100644
--- a/content/Development/desingdocs/hive-across-multiple-data-centers.md
+++ b/content/Development/desingdocs/hive-across-multiple-data-centers.md
@@ -84,10 +84,10 @@ been imposed to simplify the problem:
The same idea can be extended for partitioned tables.
* The user can also decide to run in a particular cluster.
- + Use cluster <ClusterName>
+ + Use cluster `<ClusterName>`
* The system will not make an attempt to choose the cluster for the user, but only try to figure out if the query can be run
- in the <clusterName>. If the query can run in this cluster, it will succeed. Otherwise, it will fail.
+ in the `<clusterName>`. If the query can run in this cluster, it will succeed. Otherwise, it will fail.
* The user can go back to the behavior to use the default cluster.
+ Use cluster
@@ -101,7 +101,7 @@ The same idea can be extended for partitioned tables.
PrimaryCluster - ClusterStorageDescriptor
- and SecondaryClusters - Set<ClusterStorageDescriptor>
+ and SecondaryClusters - Set<ClusterStorageDescriptor>>
The ClusterStorageDescriptor contains the following:
@@ -128,12 +128,12 @@ The existing thrift API's will continue to work as if the user is trying to acce
New APIs will be added which take the cluster as a new parameter. Almost all the existing APIs will be
-enhanced to support this. The behavior will be the same as if, the user issued the command 'USE CLUSTER <CLUSTERNAME>
+enhanced to support this. The behavior will be the same as if, the user issued the command `USE CLUSTER <CLUSTERNAME>`
* A new parameter will be added to keep the filesystem and jobtrackers for a cluster
- + hive.cluster.properties: This will be json - ClusterName -> <FileSystem, JobTracker>
- + use cluster <cluster name> will fail if <cluster name> is not present hive.cluster.properties
- + The other option was to support create cluster <> etc. but that would have required storing the cluster information in the
+ + hive.cluster.properties: This will be json - ClusterName -> <FileSystem, JobTracker>
+ + use cluster `<cluster name>` will fail if `<cluster name>` is not present hive.cluster.properties
+ + The other option was to support create cluster `<>` etc. but that would have required storing the cluster information in the
metastore including jobtracker etc. which would be difficult to change per session.
diff --git a/content/Development/desingdocs/hive-metadata-caching-proposal.md b/content/Development/desingdocs/hive-metadata-caching-proposal.md
index 712757e3..3deaf115 100644
--- a/content/Development/desingdocs/hive-metadata-caching-proposal.md
+++ b/content/Development/desingdocs/hive-metadata-caching-proposal.md
@@ -63,57 +63,44 @@ Presto has the following cache:
+ userTablePrivileges
* Range scan cache
-+ databaseNamesCache: regex -> database names, facilitates database search
++ databaseNamesCache: regex -> database names, facilitates database search
+ tableNamesCache
+ viewNamesCache
-+ partitionNamesCache: table name -> partition names
++ partitionNamesCache: table name -> partition names
* Other
-+ partitionFilterCache: PS -> partition names, facilitates partition pruning
++ partitionFilterCache: PS -> partition names, facilitates partition pruning
For every partition filter condition, Presto breaks it down into tupleDomain and remainder:
+```
AddExchanges.planTableScan:
-
DomainTranslator.ExtractionResult decomposedPredicate =
DomainTranslator.fromPredicate(
-
metadata,
-
session,
-
deterministicPredicate,
-
types);
-
public static class ExtractionResult
-
{
-
private final TupleDomain<Symbol> tupleDomain;
-
private final Expression remainingExpression;
-
}
+```
-tupleDomain is a mapping of column -> range or exact value. When converting to PS, any range will be converted into wildcard and only exact value will be considered:
+tupleDomain is a mapping of column -> range or exact value. When converting to PS, any range will be converted into wildcard and only exact value will be considered:
+```
HivePartitionManager.getFilteredPartitionNames:
-
for (HiveColumnHandle partitionKey : partitionKeys) {
-
if (domain != null && domain.isNullableSingleValue()) {
-
filter.add(((Slice) value).toStringUtf8());
-
else {
-
filter.add(PARTITION_VALUE_WILDCARD);
-
}
-
}
+```
-For example, the expression “state = CA and date between ‘201612’ and ‘201701’ will be broken down to PS (state = CA) and remainder date between ‘201612’ and ‘201701’. Presto will retrieve the partitions with state = CA from the PS -> partition name cache and partition object cache, and evaluates “date between ‘201612’ and ‘201701’ for every partitions returned. This is a good balance compare to caching partition names for every expression.
+For example, the expression “state = CA and date between ‘201612’ and ‘201701’ will be broken down to PS (state = CA) and remainder date between ‘201612’ and ‘201701’. Presto will retrieve the partitions with state = CA from the PS -> partition name cache and partition object cache, and evaluates “date between ‘201612’ and ‘201701’ for every partitions returned. This is a good balance compare to caching partition names for every expression.
## Our Approach
diff --git a/content/Development/desingdocs/hivereplicationv2development.md b/content/Development/desingdocs/hivereplicationv2development.md
index 936198a9..9380e6ac 100644
--- a/content/Development/desingdocs/hivereplicationv2development.md
+++ b/content/Development/desingdocs/hivereplicationv2development.md
@@ -168,7 +168,7 @@ Event 100: ALTER TABLE tbl ADD PARTITION (p=1) SET LOCATION <location>;
Event 110: ALTER TABLE tbl DROP PARTITION (p=1);
Event 120: ALTER TABLE tbl ADD PARTITION (p=1) SET LOCATION <location>;
```
-When loading the dump on the destination side (at a much later point), when the event 100 is replayed, the load task on the destination will try to pull the files from the <location> (the _files contains the path of <location>), which may contain new or different data. To replicate the exact state of the source at the time event 100 occurred at the source, we do the following:
+When loading the dump on the destination side (at a much later point), when the event 100 is replayed, the load task on the destination will try to pull the files from the `<location>` (the _files contains the path of `<location>`), which may contain new or different data. To replicate the exact state of the source at the time event 100 occurred at the source, we do the following:
1. When Event 100 occurs at the source, in the notification event, we store the checksum of the file(s) in the newly added partition along with the file path(s).
2. When Event 110 occurs at the source, we move the files of the dropped partition to $cmroot/database/tbl/p=1 instead of purging them.
@@ -212,7 +212,9 @@ The current implementation of replication is built upon existing commands EXPORT
This is better described via various examples of each of the pieces of the command syntax, as follows:
-(a) REPL DUMP sales; REPL DUMP sales.['.*?']Replicates out sales database for bootstrap, from <init-evid>=0 (bootstrap case) to <end-evid>=<CURR-EVID> with a batch size of 0, i.e. no batching.
+(a) REPL DUMP sales; REPL DUMP sales.['.*?']
+
+Replicates out sales database for bootstrap, from `<init-evid>=0` (bootstrap case) to `<end-evid>=<CURR-EVID>` with a batch size of 0, i.e. no batching.`
(b) REPL DUMP sales.['T3', '[a-z]+'];
@@ -228,15 +230,15 @@ This sets up db-level replication that excludes all the tables/views but include
(e) REPL DUMP sales FROM 200 TO 1400;
-The presence of a FROM <init-evid> tag makes this dump not a bootstrap, but a dump which looks at the event log to produce a delta dump. FROM 200 TO 1400 is self-evident in that it will go through event ids 200 to 1400 looking for events from the relevant db.
+The presence of a FROM `<init-evid>` tag makes this dump not a bootstrap, but a dump which looks at the event log to produce a delta dump. FROM 200 TO 1400 is self-evident in that it will go through event ids 200 to 1400 looking for events from the relevant db.
(f) REPL DUMP sales FROM 200;
-Similar to above, but with an implicit assumed <end-evid> as being the current event id at the time the command is run.
+Similar to above, but with an implicit assumed `<end-evid>` as being the current event id at the time the command is run.
(g) REPL DUMP sales FROM 200 to 1400 LIMIT 100;REPL DUMP sales FROM 200 LIMIT 100;
-Similar to cases (d) & (e), with the addition of a batch size of <num-evids>=100. This causes us to stop processing if we reach 100 events, and return at that point. Note that this does not mean that we stop processing at event id = 300, since we began at 200 - it means that we will stop processing events when we have processed 100 events in the event stream (that has unrelated events) belonging to this replication-definition, i.e. of a relevant db or db.table, then we stop.
+Similar to cases (d) & (e), with the addition of a batch size of `<num-evids>=100`. This causes us to stop processing if we reach 100 events, and return at that point. Note that this does not mean that we stop processing at event id = 300, since we began at 200 - it means that we will stop processing events when we have processed 100 events in the event stream (that has unrelated events) belonging to this replication-definition, i.e. of a relevant db or db.table, then we stop.
(h) REPL DUMP sales.['[a-z]+'] REPLACE sales FROM 200;
@@ -258,8 +260,8 @@ The REPL DUMP command has an optional WITH clause to set command-specific confi
1. Error codes returned as return error codes (and over jdbc if with HS2)
2. Returns 2 columns in the ResultSet:
- 1. <dir-name> - the directory to which it has dumped info.
- 2. <last-evid> - the last event-id associated with this dump, which might be the end-evid, or the curr-evid, as the case may be.
+ 1. `<dir-name>` - the directory to which it has dumped info.
+ 2. `<last-evid>` - the last event-id associated with this dump, which might be the end-evid, or the curr-evid, as the case may be.
#### Note:
@@ -275,20 +277,18 @@ When bootstrap dump is in progress, it blocks rename table/partition operations
Look up the HiveServer logs for below pair of log messages.
-> REPL DUMP:: Set property for Database: <db_name>, Property: <bootstrap.dump.state.xxxx>, Value: ACTIVE
->
-> REPL DUMP:: Reset property for Database: <db_name>, Property: <bootstrap.dump.state.xxxx>
->
+> REPL DUMP:: Set property for Database: `<db_name>`, Property: `<bootstrap.dump.state.xxxx>`, Value: ACTIVE
>
+> REPL DUMP:: Reset property for Database: `<db_name>`, Property: `<bootstrap.dump.state.xxxx>`
-If Reset property log is not found for the corresponding Set property log, then user need to manually reset the database property <bootstrap.dump.state.xxxx> with value as "IDLE" using ALTER DATABASE command.
+If Reset property log is not found for the corresponding Set property log, then user need to manually reset the database property `<bootstrap.dump.state.xxxx>` with value as "IDLE" using ALTER DATABASE command.
## REPL LOAD
`REPL LOAD {<dbname>} FROM <dirname> {WITH ('key1'='value1', 'key2'='value2')};`
-This causes a REPL DUMP present in <dirname> (which is to be a fully qualified
HDFS URL) to be pulled and loaded. If <dbname> is specified, and the original
dump was a database-level dump, this allows Hive to do db-rename-mapping on
import. If dbname is not specified, the original dbname as recorded in the dump
would be used.The REPL LOAD command has an optional WITH clause to set
command-specific configurations to be used when trying to copy from the source
cluster. These configurations [...]
+This causes a REPL DUMP present in `<dirname>` (which is to be a fully
qualified HDFS URL) to be pulled and loaded. If `<dbname>` is specified, and
the original dump was a database-level dump, this allows Hive to do
db-rename-mapping on import. If dbname is not specified, the original dbname as
recorded in the dump would be used.The REPL LOAD command has an optional WITH
clause to set command-specific configurations to be used when trying to copy
from the source cluster. These configurat [...]
#### Return values:
diff --git a/content/Development/desingdocs/indexdev.md b/content/Development/desingdocs/indexdev.md
index a24afa97..5379e13c 100644
--- a/content/Development/desingdocs/indexdev.md
+++ b/content/Development/desingdocs/indexdev.md
@@ -281,7 +281,7 @@ TBD: we will be adding methods for calling the handler when an index is dropped
The reference implementation creates what is referred to as a "compact" index.
This means that rather than storing the HDFS location of each occurrence of a
particular value, it only stores the addresses of HDFS blocks containing that
value. This is optimized for point-lookups in the case where a value typically
occurs more than once in nearby rows; the index size is kept small since there
are many fewer blocks than rows. The tradeoff is that extra work is required
during queries in orde [...]
-The compact index is stored in an index table. The index table columns consist of the indexed columns from the base table followed by a _bucketname string column (indicating the name of the file containing the indexed block) followed by an _offsets array<string> column (indicating the block offsets within the corresponding file). The index table is stored as sorted on the indexed columns (but not on the generated columns).
+The compact index is stored in an index table. The index table columns consist of the indexed columns from the base table followed by a _bucketname string column (indicating the name of the file containing the indexed block) followed by an `_offsets array<string>` column (indicating the block offsets within the corresponding file). The index table is stored as sorted on the indexed columns (but not on the generated columns).
The reference implementation can be plugged in with
diff --git a/content/Development/desingdocs/subqueries-in-select.md b/content/Development/desingdocs/subqueries-in-select.md
index 9c7d50ab..d1cee53e 100644
--- a/content/Development/desingdocs/subqueries-in-select.md
+++ b/content/Development/desingdocs/subqueries-in-select.md
@@ -79,7 +79,7 @@ SELECT customer.customer_num,
) AS total_ship_chg
FROM customer
```
-* Subqueries with DISTINCT are not allowed. Since DISTINCT <expression> will be evaluated as GROUP BY <expression>, subqueries with DISTINCT are disallowed for now.
+* Subqueries with DISTINCT are not allowed. Since `DISTINCT <expression>` will be evaluated as `GROUP BY <expression>`, subqueries with `DISTINCT` are disallowed for now.
# Design
diff --git a/content/Development/desingdocs/support-saml-2-0-authentication-mode.md b/content/Development/desingdocs/support-saml-2-0-authentication-mode.md
index 5bd9c3ea..29b23f33 100644
--- a/content/Development/desingdocs/support-saml-2-0-authentication-mode.md
+++ b/content/Development/desingdocs/support-saml-2-0-authentication-mode.md
@@ -50,45 +50,57 @@ In order to make sure that the SAML assertions received by HiveServer2 are valid
Following new configurations will be added to the hive-site.xml which would need to be configured by the clients.
+```
<property>
<name>hive.server2.authentication</name>
<value>SAML</value>
</property>
+```
This configuration will be set to SAML to indicate that the server will use
SAML 2.0 protocol to authenticate the user.
+```
<property>
<name>hive.server2.saml2.idp.metadata</name>
<value>path_to_idp_metadata.xml</value>
</property>
+```
This configuration will provide a path to the IDP metadata xml file.
+```
<property>
<name>hive.server2.saml2.sp.entity.id</name>
<value>test_sp_entity_id</value>
</property>
+```
This configuration should be same the service provider entity id as configured
in the IDP. Some identity providers require this to be same as the ACS URL.
+```
<property>
<name>hive.server2.saml2.group.attribute.name</name>
<value>group_attribute_name</value>
</property>
+```
This configuration will be used to map the SAML attribute in the response to
the groups of the user. This should be configured in the identity provider as
the attribute name for the group information.
+```
<property>
<name>hive.server2.saml2.group.filter</name>
<value>comma_separated_group_names</value>
</property>
+```
This configuration will be used to configure the allowed group names.
+```
<property>
<name>hive.server2.saml2.sp.callback.url</name>
<value>callback_url_of_hiveserver2</value>
</property>
+```
The http URL endpoint where the SAML assertion is posted back by the IDP.
Currently this must be on the same port as HiveServer2’s http endpoint and must
be TLS enabled (https) on secure setups.
diff --git a/content/Development/desingdocs/type-qualifiers-in-hive.md b/content/Development/desingdocs/type-qualifiers-in-hive.md
index eed01d3a..ddd0cdd3 100644
--- a/content/Development/desingdocs/type-qualifiers-in-hive.md
+++ b/content/Development/desingdocs/type-qualifiers-in-hive.md
@@ -39,16 +39,14 @@ The type qualifiers could simply be stored as part of the type string for a colu
This approach would be similar to the attributes in the INFORMATION_SCHEMA.COLUMNS that some DBMS catalog tables have, such as those listed below:
-<pre>
-
+```
| CHARACTER_MAXIMUM_LENGTH | bigint(21) unsigned | YES | | NULL | |
| CHARACTER_OCTET_LENGTH | bigint(21) unsigned | YES | | NULL | |
| NUMERIC_PRECISION | bigint(21) unsigned | YES | | NULL | |
| NUMERIC_SCALE | bigint(21) unsigned | YES | | NULL | |
| CHARACTER_SET_NAME | varchar(32) | YES | | NULL | |
| COLLATION_NAME | varchar(32) | YES | | NULL | |
-
-</pre>
+```
We could add new columns to the COLUMNS_V2 table for any type qualifiers we
are trying to support (initially looks like CHARACTER_MAXIMUM_LENGTH,
NUMERIC_PRECISION, NUMERIC_SCALE). Advantages to this would be that it is
easier to query these parameters than the first approach, though types with no
parameters would still have these columns (set to null).
diff --git a/content/Development/gettingstarted-latest.md b/content/Development/gettingstarted-latest.md
index 72b2184a..e0712ef4 100644
--- a/content/Development/gettingstarted-latest.md
+++ b/content/Development/gettingstarted-latest.md
@@ -77,7 +77,7 @@ To build the current Hive code from the master branch:
Here, {version} refers to the current Hive version.
-If building Hive source using Maven (mvn), we will refer to the directory "/packaging/target/apache-hive-{version}-SNAPSHOT-bin/apache-hive-{version}-SNAPSHOT-bin" as <install-dir> for the rest of the page.
+If building Hive source using Maven (mvn), we will refer to the directory "/packaging/target/apache-hive-{version}-SNAPSHOT-bin/apache-hive-{version}-SNAPSHOT-bin" as `<install-dir>` for the rest of the page.
#### Compile Hive on branch-1
diff --git a/content/docs/latest/admin/adminmanual-configuration.md b/content/docs/latest/admin/adminmanual-configuration.md
index b7dd3a55..732a8816 100644
--- a/content/docs/latest/admin/adminmanual-configuration.md
+++ b/content/docs/latest/admin/adminmanual-configuration.md
@@ -43,7 +43,7 @@ The server-specific configuration file is useful in two situations:
If HiveServer2 is using the metastore in embedded mode, hivemetastore-site.xml also is loaded.
The order of precedence of the config files is as follows (later one has higher precedence) –
- hive-site.xml -> hivemetastore-site.xml -> hiveserver2-site.xml -> '`-hiveconf`' commandline parameters.
+ hive-site.xml -> hivemetastore-site.xml -> hiveserver2-site.xml -> '`-hiveconf`' commandline parameters.
### hive-site.xml and hive-default.xml.template
@@ -61,8 +61,8 @@ The administrative configuration variables are listed [below]({{< ref "#below" >
Hive uses temporary folders both on the machine running the Hive client and
the default HDFS instance. These folders are used to store per-query
temporary/intermediate data sets and are normally cleaned up by the hive client
when the query is finished. However, in cases of abnormal hive client
termination, some data may be left behind. The configuration details are as
follows:
-* On the HDFS cluster this is set to */tmp/hive-<username>* by default and is controlled by the configuration variable *hive.exec.scratchdir*
-* On the client machine, this is hardcoded to */tmp/<username>*
+* On the HDFS cluster this is set to `*/tmp/hive-<username>*` by default and is controlled by the configuration variable *hive.exec.scratchdir*
+* On the client machine, this is hardcoded to `*/tmp/<username>*`
Note that when writing data to a table/partition, Hive will first write to a
temporary location on the target table's filesystem (using hive.exec.scratchdir
as the temporary location) and then move the data to the target table. This
applies in all cases - whether tables are stored in HDFS (normal case) or in
file systems like S3 or even NFS.
@@ -98,9 +98,9 @@ Version information: Metrics
| hive.ddl.output.format | The data format to use for DDL output (e.g.
`DESCRIBE table`). One of "text" (for human readable text) or "json" (for a
json object). (As of Hive
[0.9.0](https://issues.apache.org/jira/browse/HIVE-2822).) | text |
| hive.exec.script.wrapper | Wrapper around any invocations to script operator
e.g. if this is set to python, the script passed to the script operator will be
invoked as `python <script command>`. If the value is null or not set, the
script is invoked as `<script command>`. | null |
| hive.exec.plan | | null |
-| hive.exec.scratchdir | This directory is used by Hive to store the plans for
different map/reduce stages for the query as well as to stored the intermediate
outputs of these stages.*Hive 0.14.0 and later:* HDFS root scratch directory
for Hive jobs, which gets created with write all
([733](https://issues.apache.org/jira/browse/HIVE-8143)) permission. For each
connecting user, an HDFS scratch directory ${hive.exec.scratchdir}/<username>
is created with ${hive.scratch.dir.permission}. | / [...]
+| hive.exec.scratchdir | This directory is used by Hive to store the plans for
different map/reduce stages for the query as well as to stored the intermediate
outputs of these stages.*Hive 0.14.0 and later:* HDFS root scratch directory
for Hive jobs, which gets created with write all
([733](https://issues.apache.org/jira/browse/HIVE-8143)) permission. For each
connecting user, an HDFS scratch directory `${hive.exec.scratchdir}/<username>`
is created with ${hive.scratch.dir.permission}. | [...]
| hive.scratch.dir.permission | The permission for the user-specific scratch
directories that get created in the root scratch directory
${hive.exec.scratchdir}. (As of Hive
[0.12.0](https://issues.apache.org/jira/browse/HIVE-4487).) | 700 (Hive 0.12.0
and later) |
-| hive.exec.local.scratchdir | This directory is used for temporary files when
Hive runs in local mode. (As of Hive
[0.10.0](https://issues.apache.org/jira/browse/HIVE-1577).) | /tmp/<user.name> |
+| hive.exec.local.scratchdir | This directory is used for temporary files when
Hive runs in local mode. (As of Hive
[0.10.0](https://issues.apache.org/jira/browse/HIVE-1577).) |
`/tmp/<user.name>` |
| hive.exec.submitviachild | Determines whether the map/reduce jobs should be
submitted through a separate jvm in the non local mode. | false - By default
jobs are submitted through the same jvm as the compiler |
| hive.exec.script.maxerrsize | Maximum number of serialization errors allowed
in a user script invoked through `TRANSFORM` or `MAP` or `REDUCE` constructs. |
100000 |
| hive.exec.compress.output | Determines whether the output of the final
map/reduce job in a query is compressed or not. | false |
@@ -119,7 +119,7 @@ Version information: Metrics
| hive.merge.size.per.task | Size of merged files at the end of the job. |
256000000 |
| hive.merge.smallfiles.avgsize | When the average output file size of a job
is less than this number, Hive will start an additional map-reduce job to merge
the output files into bigger files. This is only done for map-only jobs if
hive.merge.mapfiles is true, and for map-reduce jobs if hive.merge.mapredfiles
is true. | 16000000 |
| hive.querylog.enable.plan.progress | Whether to log the plan's progress
every time a job's progress is checked. These logs are written to the location
specified by `hive.querylog.location`. (As of Hive
[0.10](https://issues.apache.org/jira/browse/HIVE-3230).) | true |
-| hive.querylog.location | Directory where structured hive query logs are
created. One file per session is created in this directory. If this variable
set to empty string structured log will not be created. | /tmp/<user.name> |
+| hive.querylog.location | Directory where structured hive query logs are
created. One file per session is created in this directory. If this variable
set to empty string structured log will not be created. | `/tmp/<user.name>` |
| hive.querylog.plan.progress.interval | The interval to wait between logging
the plan's progress in milliseconds. If there is a whole number percentage
change in the progress of the mappers or the reducers, the progress is logged
regardless of this value. The actual interval will be the ceiling of (this
value divided by the value of `hive.exec.counters.pull.interval`) multiplied by
the value of `hive.exec.counters.pull.interval` i.e. if it is not divide evenly
by the value of `hive.exec [...]
| hive.stats.autogather | A flag to gather statistics automatically during the
INSERT OVERWRITE command. (As of Hive
[0.7.0](https://issues.apache.org/jira/browse/HIVE-1361).) | true |
| hive.stats.dbclass | The default database that stores temporary hive
statistics. Valid values are `hbase` and `jdbc` while `jdbc` should have a
specification of the Database to use, separated by a colon (e.g. `jdbc:mysql`).
(As of Hive [0.7.0](https://issues.apache.org/jira/browse/HIVE-1361).) |
jdbc:derby |
@@ -142,7 +142,7 @@ For security configuration (Hive 0.10 and later), see the [Hive Metastore Securi
| --- | --- | --- |
| hadoop.bin.path | The location of the Hadoop script which is used to submit
jobs to Hadoop when submitting through a separate JVM. |
$HADOOP_HOME/bin/hadoop |
| hadoop.config.dir | The location of the configuration directory of the
Hadoop installation. | $HADOOP_HOME/conf |
-| fs.default.name | The default name of the filesystem (for example, localhost for hdfs://<clustername>:8020).For YARN this configuration variable is called fs.defaultFS. | file:/// |
+| fs.default.name | The default name of the filesystem (for example, localhost for `hdfs://<clustername>:8020`).For YARN this configuration variable is called fs.defaultFS. | file:/// |
| map.input.file | The filename the map is reading from. | null |
| mapred.job.tracker | The URL to the jobtracker. If this is set to local then
map/reduce is run in the local mode. | local |
| mapred.reduce.tasks | The number of reducers for each map/reduce stage in
the query plan. | 1 |
diff --git a/content/docs/latest/admin/adminmanual-metastore-3-0-administration.md b/content/docs/latest/admin/adminmanual-metastore-3-0-administration.md
index 3f103bc4..886b812b 100644
--- a/content/docs/latest/admin/adminmanual-metastore-3-0-administration.md
+++ b/content/docs/latest/admin/adminmanual-metastore-3-0-administration.md
@@ -103,7 +103,7 @@ To run the Metastore as a service, you must first configure it with a URL.
| Configured On | Parameter | Hive 2 Parameter | Format | Default Value | Comment |
| --- | --- | --- | --- | --- | --- |
-| Client | metastore.thrift.uris | hive.metastore.uris | thrift://<HOST>:<PORT>[, thrift://<HOST>:<PORT>...] | none | HOST = hostname, PORT = should be set to match metastore.thrift.port on the server (which defaults to 9083. You can provide multiple servers in a comma separate list. |
+| Client | metastore.thrift.uris | hive.metastore.uris | `thrift://<HOST>:<PORT>[, thrift://<HOST>:<PORT>...]` | none | HOST = hostname, PORT = should be set to match metastore.thrift.port on the server (which defaults to 9083. You can provide multiple servers in a comma separate list. |
| Server | metastore.thrift.port | hive.metastore.port | integer | 9083 | Port
Thrift will listen on. |
Once you have configured your clients, you can start the Metastore on a server
using the `start-metastore` utility. See the `-help` option of that utility
for available options. There is no stop-metastore script. You must locate the
process id for the metastore and kill that process.
diff --git a/content/docs/latest/admin/adminmanual-metastore-administration.md b/content/docs/latest/admin/adminmanual-metastore-administration.md
--- a/content/docs/latest/admin/adminmanual-metastore-administration.md
+++ b/content/docs/latest/admin/adminmanual-metastore-administration.md
@@ -141,7 +141,7 @@ The following example uses a[Remote Metastore Database]({{< ref "#remote-metasto
| javax.jdo.option.ConnectionUserName | `<user name>` | user name for
connecting to MySQL server |
| javax.jdo.option.ConnectionPassword | `<password>` | password for connecting
to MySQL server |
| hive.metastore.warehouse.dir | `<base hdfs path>` | default location for
Hive tables. |
-| hive.metastore.thrift.bind.host | <host_name> | Host name to bind the metastore service to. When empty, "localhost" is used. This configuration is available Hive 4.0.0 onwards. |
+| hive.metastore.thrift.bind.host | `<host_name>` | Host name to bind the metastore service to. When empty, "localhost" is used. This configuration is available Hive 4.0.0 onwards. |
From Hive 3.0.0
([HIVE-16452](https://issues.apache.org/jira/browse/HIVE-16452)) onwards the
metastore database stores a GUID which can be queried using the Thrift API
get_metastore_db_uuid by metastore clients in order to identify the backend
database instance. This API can be accessed by the HiveMetaStoreClient using
the method getMetastoreDbUuid().
@@ -162,13 +162,13 @@ From Hive 4.0.0 ([HIVE-20794](https://issues.apache.org/jira/browse/HIVE-20794))
| Config Param | Config Value | Comment |
| --- | --- | --- |
| hive.metastore.service.discovery.mode | service discovery mode | When it is set to "zookeeper", ZooKeeper is used for dynamic service discovery of a remote metastore. In that case, a metastore adds itself to the ZooKeeper when it is started and removes itself when it shuts down. By default it is empty. Both the client and server should have same value for this parameter. |
-| hive.metastore.uris | <host_name>:<port>, <host_name>:<port>, ... | One or more host and port pairs of ZooKeeper servers forming a ZooKeeper ensemble. Used when hive.metastore.service.discovery.mode is set to "zookeeper". The configuration is not used by server otherwise. If all the servers are using the same port you may specify the port using hive.metastore.zookeeper.client.port instead of specifying it with every server separately. Both the client and server should have same value f [...]
-| hive.metastore.zookeeper.client.port | <port> | Port number when same port number is used by all the ZooKeeper servers in the ensemble. Both the client and server should have same value for this parameter. |
-| hive.metastore.zookeeper.namespace | <namespace name> | The parent node under which all ZooKeeper nodes for metastores are created. |
-| hive.metastore.zookeeper.session.timeout | <time in milliseconds> | ZooKeeper client's session timeout (in milliseconds). The client is disconnected if a heartbeat is not sent in the timeout. |
-| hive.metastore.zookeeper.connection.timeout | <time in seconds> | ZooKeeper client's connection timeout in seconds. Connection timeout * hive.metastore.zookeeper.connection.max.retries with exponential backoff is when curator client deems connection is lost to zookeeper. |
-| hive.metastore.zookeeper.connection.max.retries | <number> | Max number of times to retry when connecting to the ZooKeeper server. |
-| hive.metastore.zookeeper.connection.basesleeptime | <time in milliseconds> | Initial amount of time (in milliseconds) to wait between retries when connecting to the ZooKeeper server when using ExponentialBackoffRetry policy. |
+| hive.metastore.uris | `<host_name>:<port>, <host_name>:<port>, ...` | One or more host and port pairs of ZooKeeper servers forming a ZooKeeper ensemble. Used when hive.metastore.service.discovery.mode is set to "zookeeper". The configuration is not used by server otherwise. If all the servers are using the same port you may specify the port using hive.metastore.zookeeper.client.port instead of specifying it with every server separately. Both the client and server should have same value [...]
+| hive.metastore.zookeeper.client.port | `<port>` | Port number when same port number is used by all the ZooKeeper servers in the ensemble. Both the client and server should have same value for this parameter. |
+| hive.metastore.zookeeper.namespace | `<namespace name>` | The parent node under which all ZooKeeper nodes for metastores are created. |
+| hive.metastore.zookeeper.session.timeout | `<time in milliseconds>` | ZooKeeper client's session timeout (in milliseconds). The client is disconnected if a heartbeat is not sent in the timeout. |
+| hive.metastore.zookeeper.connection.timeout | `<time in seconds>` | ZooKeeper client's connection timeout in seconds. Connection timeout * hive.metastore.zookeeper.connection.max.retries with exponential backoff is when curator client deems connection is lost to zookeeper. |
+| hive.metastore.zookeeper.connection.max.retries | `<number>` | Max number of times to retry when connecting to the ZooKeeper server. |
+| hive.metastore.zookeeper.connection.basesleeptime | `<time in milliseconds>` | Initial amount of time (in milliseconds) to wait between retries when connecting to the ZooKeeper server when using ExponentialBackoffRetry policy. |
diff --git a/content/docs/latest/admin/hive-on-spark-getting-started.md b/content/docs/latest/admin/hive-on-spark-getting-started.md
index cee3ddb5..b0e4667b 100644
--- a/content/docs/latest/admin/hive-on-spark-getting-started.md
+++ b/content/docs/latest/admin/hive-on-spark-getting-started.md
@@ -41,7 +41,7 @@ For the installation perform the following tasks:
1. Install Spark (either download pre-built Spark, or build assembly from
source).
* Install/build a compatible version. Hive root `pom.xml`'s
<spark.version> defines what version of Spark it was built/tested with.
* Install/build a compatible distribution. Each version of Spark has
several distributions, corresponding with different versions of Hadoop.
- * Once Spark is installed, find and keep note of the <spark-assembly-*.jar> location.
+ * Once Spark is installed, find and keep note of the `<spark-assembly-*.jar>` location.
* Note that you must have a version of Spark which does **not** include
the Hive jars. Meaning one which was not built with the Hive profile. If you
will use Parquet tables, it's recommended to also enable the "parquet-provided"
profile. Otherwise there could be conflicts in Parquet dependency. To remove
Hive jars from the installation, simply use the following command under your
Spark repository:
Prior to Spark 2.0.0:
@@ -68,7 +68,7 @@ For the installation perform the following tasks:
./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz
"-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided"
```
2. Start Spark cluster
- * Keep note of the <Spark Master URL>. This can be found in Spark master WebUI.
+ * Keep note of the `<Spark Master URL>`. This can be found in Spark master WebUI.
## Configuring YARN
@@ -175,7 +175,7 @@ On this 9 node cluster we’ll have two executors per host. As such we can confi
| org.apache.spark.SparkException: Job aborted due to stage failure: Task
5.0:0 had a not serializable result: java.io.NotSerializableException:
org.apache.hadoop.io.BytesWritable | Spark serializer not set to Kryo. | Set
spark.serializer to be org.apache.spark.serializer.KryoSerializer, see Step 3
[above]({{< ref "#above" >}}). |
| [ERROR] Terminal initialization failed; falling back to
unsupportedjava.lang.IncompatibleClassChangeError: Found class jline.Terminal,
but interface was expected | Hive has upgraded to Jline2 but jline 0.94 exists
in the Hadoop lib. | 1. Delete jline from the Hadoop lib directory (it's only
pulled in transitively from ZooKeeper). 2. export
HADOOP_USER_CLASSPATH_FIRST=true 3. If this error occurs during mvn test,
perform a mvn clean install on the root project and itests directory. |
| Spark executor gets killed all the time and Spark keeps retrying the failed
stage; you may find similar information in the YARN nodemanager log.WARN
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Container [pid=217989,containerID=container_1421717252700_0716_01_50767235] is
running beyond physical memory limits. Current usage: 43.1 GB of 43 GB physical
memory used; 43.9 GB of 90.3 GB virtual memory used. Killing container. | For
Spark on YARN, [...]
-| Run query and get an error like:FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTaskIn Hive logs, it shows:java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy at org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:79) | Happens on Mac (not officially supported).This is a general Snappy issue with Mac and is not unique to Hive on Spark, but workaround is noted here because it is needed for startup of [...]
+| Run query and get an error like:FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTaskIn Hive logs, it shows:java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy at org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:79) | Happens on Mac (not officially supported).This is a general Snappy issue with Mac and is not unique to Hive on Spark, but workaround is noted here because it is needed for start [...]
| Stack trace: ExitCodeException exitCode=1: .../launch_container.sh: line 27:
$PWD:$PWD/__spark__.jar:$HADOOP_CONF_DIR.../usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure:$PWD/__app__.jar:$PWD/*:
bad substitution | The key mapreduce.application.classpath in
/etc/hadoop/conf/mapred-site.xml contains a variable which is invalid in bash.
| From **mapreduce.application.classpath** remove `
:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${h [...]
| Exception in thread "Driver" scala.MatchError:
java.lang.NoClassDefFoundError: org/apache/hadoop/mapreduce/TaskAttemptContext
(of class java.lang.NoClassDefFoundError) at
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:432)
| MR is not on the YARN classpath. | If on HDP change from
**/hdp/apps/${hdp.version}/mapreduce/mapreduce.tar.gz#mr-framework** to
**/hdp/apps/2.2.0.0-2041/mapreduce/mapreduce.tar.gz#mr-framework** |
| java.lang.OutOfMemoryError: PermGen space with spark.master=local | By
default ([SPARK-1879](https://issues.apache.org/jira/browse/SPARK-1879)),
Spark's own launch scripts increase PermGen to 128 MB, so we need to increase
PermGen in hive launch script. | If use JDK7, append following in
conf/hive-env.sh: ` export HADOOP_OPTS="$HADOOP_OPTS -XX:MaxPermSize=128m" ` If
use JDK8, append following in Conf/hive-env.sh: ` export
HADOOP_OPTS="$HADOOP_OPTS -XX:MaxMetaspaceSize=512m" ` |
diff --git a/content/docs/latest/admin/setting-up-hiveserver2.md b/content/docs/latest/admin/setting-up-hiveserver2.md
index eff94db0..aa05bc5a 100644
--- a/content/docs/latest/admin/setting-up-hiveserver2.md
+++ b/content/docs/latest/admin/setting-up-hiveserver2.md
@@ -166,7 +166,7 @@ Use the following steps to create and verify self-signed SSL certificates for us
3. Export this certificate from keystore.jks to a certificate file: keytool
-export -alias example.com -file example.com.crt -keystore keystore.jks
4. Add this certificate to the client's truststore to establish trust: keytool
-import -trustcacerts -alias example.com -file example.com.crt -keystore
truststore.jks
5. Verify that the certificate exists in truststore.jks: keytool -list
-keystore truststore.jks
-6. Then start HiveServer2, and try to connect with beeline using: jdbc:hive2://<host>:<port>/<database>;ssl=true;sslTrustStore=<path-to-truststore>;trustStorePassword=<truststore-password>
+6. Then start HiveServer2, and try to connect with beeline using: `jdbc:hive2://<host>:<port>/<database>;ssl=true;sslTrustStore=<path-to-truststore>;trustStorePassword=<truststore-password>`
##### Selectively disabling SSL protocol versions
@@ -187,7 +187,7 @@ Warning
Support is provided for PAM (Hive 0.13 onward, see
[HIVE-6466](https://issues.apache.org/jira/browse/HIVE-6466)). To configure PAM:
* Download the
[JPAM](http://sourceforge.net/projects/jpam/files/jpam/jpam-1.1/) native
library for the relevant architecture.
-* Unzip and copy libjpam.so to a directory (<libjmap-directory>) on the system.
+* Unzip and copy libjpam.so to a directory (`<libjmap-directory>`) on the system.
* Add the directory to the LD_LIBRARY_PATH environment variable like
so:`export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<libjmap-directory>`
* For some PAM modules, you'll have to ensure that your `/etc/shadow` and
`/etc/login.defs` files are readable by the user running the HiveServer2
process.