This is an automated email from the ASF dual-hosted git repository.
etudenhoefner pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/iceberg.git
The following commit(s) were added to refs/heads/main by this push:
new 396a8441c4 Docs: Add newline to fix lists (#9664)
396a8441c4 is described below
commit 396a8441c4154500733aad87688d9511f88ad9bb
Author: Manu Zhang <[email protected]>
AuthorDate: Tue Feb 6 20:21:57 2024 +0800
Docs: Add newline to fix lists (#9664)
---
docs/docs/configuration.md | 2 ++
docs/docs/delta-lake-migration.md | 2 ++
docs/docs/hive.md | 2 ++
docs/docs/metrics-reporting.md | 2 ++
docs/docs/spark-procedures.md | 3 +++
docs/docs/spark-queries.md | 13 ++++++++-----
docs/docs/spark-writes.md | 4 ++--
site/docs/how-to-release.md | 4 ++--
site/docs/spec.md | 2 ++
site/docs/view-spec.md | 2 ++
10 files changed, 27 insertions(+), 9 deletions(-)
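Reviewer note (not part of the commit): most hunks below insert a blank line between a paragraph and the list that follows it, presumably because the docs renderer does not start a list directly after a paragraph line. A minimal sketch of a checker for that pattern follows. The function name and the regular expression are hypothetical, not part of the Iceberg tooling, and the check is a heuristic: it also flags a list that directly follows a heading or a closing code fence.

```python
import re

# Matches a bullet ("*", "-", "+") or numbered ("1.") list item, with optional indentation.
LIST_ITEM = re.compile(r"^\s*(?:[*+-]|\d+\.)\s+\S")


def find_lists_missing_blank_line(text):
    """Return 1-based line numbers of list items that directly follow a non-list, non-blank line."""
    hits = []
    lines = text.splitlines()
    in_fence = False
    for i, line in enumerate(lines):
        # Toggle on fence markers so that code samples are not inspected.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        if in_fence or i == 0:
            continue
        prev = lines[i - 1]
        if LIST_ITEM.match(line) and prev.strip() and not LIST_ITEM.match(prev):
            hits.append(i + 1)
    return hits


before = "The HMS table locking is a 2-step process:\n1. Lock Creation\n2. Lock Check\n"
after = "The HMS table locking is a 2-step process:\n\n1. Lock Creation\n2. Lock Check\n"
print(find_lists_missing_blank_line(before))
print(find_lists_missing_blank_line(after))
```

Running it over the pre-commit text of a hunk should report the first list item, and the post-commit text should report nothing.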
diff --git a/docs/docs/configuration.md b/docs/docs/configuration.md
index 8dc9952767..5ff3e73795 100644
--- a/docs/docs/configuration.md
+++ b/docs/docs/configuration.md
@@ -157,6 +157,7 @@ Here are the catalog properties related to locking. They are used by some catalo
The following properties from the Hadoop configuration are used by the Hive
Metastore connector.
The HMS table locking is a 2-step process:
+
1. Lock Creation: Create lock in HMS and queue for acquisition
2. Lock Check: Check if lock successfully acquired
@@ -180,6 +181,7 @@ Hive Metastore before the lock is retried from Iceberg.
Warn: Setting `iceberg.engine.hive.lock-enabled`=`false` will cause
HiveCatalog to commit to tables without using Hive locks.
This should only be set to `false` if all following conditions are met:
+
- [HIVE-26882](https://issues.apache.org/jira/browse/HIVE-26882)
is available on the Hive Metastore server
- All other HiveCatalogs committing to tables that this HiveCatalog commits
to are also on Iceberg 1.3 or later
diff --git a/docs/docs/delta-lake-migration.md b/docs/docs/delta-lake-migration.md
index a7dfb0212f..e9ec81943e 100644
--- a/docs/docs/delta-lake-migration.md
+++ b/docs/docs/delta-lake-migration.md
@@ -36,6 +36,7 @@ The `iceberg-delta-lake` module is not bundled with Spark and Flink engine runti
### Compatibilities
The module is built and tested with `Delta Standalone:0.6.0` and supports
Delta Lake tables with the following protocol version:
+
* `minReaderVersion`: 1
* `minWriterVersion`: 2
@@ -44,6 +45,7 @@ Please refer to [Delta Lake Table Protocol Versioning](https://docs.delta.io/lat
### API
The `iceberg-delta-lake` module provides an interface named
`DeltaLakeToIcebergMigrationActionsProvider`, which contains actions that helps
converting from Delta Lake to Iceberg.
The supported actions are:
+
* `snapshotDeltaLakeTable`: snapshot an existing Delta Lake table to an
Iceberg table
### Default Implementation
diff --git a/docs/docs/hive.md b/docs/docs/hive.md
index 15e564f3d3..a8df154405 100644
--- a/docs/docs/hive.md
+++ b/docs/docs/hive.md
@@ -459,12 +459,14 @@ ALTER TABLE t set TBLPROPERTIES ('metadata_location'='<path>/hivemetadata/00003-
### SELECT
Select statements work the same on Iceberg tables in Hive. You will see the
Iceberg benefits over Hive in compilation and execution:
+
* **No file system listings** - especially important on blob stores, like S3
* **No partition listing from** the Metastore
* **Advanced partition filtering** - the partition keys are not needed in the
queries when they could be calculated
* Could handle **higher number of partitions** than normal Hive tables
Here are the features highlights for Iceberg Hive read support:
+
1. **Predicate pushdown**: Pushdown of the Hive SQL `WHERE` clause has been
implemented so that these filters are used at the Iceberg `TableScan` level as
well as by the Parquet and ORC Readers.
2. **Column projection**: Columns from the Hive SQL `SELECT` clause are
projected down to the Iceberg readers to reduce the number of columns read.
3. **Hive query engines**:
diff --git a/docs/docs/metrics-reporting.md b/docs/docs/metrics-reporting.md
index e667398d73..3a83e1baec 100644
--- a/docs/docs/metrics-reporting.md
+++ b/docs/docs/metrics-reporting.md
@@ -26,6 +26,7 @@ As of 1.1.0 Iceberg supports the [`MetricsReporter`](../../javadoc/{{ icebergVer
### ScanReport
A [`ScanReport`](../../javadoc/{{ icebergVersion
}}/org/apache/iceberg/metrics/ScanReport.html) carries metrics being collected
during scan planning against a given table. Amongst some general information
about the involved table, such as the snapshot id or the table name, it
includes metrics like:
+
* total scan planning duration
* number of data/delete files included in the result
* number of data/delete manifests scanned/skipped
@@ -35,6 +36,7 @@ A [`ScanReport`](../../javadoc/{{ icebergVersion }}/org/apache/iceberg/metrics/S
### CommitReport
A [`CommitReport`](../../javadoc/{{ icebergVersion
}}/org/apache/iceberg/metrics/CommitReport.html) carries metrics being
collected after committing changes to a table (aka producing a snapshot).
Amongst some general information about the involved table, such as the snapshot
id or the table name, it includes metrics like:
+
* total duration
* number of attempts required for the commit to succeed
* number of added/removed data/delete files
diff --git a/docs/docs/spark-procedures.md b/docs/docs/spark-procedures.md
index 0dc1f1738c..6b3cb06c3a 100644
--- a/docs/docs/spark-procedures.md
+++ b/docs/docs/spark-procedures.md
@@ -459,6 +459,7 @@ CALL catalog_name.system.rewrite_manifests('db.sample', false);
### `rewrite_position_delete_files`
Iceberg can rewrite position delete files, which serves two purposes:
+
* Minor Compaction: Compact small position delete files into larger ones.
This reduces the size of metadata stored in manifest files and overhead of
opening small delete files.
* Remove Dangling Deletes: Filter out position delete records that refer to
data files that are no longer live. After rewrite_data_files, position delete
records pointing to the rewritten data files are not always marked for removal,
and can remain tracked by the table's live snapshot metadata. This is known as
the 'dangling delete' problem.
@@ -760,6 +761,7 @@ Creates a view that contains the changes from a given table.
| `identifier_columns` | | array<string> | The list of
identifier columns to compute updates. If the argument `compute_updates` is set
to true and `identifier_columns` are not provided, the table’s current
identifier fields will be used. |
Here is a list of commonly used Spark read options:
+
* `start-snapshot-id`: the exclusive start snapshot ID. If not provided, it
reads from the table’s first snapshot inclusively.
* `end-snapshot-id`: the inclusive end snapshot id, default to table's current
snapshot.
* `start-timestamp`: the exclusive start timestamp. If not provided, it reads
from the table’s first snapshot inclusively.
@@ -807,6 +809,7 @@ SELECT * FROM tbl_changes where _change_type = 'INSERT' AND id = 3 ORDER BY _cha
```
Please note that the changelog view includes Change Data Capture(CDC) metadata columns
that provide additional information about the changes being tracked. These columns are:
+
- `_change_type`: the type of change. It has one of the following values:
`INSERT`, `DELETE`, `UPDATE_BEFORE`, or `UPDATE_AFTER`.
- `_change_ordinal`: the order of changes
- `_commit_snapshot_id`: the snapshot ID where the change occurred
diff --git a/docs/docs/spark-queries.md b/docs/docs/spark-queries.md
index 4086b0049f..536c136d7e 100644
--- a/docs/docs/spark-queries.md
+++ b/docs/docs/spark-queries.md
@@ -295,11 +295,11 @@ SELECT * FROM prod.db.table.files;
| 1 |
s3:/.../table/data/00081-4-a9aa8b24-20bc-4d56-93b0-6b7675782bb5-00001-deletes.parquet
| PARQUET | 0 | 1 | 1560 | {2147483545:46,2147483546:152} |
{2147483545:1,2147483546:1} | {2147483545:0,2147483546:0} | {} |
{2147483545:,2147483546:s3:/.../table/data/00000-0-f9709213-22ca-4196-8733-5cb15d2afeb9-00001.parquet}
|
{2147483545:,2147483546:s3:/.../table/data/00000-0-f9709213-22ca-4196-8733-5cb15d2afeb9-00001.parquet}
| NULL | [4] | NULL | NULL | {"data":{"column_size":null,"value_cou [...]
| 2 |
s3:/.../table/data/00047-25-833044d0-127b-415c-b874-038a4f978c29-00612.parquet
| PARQUET | 0 | 126506 | 28613985 | {100:135377,101:11314} |
{100:126506,101:126506} | {100:105434,101:11} | {} | {100:0,101:17} |
{100:404455227527,101:23} | NULL | NULL | [1] | 0 |
{"id":{"column_size":135377,"value_count":126506,"null_value_count":105434,"nan_value_count":null,"lower_bound":0,"upper_bound":404455227527},"data":{"column_size":11314,"value_count":126506,"null_value_count":
11,"nan_value [...]
-!!!info
- Content refers to type of content stored by the data file:
- 0 Data
- 1 Position Deletes
- 2 Equality Deletes
+!!! info
+ Content refers to type of content stored by the data file:
+ * 0 Data
+ * 1 Position Deletes
+ * 2 Equality Deletes
To show only data files or delete files, query `prod.db.table.data_files` and
`prod.db.table.delete_files` respectively.
To show all files, data files and delete files across all tracked snapshots,
query `prod.db.table.all_files`, `prod.db.table.all_data_files` and
`prod.db.table.all_delete_files` respectively.
@@ -317,6 +317,7 @@ SELECT * FROM prod.db.table.manifests;
| s3://.../table/metadata/45b5290b-ee61-4788-b324-b1e2735c0e10-m0.avro | 4479
| 0 | 6668963634911763636 | 8 | 0
| 0 |
[[false,null,2019-05-13,2019-05-15]] |
Note:
+
1. Fields within `partition_summaries` column of the manifests table
correspond to `field_summary` structs within [manifest
list](../../spec.md#manifest-lists), with the following order:
- `contains_null`
- `contains_nan`
@@ -341,6 +342,7 @@ SELECT * FROM prod.db.table.partitions;
| {20211002, 10} | 0 | 3 | 2 | 400
| 0 | 0 | 1
| 1 | 1633169159489000 |
6941468797545315876 |
Note:
+
1. For unpartitioned tables, the partitions table will not contain the
partition and spec_id fields.
2. The partitions metadata table shows partitions with data files or delete
files in the current snapshot. However, delete files are not applied, and so in
some cases partitions may be shown even though all their data rows are marked
deleted by delete files.
@@ -416,6 +418,7 @@ SELECT * FROM prod.db.table.all_manifests;
| s3://.../metadata/a85f78c5-3222-4b37-b7e4-faf944425d48-m0.avro | 6376 | 0 |
6272782676904868561 | 2 | 0 | 0 |[{false, false, 20210101, 20210101}]|
Note:
+
1. Fields within `partition_summaries` column of the manifests table
correspond to `field_summary` structs within [manifest
list](../../spec.md#manifest-lists), with the following order:
- `contains_null`
- `contains_nan`
diff --git a/docs/docs/spark-writes.md b/docs/docs/spark-writes.md
index eff80a3931..7d2f093e88 100644
--- a/docs/docs/spark-writes.md
+++ b/docs/docs/spark-writes.md
@@ -310,11 +310,11 @@ While inserting or updating Iceberg is capable of resolving schema mismatch at r
* A new column is present in the source but not in the target table.
- The new column is added to the target table. Column values are set to `NULL` in all the rows already present in the table
+ The new column is added to the target table. Column values are set to `NULL` in all the rows already present in the table
* A column is present in the target but not in the source.
- The target column value is set to `NULL` when inserting or left unchanged when updating the row.
+ The target column value is set to `NULL` when inserting or left unchanged when updating the row.
The target table must be configured to accept any schema change by setting the
property `write.spark.accept-any-schema` to `true`.
diff --git a/site/docs/how-to-release.md b/site/docs/how-to-release.md
index 8a774cc6ee..c775a718c4 100644
--- a/site/docs/how-to-release.md
+++ b/site/docs/how-to-release.md
@@ -267,7 +267,7 @@ svn ci -m 'Iceberg: Add release <VERSION>'
```
!!! Note
-The above step requires PMC privileges to execute.
+ The above step requires PMC privileges to execute.
Next, add a release tag to the git repository based on the passing candidate
tag:
@@ -472,7 +472,7 @@ repositories {
```
!!! Note
-Replace `${MAVEN_URL}` with the URL provided in the release announcement
+ Replace `${MAVEN_URL}` with the URL provided in the release announcement
### Verifying with Spark
diff --git a/site/docs/spec.md b/site/docs/spec.md
index 9223bafda3..2ff625b765 100644
--- a/site/docs/spec.md
+++ b/site/docs/spec.md
@@ -222,6 +222,7 @@ Any struct, including a top-level schema, can evolve through deleting fields, ad
Grouping a subset of a struct’s fields into a nested struct is **not**
allowed, nor is moving fields from a nested struct into its immediate parent
struct (`struct<a, b, c> ↔ struct<a, struct<b, c>>`). Evolving primitive types
to structs is **not** allowed, nor is evolving a single-field struct to a
primitive (`map<string, int> ↔ map<string, struct<int>>`).
Struct evolution requires the following rules for default values:
+
* The `initial-default` must be set when a field is added and cannot change
* The `write-default` must be set when a field is added and may change
* When a required field is added, both defaults must be set to a non-null value
@@ -1217,6 +1218,7 @@ This serialization scheme is for storing single values as individual binary valu
### Version 3
Default values are added to struct fields in v3.
+
* The `write-default` is a forward-compatible change because it is only used
at write time. Old writers will fail because the field is missing.
* Tables with `initial-default` will be read correctly by older readers if
`initial-default` is always null for optional fields. Otherwise, old readers
will default optional columns with null. Old readers will fail to read required
fields which are populated by `initial-default` because that default is not
supported.
diff --git a/site/docs/view-spec.md b/site/docs/view-spec.md
index d50405cfe0..9c6ba3413b 100644
--- a/site/docs/view-spec.md
+++ b/site/docs/view-spec.md
@@ -65,6 +65,7 @@ The view version metadata file has the following fields:
| _optional_ | `properties` | A string to string map of view
properties [2] |
Notes:
+
1. The number of versions to retain is controlled by the table property:
`version.history.num-entries`.
2. Properties are used for metadata such as `comment` and for settings that
affect view maintenance. This is not intended to be used for arbitrary metadata.
@@ -103,6 +104,7 @@ A view version can have more than one representation. All representations for a
View versions are immutable. Once a version is created, it cannot be changed.
This means that representations for a version cannot be changed. If a view
definition changes (or new representations are to be added), a new version must
be created.
Each representation is an object with at least one common field, `type`, that
is one of the following:
+
* `sql`: a SQL SELECT statement that defines the view
Representations further define metadata for each type.