This is an automated email from the ASF dual-hosted git repository.

lzljs3620320 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/paimon.git
The following commit(s) were added to refs/heads/master by this push:
     new b530c83d00 [doc] Add 'Data Skipping By File Index' for primary key table
b530c83d00 is described below

commit b530c83d00e39948c579c8e608e3045aefad072b
Author: JingsongLi <jingsongl...@gmail.com>
AuthorDate: Fri Aug 22 13:26:02 2025 +0800

    [doc] Add 'Data Skipping By File Index' for primary key table
---
 docs/content/append-table/query-performance.md     |  9 +--------
 .../content/primary-key-table/query-performance.md | 23 ++++++++++++++++++++++
 2 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/docs/content/append-table/query-performance.md b/docs/content/append-table/query-performance.md
index 101970e643..dbc80a1d35 100644
--- a/docs/content/append-table/query-performance.md
+++ b/docs/content/append-table/query-performance.md
@@ -60,14 +60,7 @@ You can take a look at [Flink COMPACT Action]({{< ref "maintenance/dedicated-com
 
 You can use file index too, it filters files by indexing on the reading side.
 
-```sql
-CREATE TABLE <PAIMON_TABLE> (<COLUMN> <COLUMN_TYPE> , ...) WITH (
-    'file-index.bloom-filter.columns' = 'c1,c2',
-    'file-index.bloom-filter.c1.items' = '200'
-);
-```
-
-Define `file-index.bloom-filter.columns`, Data file index is an external index file and Paimon will create its
+Define `file-index.bitmap.columns`, Data file index is an external index file and Paimon will create its
 corresponding index file for each file. If the index file is too small, it will be stored directly in the manifest,
 otherwise in the directory of the data file. Each data file corresponds to an index file, which has a separate file
 definition and can contain different types of indexes with multiple columns.
diff --git a/docs/content/primary-key-table/query-performance.md b/docs/content/primary-key-table/query-performance.md
index 2ba19b0d3d..7310103307 100644
--- a/docs/content/primary-key-table/query-performance.md
+++ b/docs/content/primary-key-table/query-performance.md
@@ -59,6 +59,29 @@ Min max query can be also accelerated during compilation and returns very quickl
 For a regular bucketed table (For example, bucket = 5), the filtering conditions of the primary key will greatly
 accelerate queries and reduce the reading of a large number of files.
 
+## Data Skipping By File Index
+
+For full-compacted file, or for primary-key table with `'deletion-vectors.enabled'`, you can use file index, it filters
+files by indexing on the reading side.
+
+Define `file-index.bitmap.columns`, Data file index is an external index file and Paimon will create its
+corresponding index file for each file. If the index file is too small, it will be stored directly in the manifest,
+otherwise in the directory of the data file. Each data file corresponds to an index file, which has a separate file
+definition and can contain different types of indexes with multiple columns.
+
+Different file indexes may be efficient in different scenarios. For example bloom filter may speed up query in point lookup
+scenario. Using a bitmap may consume more space but can result in greater accuracy.
+
+* [BloomFilter]({{< ref "concepts/spec/fileindex#index-bloomfilter" >}}): `file-index.bloom-filter.columns`.
+* [Bitmap]({{< ref "concepts/spec/fileindex#index-bitmap" >}}): `file-index.bitmap.columns`.
+* [Range Bitmap]({{< ref "concepts/spec/fileindex#index-range-bitmap" >}}): `file-index.range-bitmap.columns`.
+
+If you want to add file index to existing table, without any rewrite, you can use `rewrite_file_index` procedure. Before
+we use the procedure, you should config appropriate configurations in target table. You can use ALTER clause to config
+`file-index.<filter-type>.columns` to the table.
+
+How to invoke: see [flink procedures]({{< ref "flink/procedures#procedures" >}})
+
 ## Bucketed Join
 
 Fixed Bucketed table (e.g. bucket = 10) can be used to avoid shuffle if necessary in batch query, for example, you can
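
As a rough illustration of the workflow the added section describes (set `file-index.<filter-type>.columns` with an ALTER clause, then run the `rewrite_file_index` procedure), the steps might look like this in Flink SQL. This is a sketch, not part of the commit: the table name `my_db.orders` and the column `user_id` are hypothetical, and the exact procedure arguments should be checked against the Flink procedures page referenced in the diff.

```sql
-- Assumed names: my_db.orders and user_id are placeholders, not taken from the commit.
-- Step 1: declare which columns should get a bloom-filter file index.
ALTER TABLE my_db.orders SET ('file-index.bloom-filter.columns' = 'user_id');

-- Step 2: generate index files for the existing data files of that table.
CALL sys.rewrite_file_index('my_db.orders');
```

Per the added text, on a primary key table this applies only to full-compacted files or to tables with `'deletion-vectors.enabled'`.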