This is an automated email from the ASF dual-hosted git repository.

lzljs3620320 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/paimon.git

The following commit(s) were added to refs/heads/master by this push:
     new 53b3c2c62f [doc] Optimize doc for file index
53b3c2c62f is described below

commit 53b3c2c62f18f99b2b06a052ce278f290448d363
Author: JingsongLi <jingsongl...@gmail.com>
AuthorDate: Mon Aug 18 13:01:32 2025 +0800

    [doc] Optimize doc for file index
---
 docs/content/append-table/query-performance.md     | 50 ++--------------------
 docs/content/concepts/spec/fileindex.md            | 39 ++++++++++++++++-
 .../apache/paimon/spark/PaimonScanBuilder.scala    |  4 --
 3 files changed, 41 insertions(+), 52 deletions(-)

diff --git a/docs/content/append-table/query-performance.md b/docs/content/append-table/query-performance.md
index 7c15455b3d..101970e643 100644
--- a/docs/content/append-table/query-performance.md
+++ b/docs/content/append-table/query-performance.md
@@ -75,53 +75,9 @@ definition and can contain different types of indexes with multiple columns.
 Different file indexes may be efficient in different scenarios. For example bloom filter may speed up query in point lookup
 scenario. Using a bitmap may consume more space but can result in greater accuracy.
 
-`Bloom Filter`:
-* `file-index.bloom-filter.columns`: specify the columns that need bloom filter index.
-* `file-index.bloom-filter.<column_name>.fpp` to config false positive probability.
-* `file-index.bloom-filter.<column_name>.items` to config the expected distinct items in one data file.
-
-`Bitmap`:
-* `file-index.bitmap.columns`: specify the columns that need bitmap index. See [Index Bitmap]({{< ref "concepts/spec/fileindex#index-bitmap" >}}).
-
-`Range Bitmap Index Bitmap`
-* `file-index.range-bitmap.columns`: specify the columns that need range-bitmap index. See [Index Range Bitmap]({{< ref "concepts/spec/fileindex#index-range-bitmap" >}}).
-
-
-Append Table supports using range-bitmap file index to optimize the `EQUALS`, `RANGE`, `AND/OR` and `TOPN` predicate. The bitmap and range-bitmap file index result will be merged and pushed down to the DataFile for filtering rowgroups and pages.
-
-In the following query examples, the `class_id` and the `score` has been created with range-bitmap file index. And the partition key `dt` is not necessary.
-
-**Optimize the `EQUALS` predicate:**
-```sql
-SELECT * FROM TABLE WHERE dt = '20250801' AND score = 100;
-
-SELECT * FROM TABLE WHERE dt = '20250801' AND score IN (60, 80);
-```
-
-**Optimize the `RANGE` predicate:**
-```sql
-SELECT * FROM TABLE WHERE dt = '20250801' AND score > 60;
-
-SELECT * FROM TABLE WHERE dt = '20250801' AND score < 60;
-```
-
-**Optimize the `AND/OR` predicate:**
-```sql
-SELECT * FROM TABLE WHERE dt = '20250801' AND class_id = 1 AND score < 60;
-
-SELECT * FROM TABLE WHERE dt = '20250801' AND class_id = 1 AND score < 60 OR score > 80;
-```
-
-**Optimize the `TOPN` predicate:**
-
-For now, the `TOPN` predicate optimization can not using with other predicates, only support in Apache Spark.
-```sql
-SELECT * FROM TABLE WHERE dt = '20250801' ORDER BY score ASC LIMIT 10;
-
-SELECT * FROM TABLE WHERE dt = '20250801' ORDER BY score DESC LIMIT 10;
-```
-
-More filter types will be supported...
+* [BloomFilter]({{< ref "concepts/spec/fileindex#index-bloomfilter" >}}): `file-index.bloom-filter.columns`.
+* [Bitmap]({{< ref "concepts/spec/fileindex#index-bitmap" >}}): `file-index.bitmap.columns`.
+* [Range Bitmap]({{< ref "concepts/spec/fileindex#index-range-bitmap" >}}): `file-index.range-bitmap.columns`.
 
 If you want to add file index to existing table, without any rewrite, you can use `rewrite_file_index` procedure. Before
 we use the procedure, you should config appropriate configurations in target table. You can use ALTER clause to config
diff --git a/docs/content/concepts/spec/fileindex.md b/docs/content/concepts/spec/fileindex.md
index b041139a8e..76ded93945 100644
--- a/docs/content/concepts/spec/fileindex.md
+++ b/docs/content/concepts/spec/fileindex.md
@@ -85,7 +85,10 @@ BODY: column index bytes + column index bytes + colu
 
 ## Index: BloomFilter
 
-Define `'file-index.bloom-filter.columns'`.
+Options are:
+* `file-index.bloom-filter.columns`: specify the columns that need bloom filter index.
+* `file-index.bloom-filter.<column_name>.fpp` to config false positive probability.
+* `file-index.bloom-filter.<column_name>.items` to config the expected distinct items in one data file.
 
 Content of bloom filter index is simple:
 - numHashFunctions 4 bytes int, BIG_ENDIAN
@@ -232,6 +235,40 @@ Options:
 * `file-index.range-bitmap.columns`: specify the columns that need range-bitmap index.
 * `file-index.range-bitmap.<column_name>.chunk-size`: to config the chunk size, default value is 16kb.
 
+Table supports using range-bitmap file index to optimize the `EQUALS`, `RANGE`, `AND/OR` and `TOPN` predicate. The bitmap and range-bitmap file index result will be merged and pushed down to the DataFile for filtering rowgroups and pages.
+
+In the following query examples, the `class_id` and the `score` has been created with range-bitmap file index. And the partition key `dt` is not necessary.
+
+**Optimize the `EQUALS` predicate:**
+```sql
+SELECT * FROM TABLE WHERE dt = '20250801' AND score = 100;
+
+SELECT * FROM TABLE WHERE dt = '20250801' AND score IN (60, 80);
+```
+
+**Optimize the `RANGE` predicate:**
+```sql
+SELECT * FROM TABLE WHERE dt = '20250801' AND score > 60;
+
+SELECT * FROM TABLE WHERE dt = '20250801' AND score < 60;
+```
+
+**Optimize the `AND/OR` predicate:**
+```sql
+SELECT * FROM TABLE WHERE dt = '20250801' AND class_id = 1 AND score < 60;
+
+SELECT * FROM TABLE WHERE dt = '20250801' AND class_id = 1 AND score < 60 OR score > 80;
+```
+
+**Optimize the `TOPN` predicate:**
+
+For now, the `TOPN` predicate optimization can not using with other predicates, only support in Apache Spark.
+```sql
+SELECT * FROM TABLE WHERE dt = '20250801' ORDER BY score ASC LIMIT 10;
+
+SELECT * FROM TABLE WHERE dt = '20250801' ORDER BY score DESC LIMIT 10;
+```
+
 <pre>
 Range Bitmap file index format (V1)
 +-------------------------------------------------+-----------------
diff --git a/paimon-spark/paimon-spark-common/src/main/scala/org/apache/paimon/spark/PaimonScanBuilder.scala b/paimon-spark/paimon-spark-common/src/main/scala/org/apache/paimon/spark/PaimonScanBuilder.scala
index 899c204718..729613f596 100644
--- a/paimon-spark/paimon-spark-common/src/main/scala/org/apache/paimon/spark/PaimonScanBuilder.scala
+++ b/paimon-spark/paimon-spark-common/src/main/scala/org/apache/paimon/spark/PaimonScanBuilder.scala
@@ -116,10 +116,6 @@ class PaimonScanBuilder(table: InnerTable)
     }
 
     val order = orders(0)
-    if (!order.expression().isInstanceOf[NamedReference]) {
-      return false
-    }
-
     val fieldName = orders.head.expression() match {
       case nr: NamedReference => nr.fieldNames.mkString(".")
       case _ => return false
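
Editor's note: the option keys listed in the diff above could be set on an existing table roughly as sketched below. This is only an illustration under stated assumptions, not part of the commit: the table name `my_table` and the column names `user_id`, `class_id` and `score` are hypothetical, the values for `fpp` and `items` are arbitrary, and the `ALTER TABLE ... SET (...)` form assumes a Flink SQL session (other engines may use a different clause for table properties).

```sql
-- Hypothetical table and columns; only the option keys come from the docs in this commit.
-- Assumes Flink SQL syntax for setting table options.
ALTER TABLE my_table SET (
    -- Bloom filter index on user_id, tuned per column.
    'file-index.bloom-filter.columns' = 'user_id',
    'file-index.bloom-filter.user_id.fpp' = '0.01',      -- false positive probability (illustrative value)
    'file-index.bloom-filter.user_id.items' = '100000',  -- expected distinct items per data file (illustrative value)
    -- Range-bitmap index on the columns used in the EQUALS / RANGE / TOPN examples.
    'file-index.range-bitmap.columns' = 'class_id,score'
);
```

As the documentation in the diff describes, these options would be configured before running the `rewrite_file_index` procedure so that existing data files pick up the index; the exact procedure call is left out here because its signature is not shown in this commit.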