This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new bcd01c10463 [DOCS] Add notes for column stats in tech spec (#10753)
bcd01c10463 is described below

commit bcd01c10463611be2b073199faabb0d4a9dd58db
Author: Shiyan Xu <[email protected]>
AuthorDate: Mon Feb 26 00:36:12 2024 -0600

    [DOCS] Add notes for column stats in tech spec (#10753)
---
 website/src/pages/tech-specs.md | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/website/src/pages/tech-specs.md b/website/src/pages/tech-specs.md
index fb3b67e63ab..a56c6032ea1 100644
--- a/website/src/pages/tech-specs.md
+++ b/website/src/pages/tech-specs.md
@@ -180,22 +180,26 @@ hash-based joins, point-lookup queries, etc.
 |                           | `bloomFilter`  | bytes      | the actual bloom filter for the data file            |
 |                           | `isDeleted`    | boolean    | whether the bloom filter entry is valid              |
 
-- **column\_stats** - contains statistics of columns for all the records in the table. This enables fine 
-grained file pruning for filters and join conditions in the query. The actual payload is an instance of 
+- **column\_stats** - contains statistics of columns for all the records in the table. This enables fine-grained
+file pruning for filters and join conditions in the query. The actual payload is an instance of
 [HoodieMetadataColumnStats][17] (Refer the schema below).
 
 | Schema                      | Field Name               | Data Type                                  | Description                                   |
 |:----------------------------|:-------------------------|:-------------------------------------------|:----------------------------------------------|
-| HoodieMetadataColumnStats   | `fileName`               | string                                     | file name for which the column stat applies   |
-|                             | `columnName`             | string                                     | column name for which the column stat apples  |
+| HoodieMetadataColumnStats   | `fileName`               | string                                     | file name to which the column stat applies    |
+|                             | `columnName`             | string                                     | column name to which the column stat applies  |
 |                             | `minValue`               | [Wrapper type][19] (based on data schema)  | minimum value of the column in the file       |
 |                             | `maxValue`               | [Wrapper type][19] (based on data schema)  | maximum value of the column in the file       |
 |                             | `valueCount`             | long                                       | total count of values                         |
-|                             | `nullCount`              | long                                       | total count of null values                    |
+|                             | `nullCount`              | long                                       | total count of `null` values                  |
 |                             | `totalSize`              | long                                       | total storage size on disk                    |
 |                             | `totalUncompressedSize`  | long                                       | total uncompressed storage size on disk       |
 |                             | `isDeleted`              | boolean                                    | whether the column stat entry is valid        |
 
+Notes:
+- By default, all top-level fields are indexed for column stats.
+- When a top-level field is nested, it won't be indexed by default. Dot-notation will be recognized for indexing sub-fields via manual configuration, e.g., `set hoodie.metadata.index.column.stats.column.list=foo.a.b,foo.c`
+
 - **record\_index** - contains information about record keys and their location in the dataset. This improves 
 performance of updates since it provides file locations for the updated records and also enables fine grained 
 file pruning for filters and join conditions in the query. The payload is an instance of 
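
The column_stats description in the diff above says the per-file `minValue`/`maxValue` statistics enable file pruning for query filters. The following is a minimal Python sketch of that idea, not Hudi's actual implementation: the `ColumnStat` class and the `files_to_scan` function are hypothetical names that only mirror the schema fields, and treating `isDeleted == true` as "ignore this entry" is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Iterable, List


@dataclass
class ColumnStat:
    """Hypothetical in-memory mirror of one HoodieMetadataColumnStats entry."""
    file_name: str
    column_name: str
    min_value: Any
    max_value: Any
    value_count: int
    null_count: int
    is_deleted: bool = False


def files_to_scan(all_files: Iterable[str], stats: Iterable[ColumnStat],
                  column: str, lo: Any, hi: Any) -> List[str]:
    """Return the files that may contain rows with lo <= column <= hi.

    A file is skipped only when its stats prove that no row can match.
    Files without a usable stat entry are kept, since nothing is known
    about their contents.
    """
    candidates = set(all_files)
    for s in stats:
        # Assumption: entries flagged as deleted carry no usable information.
        if s.is_deleted or s.column_name != column:
            continue
        all_null = s.null_count == s.value_count
        # Short-circuit on all_null so None min/max values are never compared.
        if all_null or s.max_value < lo or s.min_value > hi:
            candidates.discard(s.file_name)
    return sorted(candidates)


if __name__ == "__main__":
    stats = [
        ColumnStat("f1.parquet", "price", 1, 10, 100, 0),
        ColumnStat("f2.parquet", "price", 50, 90, 100, 0),
    ]
    print(files_to_scan(["f1.parquet", "f2.parquet"], stats, "price", 5, 20))
```

Keeping files that have no stat entry is the conservative choice in this sketch: pruning is an optimization, so a missing entry must never cause matching rows to be skipped.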
