manojpec commented on a change in pull request #3989:
URL: https://github.com/apache/hudi/pull/3989#discussion_r771612579



##########
File path: rfc/rfc-37/rfc-37.md
##########
@@ -0,0 +1,286 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-37: Metadata based Bloom Index
+
+## Proposers
+- @nsivabalan
+- @manojpec
+
+## Approvers
+ - @vinothchandar
+ - @satishkotha
+
+## Status
+JIRA: https://issues.apache.org/jira/browse/HUDI-2703
+
+## Abstract
+Hudi maintains several indices to locate/map incoming records to file groups during writes. The most commonly
+used record index is the HoodieBloomIndex. Larger tables and the global index variant have performance issues,
+as the bloom filters from a large number of data files need to be read and looked up. Reading from many files
+over cloud object storage like S3 also runs into request throttling. We propose building a new metadata index
+(a metadata table based bloom index) to boost the performance of the existing bloom index.
+
+## Background
+HoodieBloomIndex is used to find the location of incoming records during every write. The bloom index assists
+Hudi in deterministically routing records to a given file group and in distinguishing inserts from updates. This
+aggregate bloom index is built from several bloom filters stored in the base file footers. Prior to the bloom
+filter lookup, file pruning for the incoming records is also done based on the record key min/max stats stored in
+the base file footers. In this RFC, we plan to build a new index for the bloom filters under the metadata table
+to assist in bloom index based record location tagging.
+
+## Design
+HoodieBloomIndex involves the following steps to find the right location of incoming records:
+1. Find all the interested partitions and list all their data files.
+2. File pruning: Load record key min/max details from all the interested data file footers. Filter files and
+   generate a files-to-keys mapping for the incoming records, based on the key ranges, using a range interval
+   tree built from the previously loaded min/max details.
+3. Bloom filter lookup: Further prune the files-to-keys mapping for the incoming keys based on the bloom filter
+   key lookup.
+4. Final lookup in the actual data files to find the right location of every incoming record.
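The range-based pruning in step 2 can be sketched as follows. This is an illustrative Python sketch, not Hudi's implementation; the function and file names are hypothetical, and a production version would use a range interval tree rather than the linear scan shown here.

```python
# Hypothetical sketch of step 2 (file pruning): given per-file record-key
# min/max stats, keep only the files whose range could contain each key.
def prune_files(key_ranges, record_keys):
    """key_ranges: {file_id: (min_key, max_key)} loaded from base file footers.
    Returns {file_id: [candidate keys]} for the bloom filter lookup stage."""
    candidates = {}
    for key in record_keys:
        for file_id, (lo, hi) in key_ranges.items():
            if lo <= key <= hi:  # key may exist in this file
                candidates.setdefault(file_id, []).append(key)
    return candidates

ranges = {"f1": ("id_000", "id_499"), "f2": ("id_500", "id_999")}
print(prune_files(ranges, ["id_250", "id_750"]))
# {'f1': ['id_250'], 'f2': ['id_750']}
```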
+
+As we can see from steps 1 and 2, we need the min and max values of "_hoodie_record_key" and the bloom filters
+from all interested data files to perform the location tagging. In this design, we will add these key stats and
+bloom filters to the metadata table, and will thereby be able to quickly load the interested details and do
+faster lookups.
+
+The metadata table already has one partition, `files`, to help with partition file listing. For the metadata
+table based indices, we propose adding the following two new partitions:
+1. `bloom_filter` - for the file level bloom filter
+2. `column_stats` - for the key range stats
+
+Why the metadata table:
+The metadata table uses HBase HFile - a map file format - to store and retrieve data. HFile is an indexed file
+format and supports fast, map-like lookups by key. Since we will be storing stats/bloom filters for every file
+and the index will do lookups by file, we should be able to benefit from the faster lookups in HFile.
+
+<img src="metadata_index_1.png" alt="High Level Metadata Index Design" 
width="480"/>
+
+The following sections describe the different partitions and key formats, and then dive into the data and
+control flows.
+
+### MetaIndex/BloomFilter:
+
+A new partition, `bloom_filter`, will be added under the metadata table. Bloom filters from all the base files
+in the data table will be added here. The metadata table is already in the HFile format. The existing metadata
+payload schema will be extended and shared for this partition as well. The type field will be used to detect the
+bloom filter payload record. Here is the schema for the bloom filter payload record:
+```
+           {
+            "doc": "Metadata about base file bloom filters",
+            "name": "BloomFilterMetadata",
+            "type": [
+                "null",
+                {
+                    "doc": "Base FileID and its BloomFilter details",
+                    "name": "HoodieMetadataBloomFilter",
+                    "type": "record",
+                    "fields": [
+                        {
+                            "doc": "Version/type of the bloom filter metadata",
+                            "name": "version",
+                            "type": "string"
+                        },
+                        {
+                            "doc": "Instant timestamp when this metadata was 
created/updated",
+                            "name": "timestamp",
+                            "type": "string"
+                        },
+                        {
+                            "doc": "Bloom filter binary byte array",
+                            "name": "bloomfilter",
+                            "type": "bytes"
+                        },
+                        {
+                            "doc": "True if this entry is valid",
+                            "name": "valid",
+                            "type": "boolean"
+                        }
+                    ]
+                }
+            ]
+        }
+```
+
+The key for the bloom filter record would be an encoded string representing the partition and base file
+combination. The partition and file names are converted to deterministic hash based IDs, which are then base64
+encoded. Hash based IDs are easy to generate for incoming new insert records and for looking up update records,
+and they don't need any dictionary for reverse lookups. The number of hash bits is chosen based on the
+cardinality and the collision probability desired for the maximum supported deployment scale. Base64 encoding
+the hash IDs further reduces the on-disk storage space for these keys.
+
+```
+key = base64_encode(hash64(partition name) + hash128(file path))
+```
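The key construction above can be illustrated as follows. This is a hypothetical Python sketch: Hudi's actual hash functions and byte layout are not specified here, so truncated MD5 digests stand in for hash64/hash128.

```python
import base64
import hashlib

def hash64(s: str) -> bytes:
    return hashlib.md5(s.encode()).digest()[:8]   # stand-in 64-bit hash

def hash128(s: str) -> bytes:
    return hashlib.md5(s.encode()).digest()       # stand-in 128-bit hash

def bloom_index_key(partition: str, file_path: str) -> str:
    # Concatenate the fixed-width hashes, then base64 encode the 24 bytes.
    raw = hash64(partition) + hash128(file_path)
    return base64.b64encode(raw).decode()

key = bloom_index_key("2021/12/01", "part-0001.parquet")
print(len(key))  # 24 raw bytes encode to 32 base64 characters
```

Because the key is deterministic, the same (partition, file) pair always maps to the same index entry, which is what makes pointed lookups possible without a reverse dictionary.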
+
+<img src="metadata_index_bloom_partition.png" alt="Bloom filter partition" 
width="500"/>
+
+### MetaIndex/ColumnStats:
+
+Another new partition, `column_stats`, will also be added under the metadata table to make the record key lookup
+code path much more performant. This metadata index also helps with data skipping (please see RFC-27 for more
+details). In the context of faster record key lookups for update use cases, we propose using the `column_stats`
+index for file pruning when generating the file-to-candidate-keys mapping for the update records. The existing
+metadata payload schema will be extended and shared for this partition as well. The type field will be used to
+detect the column stats payload record. Here is the schema for the column stats payload record:
+
+```
+        {
+            "name": "ColumnStatsMetadata",
+            "doc": "Contains information about column ranges for all data 
files in the table",
+            "type": [
+                "null",
+                {
+                    "type": "record",
+                    "name": "HoodieColumnStats",
+                    "fields": [
+                        {
+                            "name": "rangeLow",
+                            "type": [
+                                "null",
+                                "string"
+                            ],
+                            "doc": "Low end of the range. For now, this is a 
String. Based on main data table schema, we can convert it to appropriate type"

Review comment:
       Right. Will plan for the unification part.

##########
File path: rfc/rfc-37/rfc-37.md
##########
@@ -0,0 +1,286 @@
+## Design
+HoodieBloomIndex involves the following steps to find the right location of 
incoming records
+1. Find all the interested partitions and list all its data files.
+2. File Pruning: Load record key min/max details from all the interested data 
file footers. Filter files and generate

Review comment:
       This is the existing/today's design. In the proposed model below, file pruning will be based off the ColumnStats index from the new metadata partition. There are 3 projects which need to be unified at the infra level -- (1) Record Level Indexing (2) Bloom and Column Stats Indexing (3) ZOrdering/LayoutOptimization. We probably need to do that after the initial implementation.

##########
File path: rfc/rfc-37/rfc-37.md
##########
@@ -0,0 +1,286 @@
+```
+        {
+            "name": "ColumnStatsMetadata",
+            "doc": "Contains information about column ranges for all data 
files in the table",
+            "type": [
+                "null",
+                {
+                    "type": "record",
+                    "name": "HoodieColumnStats",

Review comment:
       Added nullCount to the schema. 

##########
File path: rfc/rfc-37/rfc-37.md
##########
@@ -0,0 +1,286 @@
+```
+        {
+            "name": "ColumnStatsMetadata",
+            "doc": "Contains information about column ranges for all data 
files in the table",
+            "type": [
+                "null",
+                {
+                    "type": "record",
+                    "name": "HoodieColumnStats",
+                    "fields": [
+                        {
+                            "name": "rangeLow",
+                            "type": [
+                                "null",
+                                "string"
+                            ],
+                            "doc": "Low end of the range. For now, this is a 
String. Based on main data table schema, we can convert it to appropriate type"
+                        },
+                        {
+                            "name": "rangeHigh",
+                            "type": [
+                                "null",
+                                "string"
+                            ],
+                            "doc": "High end of the range. For now, this is a 
String. Based on main data table schema, we can convert it to appropriate type"
+                        },
+                        {
+                            "name": "isDeleted",
+                            "type": "boolean",
+                            "doc": "True if this file has been deleted"
+                        }
+                    ]
+                }
+            ]
+        }
+```
+
+Column stats records hold the key ranges (min and max) for a file. The key for the column stats record would be
+an encoded string representing the tuple of column name, partition name and base file. The string names of these
+fields are converted to deterministic hash based IDs, which are then base64 encoded, just like the bloom filter
+key.
+
+```
+key = base64_encode(hash64(column name) + hash64(partition name) + 
hash128(file path))
+```
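As a sketch, the column stats key extends the bloom filter key scheme with a leading column-name hash. The hash function below is a stand-in, not Hudi's actual implementation; because all components are fixed width, the keys stay deterministic and pointed lookups need no dictionary.

```python
import base64
import hashlib

def _hash(s: str, nbytes: int) -> bytes:
    # Stand-in fixed-width hash; Hudi's actual hash functions may differ.
    return hashlib.md5(s.encode()).digest()[:nbytes]

def column_stats_key(column: str, partition: str, file_path: str) -> str:
    # hash64(column) + hash64(partition) + hash128(file path), base64 encoded
    raw = _hash(column, 8) + _hash(partition, 8) + _hash(file_path, 16)
    return base64.b64encode(raw).decode()

k = column_stats_key("_hoodie_record_key", "2021/12/01", "part-0001.parquet")
assert k == column_stats_key("_hoodie_record_key", "2021/12/01", "part-0001.parquet")
```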
+
+While hash based IDs have quite a few desirable properties in the context of Hudi index lookups, they do have an
+impact on column level schema changes. Refer to the [Schema Evolution](#Schema-Evolution) section for more
+details.
+
+The picture below gives a pictorial representation of the column stats partition in the metadata table.
+<img src="metadata_index_col_stats.png" alt="Column Stats Partition" 
width="480"/>
+
+### Metadata Index lookup:
+
+For the incoming upsert records, given their keys, we tag their current locations. The new algorithm for the
+index lookup would be:
+
+1. Generate the list of partitions and the list of keys under each partition to be looked up.
+2. For all the involved partitions, load their file lists.
+3. Level 1: Range pruning using the `column_stats` index:
+   1. For each record key, generate the column stats index lookup key based on the tuple
+      (_hoodie_record_key, partition name, file path)
+   2. Do a meta index lookup with the above key and, if available, get the value payload with the column stats
+      details
+   3. Build the key range interval tree from all the looked-up keys
+   4. Prune the partitions and their candidate files based on the range interval lookups
+4. Level 2: Record pruning using the `bloom_filter` index:
+   1. From the shortlisted file candidates per partition, generate the bloom filter index lookup key based on
+      the tuple (partition name, file path)
+   2. Do a meta index lookup with the above key to load the base file bloom filter
+   3. Do a bloom filter lookup for the record key to generate the probable keys
+5. Level 3: Record validation:
+   1. Given the list of files and their probable keys from the above pruning, do the actual file lookup to
+      confirm the keys
+   2. Return the location (file id) of the final matching keys
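The three-level lookup can be condensed into the following Python sketch. The helper callables (`col_stats_lookup`, `bloom_lookup`, `file_scan`) are hypothetical stand-ins for the metadata index and data file accesses described in the steps above.

```python
# Illustrative pipeline for the three-level index lookup; helper callables
# are stand-ins for the metadata index and data file reads.
def tag_locations(record_keys_by_partition, col_stats_lookup,
                  bloom_lookup, file_scan):
    locations = {}
    for partition, keys in record_keys_by_partition.items():
        # Level 1: range pruning via the column_stats index
        candidates = col_stats_lookup(partition, keys)  # {file: [keys]}
        # Level 2: bloom filter pruning via the bloom_filter index
        probable = {f: [k for k in ks if bloom_lookup(partition, f, k)]
                    for f, ks in candidates.items()}
        # Level 3: confirm probable keys against the actual data files
        for f, ks in probable.items():
            for k in file_scan(partition, f, ks):
                locations[k] = f  # tagged location (file id) for this key
    return locations
```

Keys that survive all three levels are tagged with their file id; keys eliminated at any level are treated as inserts.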
+
+### Schema Evolution:
+
+Hash ID based keys are deterministically generated from the input tuple. That is, for the tuple consisting of
+column name, partition name and file name, the key generated will always be the same. So, a table whose schema
+changes over time has an impact on the keys already generated. The most common schema evolution use cases, like
+changing a column's type or adding a new column, are not affected though. Other, relatively uncommon, use cases
+like renaming a column, or dropping a column and adding a new column with the dropped name, would have indices
+referring to them more than needed. This can lead to the index lookup matching stale/new records across evolved
+schemas.
+
+To avoid looking up stale/new index records, here are the design options we have:
+1. (Preferred) Query rewrite / result recordset pruning
+   1. The schema evolution layer should introduce a query rewrite stage to detect evolved schemas for the input
+      query and optionally include additional predicates in the query
+   2. The resultant recordset can also be pruned based on the commit time and the schema change time
+2. Making the input tuple set schema aware
+   1. Along with the column name, partition name and file path, a version/tag can also be added to make the
+      generated key schema specific. But this choice has a performance impact, as the lookup now has to be
+      prefix based instead of a pointed lookup. That is, the index lookup has to return records for all the
+      versions/tags

Review comment:
       Schema versions and evolution logic are not pushed to the index layer. The layer above has to modify the
query based on its knowledge of when the evolution happened and what changed. That is, the index is going to return
all possible results for the key being looked up, and the layer above has to do any pruning if needed.
   
   Regarding the index not being available, yes, we are taking care of this in the async index building project.
Not all indices will be available and ready for use at all times, and we have to fall back if one is not available.
   

##########
File path: rfc/rfc-37/rfc-37.md
##########
@@ -0,0 +1,286 @@
+## Background
+HoodieBloomIndex is used to find the location of incoming records during every 
write. Bloom index assists Hudi in
+deterministically routing records to a given file group and to distinguish 
inserts vs updates. This aggregate bloom
+index is built from several bloom filters stored in the base file footers. 
Prior to bloom filter lookup, the file
+pruning for the incoming records is also done based on the record key min/max 
stats stored in the base file footers.
+In this RFC, we plan to build a new index for the bloom filters under the metadata table to assist in
+bloom index based record location tagging.
+
+## Design
+HoodieBloomIndex involves the following steps to find the right location of incoming records:
+1. Find all the interested partitions and list all their data files.
+2. File pruning: Load record key min/max details from all the interested data file footers. Filter files and
+   generate a files to keys mapping for the incoming records based on the key ranges, using a range interval tree
+   built from the previously loaded min/max details.
+3. Bloom filter lookup: Filter files and prune the files to keys mapping for the incoming keys based on bloom
+   filter key lookups.
+4. Final lookup in the actual data files to find the right location of every incoming record.
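Step 2 above can be sketched as follows. This is an illustrative Python sketch with assumed data shapes (a dict of file id to `(min, max)` record-key range); Hudi actually builds an interval tree over the ranges rather than scanning them linearly.

```python
def prune_files_by_range(record_keys, file_ranges):
    """File pruning sketch: keep only files whose [min, max] record-key
    range could contain each incoming key.

    record_keys: list of incoming record keys
    file_ranges: dict mapping file id -> (key_min, key_max)
    Returns a dict mapping record key -> list of candidate file ids.
    """
    candidates = {}
    for key in record_keys:
        candidates[key] = [
            file_id
            for file_id, (key_min, key_max) in file_ranges.items()
            if key_min <= key <= key_max
        ]
    return candidates
```

A key that falls outside every file's range maps to an empty candidate list, which is how pure inserts are distinguished from updates before any bloom filter is consulted.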
+
+As we can see from steps 2 and 3, we need the min and max values of "_hoodie_record_key" and the bloom filters
+from all interested data files to perform the location tagging. In this design, we will add these key stats and
+bloom filters to the metadata table, thereby enabling quick loading of the interested details and faster lookups.
+
+The metadata table already has one partition, `files`, to help with partition file listing. For the metadata
+table based indices, we are proposing to add the following two new partitions:
+1. `bloom_filter` - for the file level bloom filter
+2. `column_stats` - for the key range stats
+
+Why metadata table:
+The metadata table uses HBase's HFile, a map file format, to store and retrieve data. HFile is an indexed file
+format and supports fast map-like lookups by key. Since we will be storing stats/bloom filters for every file and
+the index will do lookups based on files, we should be able to benefit from the faster lookups in HFile.
+
+<img src="metadata_index_1.png" alt="High Level Metadata Index Design" 
width="480"/>
+
+The following sections describe the different partitions and key formats, and then dive into the data and
+control flows.
+
+### MetaIndex/BloomFilter:
+
+A new partition `bloom_filter` will be added under the metadata table. Bloom filters from all the base files in
+the data table will be added here. The metadata table is already in the HFile format. The existing metadata
+payload schema will be extended and shared for this partition as well. The type field will be used to detect the
+bloom filter payload record. Here is the schema for the bloom filter payload record:
+```
+           {
+            "doc": "Metadata about base file bloom filters",
+            "name": "BloomFilterMetadata",
+            "type": [
+                "null",
+                {
+                    "doc": "Base FileID and its BloomFilter details",
+                    "name": "HoodieMetadataBloomFilter",
+                    "type": "record",
+                    "fields": [
+                        {
+                            "doc": "Version/type of the bloom filter metadata",
+                            "name": "version",
+                            "type": "string"
+                        },
+                        {
+                            "doc": "Instant timestamp when this metadata was 
created/updated",
+                            "name": "timestamp",
+                            "type": "string"
+                        },
+                        {
+                            "doc": "Bloom filter binary byte array",
+                            "name": "bloomfilter",
+                            "type": "bytes"
+                        },
+                        {
+                            "doc": "True if this entry is valid",
+                            "name": "valid",
+                            "type": "boolean"
+                        }
+                    ]
+                }
+            ]
+        }
+```
+
+The key for a bloom filter record would be an encoded string representing the partition and base file combo. The
+partition and file names are converted to deterministic hash based IDs, and then base64 encoded. Hash based IDs
+are easy to generate for incoming new insert records and for lookups for update records, and they don't need any
+dictionary for reverse lookups. Hash bits are chosen based on the cardinality and the collision probability
+desired at the maximum supported deployment scale. Base64 encoding the hash IDs further reduces the on-disk
+storage space for these keys.
+
+```
+key = base64_encode(hash64(partition name) + hash128(file path))
+```
+
+<img src="metadata_index_bloom_partition.png" alt="Bloom filter partition" 
width="500"/>
+
+### MetaIndex/ColumnStats:
+
+Another new partition `column_stats` will also be added under the metadata table to make the record key lookup
+code path much more performant. This metadata index also helps with data skipping (please look at RFC-27 for more
+details). In the context of faster record key lookups for update use cases, we propose using the `column_stats`
+index for file pruning when generating the file to candidate keys mapping for the update records. The existing
+metadata payload schema will be extended and shared for this partition as well. The type field will be used to
+detect the column stats payload record. Here is the schema for the column stats payload record:
+
+```
+        {
+            "name": "ColumnStatsMetadata",
+            "doc": "Contains information about column ranges for all data 
files in the table",
+            "type": [
+                "null",
+                {
+                    "type": "record",
+                    "name": "HoodieColumnStats",
+                    "fields": [
+                        {
+                            "name": "rangeLow",
+                            "type": [
+                                "null",
+                                "string"
+                            ],
+                            "doc": "Low end of the range. For now, this is a 
String. Based on main data table schema, we can convert it to appropriate type"
+                        },
+                        {
+                            "name": "rangeHigh",
+                            "type": [
+                                "null",
+                                "string"
+                            ],
+                            "doc": "High end of the range. For now, this is a 
String. Based on main data table schema, we can convert it to appropriate type"
+                        },
+                        {
+                            "name": "isDeleted",
+                            "type": "boolean",
+                            "doc": "True if this file has been deleted"
+                        }
+                    ]
+                }
+            ]
+        }
+```
+
+Column stats records hold the key ranges (min and max) for a file. The key for a column stats record would be an
+encoded string representing the tuple of column name, partition name and base file. The string names of these
+fields are converted to deterministic hash based IDs and then base64 encoded, just like the bloom filter key.
+
+```
+key = base64_encode(hash64(column name) + hash64(partition name) + 
hash128(file path))
+```
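Both key formats can be sketched as follows. This is illustrative only: `hashlib.sha256` truncated to the stated bit widths is a stand-in for whichever hash functions the actual implementation chooses.

```python
import base64
import hashlib

def _hash_bits(value: str, num_bits: int) -> bytes:
    """Deterministic hash of a string, truncated to num_bits
    (stand-in for the real hash function)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return digest[: num_bits // 8]

def bloom_filter_key(partition_name: str, file_path: str) -> str:
    """key = base64_encode(hash64(partition name) + hash128(file path))"""
    raw = _hash_bits(partition_name, 64) + _hash_bits(file_path, 128)
    return base64.b64encode(raw).decode("ascii")

def column_stats_key(column_name: str, partition_name: str,
                     file_path: str) -> str:
    """key = base64_encode(hash64(column name) + hash64(partition name)
                           + hash128(file path))"""
    raw = (_hash_bits(column_name, 64)
           + _hash_bits(partition_name, 64)
           + _hash_bits(file_path, 128))
    return base64.b64encode(raw).decode("ascii")
```

Because the keys are pure functions of their inputs, the same tuple always yields the same key with no dictionary needed for lookups; it is also why a dropped-and-readded column name would map back onto existing index entries, as discussed in the Schema Evolution section.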
+
+While hash based IDs have quite a few desirable properties in the context of Hudi index lookups, they are
+impacted by column level schema changes. Refer to the [Schema Evolution](#Schema-Evolution) section for more
+details.
+
+The picture below depicts the column stats partition in the metadata table.
+<img src="metadata_index_col_stats.png" alt="Column Stats Partition" 
width="480"/>
+
+### Metadata Index lookup:
+
+For the incoming upsert records, given their keys, we need to tag their current location. The new algorithm for
+the index lookup would be:
+
+1. Generate the list of partitions and the list of keys under each partition to be looked up
+2. For all the involved partitions, load their file lists
+3. Level 1: Range pruning using the `column_stats` index:
+   1. For each record key, generate the column stats index lookup key based on the tuple
+      (_hoodie_record_key, partition name, file path)
+   2. Do a meta index lookup with the above key and, if available, get the value payload with the column stats
+      details
+   3. Build a key range interval tree from all the looked up keys

Review comment:
       I see. Probably we can do the key to candidate file selection right at 
the time of loading column stats instead of having one more iteration of range 
interval tree lookups. Will investigate this further. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
