[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6345: [HUDI-4552]: RFC-58: Integrate column stats index with all query engines

GitBox Thu, 25 Aug 2022 12:02:34 -0700


alexeykudinkin commented on code in PR #6345:
URL: https://github.com/apache/hudi/pull/6345#discussion_r955319299



##########
rfc/rfc-58/rfc-58.md:
##########
@@ -0,0 +1,69 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-58: Integrate column stats index with all query engines
+
+
+
+## Proposers
+
+- @pratyakshsharma
+
+## Approvers
+- @bhavanisudha
+- @danny0405
+- @codope
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4552
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Query engines like Hive or Presto typically scan a large amount of data for 
query planning and execution. Proper indexing can greatly reduce this scan. 
Parquet is the most commonly used file format for storing 
columnar data with various lakehouse technologies, mainly because of its strong 
support in Spark and
+the indexing it employs at different levels. Parquet files 
maintain indexes at the file, row group, and page levels. Until recently, 
Hudi used these indexes for fast querying via the Parquet 
reader libraries. The problem with this approach was that every file had to 
be opened once to read the index stored in the Parquet footer before any file 
pruning could be done. This could become a bottleneck with a large number of
+files. With the introduction of the [multi-modal 
index](https://www.onehouse.ai/blog/introducing-multi-modal-index-for-the-lakehouse-in-apache-hudi)
 in Hudi, this problem has been largely solved. Currently, data 
skipping using this multi-modal index is available for the Spark and 
[Flink](https://issues.apache.org/jira/browse/HUDI-4353) engines. This RFC 
intends to extend this support to other query engines such as Presto, Trino, 
and Hive. 
+
+## Background
+[RFC-27](https://github.com/apache/hudi/blob/master/rfc/rfc-27/rfc-27.md) 
added a new partition for the column_stats index to Hudi's metadata table. 
We plan to use the information stored in this partition to prune 
files. 
+
+## Implementation
+Describe the new thing you want to do in appropriate detail, how it fits into 
the project architecture.
+Provide a detailed description of how you intend to implement this 
feature. This may be fairly extensive and have large subsections of its own.
+Or it may be a few sentences. Use judgement based on the scope of the change.
+
+We propose two different approaches for integrating the column stats index with 
different query engines and discuss their pros and cons below.
+1. **Using domains** - Presto and Trino have the concept of column domains. 
A domain is the set of values that a particular column may take in the rows a 
query needs to return. Domains are created when splits are generated for 
processing. They contain a map from column to possible values, where 
the possible values are populated after combining all the filter predicates 
supplied as part of the query. 
[This draft PR](https://github.com/apache/hudi/pull/6087) shows the use of 
these domains for integrating the data skipping index with the Presto engine. 
+This involves exposing a new API in HoodieTableMetadata.java, as 
below: 
+
+```java
+FileStatus[] getFilesToQueryUsingCSI(List<String> columns, ColumnDomain<ColumnHandle> columnDomain) throws IOException;

Review Comment:
   @pratyakshsharma I think we're missing quite a few core concepts here 
that we should expand on: 
   
   - Clearly define what `ColumnDomain` and `ColumnHandle` are
   - How filtering will be performed
   - How it's going to be generalized from where it is today to the proposed 
architecture
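
To make the filtering question concrete, here is a minimal sketch of how per-file min/max statistics could be checked against a per-column value range. Everything in it is an assumption for illustration: `ValueRange`, `FileColumnStats`, `mayContain`, and `prune` are hypothetical names, not types or methods from Hudi, Presto, or Trino, and real domains can hold more than a single closed range of `long` values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical, engine-neutral sketch. None of these types exist in Hudi, Presto, or Trino.
final class ColumnStatsPruningSketch {

    /** Closed range [low, high] of values that the query filter allows for one column. */
    static final class ValueRange {
        final long low;
        final long high;

        ValueRange(long low, long high) {
            this.low = low;
            this.high = high;
        }
    }

    /** Min/max of one column in one file, as a column stats index could record it. */
    static final class FileColumnStats {
        final String fileName;
        final String column;
        final long min;
        final long max;

        FileColumnStats(String fileName, String column, long min, long max) {
            this.fileName = fileName;
            this.column = column;
            this.min = min;
            this.max = max;
        }
    }

    /** A file may hold matching rows only if its [min, max] overlaps the allowed range. */
    static boolean mayContain(FileColumnStats stats, ValueRange range) {
        return stats.min <= range.high && stats.max >= range.low;
    }

    /** Keeps every file that no per-column range rules out. Files without stats are kept. */
    static List<String> prune(List<String> files,
                              List<FileColumnStats> stats,
                              Map<String, ValueRange> filter) {
        List<String> kept = new ArrayList<>();
        for (String file : files) {
            boolean candidate = true;
            for (FileColumnStats s : stats) {
                ValueRange range = filter.get(s.column);
                if (s.fileName.equals(file) && range != null && !mayContain(s, range)) {
                    candidate = false; // stats prove that no row in this file can match
                    break;
                }
            }
            if (candidate) {
                kept.add(file);
            }
        }
        return kept;
    }
}
```

The sketch is deliberately conservative: a file is skipped only when its statistics prove that no row can match, so files with missing statistics are still read.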



##########
rfc/rfc-58/rfc-58.md:
##########
@@ -0,0 +1,69 @@
+1. **Using domains** - Presto and Trino have the concept of column domains. 
A domain is the set of values that a particular column may take in the rows a 
query needs to return. Domains are created when splits are generated for 
processing. They contain a map from column to possible values, where 
the possible values are populated after combining all the filter predicates 
supplied as part of the query. 
[This draft PR](https://github.com/apache/hudi/pull/6087) shows the use of 
these domains for integrating the data skipping index with the Presto engine. 

Review Comment:
   Hudi has actually been pretty successful at keeping its engine-agnostic code 
generic enough and not (too) opinionated toward any particular implementation, 
which has made integrations with other engines relatively easy. We should stay 
the course and make sure we don't get entangled w/ any particular engine.
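
One way to picture that separation, as a rough sketch only: shared code depends on a small engine-neutral filter interface, and each engine module supplies an adapter from its own filter type. `ColumnRangeFilter`, `FakeEngineDomain`, and `fromFakeEngine` are hypothetical names invented for this illustration and do not exist in Hudi or in any engine.

```java
// Hypothetical sketch of keeping engine-specific filter types out of shared code.
final class EngineNeutralFilterSketch {

    /** Engine-neutral view of a filter on one column. Shared code would see only this. */
    interface ColumnRangeFilter {
        String column();

        /** True if a file whose column values span [min, max] could hold matching rows. */
        boolean overlaps(long min, long max);
    }

    /** Stand-in for an engine-specific filter type, reduced here to one closed range. */
    static final class FakeEngineDomain {
        final String column;
        final long low;
        final long high;

        FakeEngineDomain(String column, long low, long high) {
            this.column = column;
            this.low = low;
            this.high = high;
        }
    }

    /** Adapter that would live in the engine module and translate to the neutral interface. */
    static ColumnRangeFilter fromFakeEngine(FakeEngineDomain domain) {
        return new ColumnRangeFilter() {
            @Override
            public String column() {
                return domain.column;
            }

            @Override
            public boolean overlaps(long min, long max) {
                return min <= domain.high && max >= domain.low;
            }
        };
    }
}
```

With this shape, supporting another engine would mean writing another adapter rather than changing the shared pruning code.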
   
   



