Expression Abstractions [hudi]

via GitHub Thu, 06 Feb 2025 18:25:37 -0800


cshuo commented on code in PR #12795:
URL: https://github.com/apache/hudi/pull/12795#discussion_r1945872572



##########
rfc/rfc-88/rfc-88.md:
##########
@@ -0,0 +1,601 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-88: New Schema/DataType/Expression Abstractions
+
+## Proposers
+
+- @cshuo
+- @danny0405
+
+## Approvers
+- ..
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-8966
+
+## Abstract
+
+Hudi is currently tightly coupled with Avro, particularly in its basic data types, schema, and the internal record
+representation used in the read/write paths. This coupling causes several issues: unnecessary record-level
+Ser/De costs are incurred when converting between engine-native rows and Avro records; the type system cannot be extended
+to support other complex or advanced types, such as Variant; and the basic read/write code cannot be effectively
+reused across engines. As for expressions, each engine currently has its own implementation of
+pushdown optimization, which is hard to extend as more indices are introduced.
+
+This RFC proposes improvements to the current Schema/Type/Expression abstractions to achieve the following goals:
+* Use a native schema as the authoritative schema, and make the type system extensible so that other types, e.g., Variant, can be supported or customized.
+* Abstract the common implementation of writers/readers and move it to the hudi-common module, so that engines only need to implement getters/setters for their specific rows (Flink RowData and Spark InternalRow).
+* Add a centralized, shareable expression abstraction for all kinds of expression pushdown across all engines, and integrate it deeply with the MDT indices.
+
+
+## Background
+### Two 'Schema's
+Hudi's table management currently has two schemas: the table schema in Avro format and a Hudi-native `InternalSchema`.
+During reading, writing, and other operations, there are numerous mutual conversions, reconciliations,
+and validations between the Avro table schema and `InternalSchema`, which make specific functionality harder to understand
+and maintain.
+
+#### 1. Avro Schema
+Hudi currently uses an Avro schema as the table schema to represent the structure of data written into the table. The table
+schema is stored in the metadata of each write commit to ensure that data of different versions can be resolved and read
+correctly. Specifically:
+* For reading: the Avro table schema is used throughout the scan process to properly build readers, apply scan optimizations, and deserialize the underlying data into specific records.
+* For writing: the Avro table schema is used to check the validity of incoming data, build proper file writers, and finally commit the data with the schema itself stored in the commit metadata.
+
+#### 2. InternalSchema
+`InternalSchema` was introduced in RFC-33 to support comprehensive schema evolution. The most notable feature of
+`InternalSchema` is that it adds an `id` attribute to each column field, which is used to track all column changes.
+Currently, `InternalSchema` is also stored in the metadata of each write commit if schema evolution is enabled.
+* For reading, with schema evolution enabled, `InternalSchema` is used to properly resolve data committed at different instants by reconciling the current table schema with the historical `InternalSchema`.
+* For writing, `InternalSchema` is needed to deduce the proper write schema by reconciling the input source schema with the latest table schema. This guarantees the compatibility of the read and write processes in schema evolution scenarios.
+
+### Unnecessary AVRO Ser/De
+Avro is the default representation when dealing with records (reading, writing, clustering, etc.). While this makes it simpler
+to share common functionality, such as reading and writing log blocks, it incurs unnecessary Ser/De costs
+when converting to and from engine-specific rows (RowData for Flink, InternalRow for Spark).
+Take Flink-Hudi as an example. For upsert streaming writes, the basic data transformation flow is:
+![flink_writing_avro_serde](flink_writing_avro_serde.png)
+
+For Flink streaming reads, the basic data transformation flow is:
+![flink_read_avro_serde](flink_read_avro_serde.png)
+
+As can be seen, there exist unnecessary record-level Avro Ser/De costs both in the log reading and writing process and 

Review Comment:
   > we need to pass RowData to writers and implement new writers 
   
   Yeah, that's OK, and it does not conflict with the goal of the RFC. The proposal here does 
not require that `RowData` be transformed into `HoodieRecord` before 
shuffling to the writer operator; the conversion can be done inside the writer operator just 
before the actual write.
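
The ordering described in this comment can be illustrated with a small, engine-agnostic sketch. Everything in it is hypothetical: `Row`, `Record`, `shuffle_by_key`, and `WriterOperator` are stand-ins invented for illustration and are not Flink or Hudi APIs. It only shows the idea that native rows cross the shuffle unchanged and are wrapped into a record type inside the writer, right before writing.

```python
import zlib
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Hypothetical stand-in for an engine-native row: (record key, value).
Row = Tuple[str, int]


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for a record wrapper used by the writer."""
    key: str
    payload: Row


def shuffle_by_key(rows: Iterable[Row], parallelism: int) -> List[List[Row]]:
    """Route native rows to writer subtasks by key. No conversion happens here."""
    buckets: List[List[Row]] = [[] for _ in range(parallelism)]
    for row in rows:
        # crc32 gives a deterministic hash, so equal keys always share a bucket.
        buckets[zlib.crc32(row[0].encode()) % parallelism].append(row)
    return buckets


class WriterOperator:
    """Hypothetical writer subtask that converts rows only at write time."""

    def __init__(self) -> None:
        self.written: List[Record] = []

    def process(self, row: Row) -> None:
        # The native row is wrapped into a Record here, just before writing.
        record = Record(key=row[0], payload=row)
        self.written.append(record)
```

In this sketch the shuffle step only ever sees `Row` values, so whether and where to build the record wrapper stays a decision local to the writer.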






Re: [PR] [HUDI-8966] RFC-88: New Schema/DataType/Expression Abstractions [hudi]
