zhangyue19921010 commented on code in PR #12795: URL: https://github.com/apache/hudi/pull/12795#discussion_r1950323640
########## rfc/rfc-88/rfc-88.md: ##########
@@ -0,0 +1,601 @@
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements. See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->
# RFC-88: New Schema/DataType/Expression Abstractions

## Proposers

- @cshuo
- @danny0405

## Approvers
- ..

## Status

JIRA: https://issues.apache.org/jira/browse/HUDI-8966

## Abstract

Hudi is currently tightly coupled with Avro, particularly in terms of basic data types, schema, and the internal record representation used in the read/write paths. This coupling leads to numerous issues: unnecessary record-level Ser/De costs are introduced by converting between engine-native rows and Avro records; the data types cannot be extended to support other complex/advanced types, such as Variant; and the basic read/write functionality code cannot be effectively reused among different engines. As for Expression, different engines currently have their own implementations of pushdown optimization, which is not friendly to extension as more indices are introduced.

This RFC proposes an improvement to the current Schema/Type/Expression abstractions, to achieve the following goals:
* Use a native schema as the authoritative schema, and make the type system extensible to support or customize other types, e.g., Variant.
* Abstract the common implementation of writers/readers and move it to the hudi-common module, so that engines just need to implement getters/setters for their specific rows (Flink RowData and Spark InternalRow).
* Add a centralized and shareable expression abstraction for all kinds of expression pushdown across all engines, and integrate it deeply with the MDT indices.


## Background
### Two 'Schema's
Two schemas currently exist in Hudi's table management: the table schema in Avro format and the Hudi-native `InternalSchema`. During reading, writing and other operations, there are numerous mutual conversions, reconciliations, and validation logics between the Avro table schema and `InternalSchema`, which makes the affected functionalities harder to understand and maintain.

#### 1. Avro Schema
Hudi currently uses the Avro schema as the table schema, to represent the structure of data written into the table. The table schema is stored in the metadata of each commit to ensure that data of different versions can be resolved and read correctly. Specifically:
* For reading: the Avro table schema is used throughout the scan process, to properly build readers, perform scan optimizations, and deserialize the underlying data into specific records.
* For writing: the Avro table schema is used to check the validity of incoming data, build proper file writers, and finally commit the data with the schema itself stored in the commit metadata.
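As a concrete illustration (the record and field names here are made up), the following builds a toy Avro schema and prints the JSON string form in which a table schema is persisted in commit metadata:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class AvroTableSchemaExample {
  public static void main(String[] args) {
    // A toy table schema; Hudi persists schemas like this as a JSON string
    // in the commit metadata so that every commit's data can be re-read later.
    Schema tableSchema = SchemaBuilder.record("trip_record").fields()
        .requiredString("uuid")            // e.g. the record key field
        .requiredString("partition_path")
        .optionalDouble("fare")
        .endRecord();
    System.out.println(tableSchema.toString(true)); // the persisted JSON form
  }
}
```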
#### 2. InternalSchema
`InternalSchema` was introduced in RFC-33 to support comprehensive schema evolution. Its most notable feature is that it adds an `id` attribute to each column field, which is used to track all column changes. Currently, `InternalSchema` is also stored in the metadata of each commit if schema evolution is enabled.
* For reading, with schema evolution enabled, `InternalSchema` is used to properly resolve data committed at different instants, by reconciling the current table schema with the historical `InternalSchema`.
* For writing, `InternalSchema` is necessary to deduce the proper writing schema by reconciling the input source schema with the latest table schema. In this way, compatibility between the reading and writing processes in schema evolution scenarios is well guaranteed.

### Unnecessary AVRO Ser/De
Avro is the default representation when dealing with records (reading, writing, clustering, etc.). While this makes it simpler to share common functionalities, such as reading and writing of log blocks, it incurs unnecessary Ser/De costs between the engine-specific row (RowData for Flink, InternalRow for Spark) and the Avro record.
Take Flink-Hudi as an example. For the upsert streaming writing cases, the basic data transformation flow is:


For the Flink streaming reading cases, the basic data transformation flow is:


As can be seen, there are unnecessary record-level Avro Ser/De costs in both the log reading and writing processes, and a similar problem exists when Spark writes hoodie logs with the Avro HoodieRecord. These costs can be eliminated if the Avro record is substituted with the engine-specific row. RFC-46 made a good start on removing the need for conversion from the engine-specific internal representation to Avro. Currently, only Spark and Hive have implemented the new `HoodieRecord`, and another issue arises: there is no engine-agnostic basic infrastructure for writers/readers across different engines and formats. We elaborate on this issue in the next part.

### Separated File Readers/Writers for Different Engines
While Hudi works seamlessly with various query engines, such as Spark, Flink, Hive, etc., the underlying basic infrastructures for file readers/writers are not highly abstracted. Of all the engines, Spark and Hudi are the most seamlessly integrated. The Spark-Hudi integration leverages Spark's core read/write functionalities heavily. While these functionalities are well-tested and performant in Spark use cases, they cannot be reused in Hudi's integration with other engines.

As an example, the Spark-Hudi integration uses `ParquetFileFormat` from Spark to build readers for parquet files, and uses `ParquetWriteSupport` to build writers to handle parquet file writing. These reading/writing infras cannot be reused directly in other engine integrations without abstraction; e.g., the Flink-Hudi module has to implement its own basic file readers/writers.

### Separated Expression optimization
Currently, the pushdown optimizations of different engines share little in the integration between Expressions and MDT indices. For instance, `HoodieFileIndex` supports partition pruning and data filtering for Spark scans, and it relies heavily on Spark DataFrame operations to evaluate Expressions against index stats data (`SparkBaseIndexSupport`). Meanwhile, for the Flink engine, a separate Expression evaluator is implemented to evaluate pushed-down Flink Expressions against MDT stats data. Quite a few index-data processing logics are common between the Flink FileIndex and the Spark FileIndex but are implemented separately, which complicates both the development and the maintenance of the index-optimization functionality.
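To make the shared logic concrete, here is a minimal sketch of what an engine-agnostic expression evaluated against MDT column-stats data could look like; all names (`Expression`, `FileColumnStats`, `GreaterThan`, `And`) are hypothetical and not part of the current codebase:

```java
/**
 * Hypothetical engine-agnostic predicate, evaluated against per-file
 * min/max column stats from the metadata table to decide file pruning.
 */
interface Expression {
  /** True if a file with these stats may contain matching rows; false means prune. */
  boolean mayMatch(FileColumnStats stats);
}

/** Min/max statistics for one column of one file (illustrative, longs only). */
class FileColumnStats {
  final String column;
  final long min;
  final long max;

  FileColumnStats(String column, long min, long max) {
    this.column = column;
    this.min = min;
    this.max = max;
  }
}

/** column > literal: the file can only match if its max exceeds the literal. */
class GreaterThan implements Expression {
  private final String column;
  private final long literal;

  GreaterThan(String column, long literal) {
    this.column = column;
    this.literal = literal;
  }

  @Override
  public boolean mayMatch(FileColumnStats stats) {
    // Stats for a different column tell us nothing, so stay conservative.
    return !stats.column.equals(column) || stats.max > literal;
  }
}

/** Conjunction: a file survives only if both sides consider it a possible match. */
class And implements Expression {
  private final Expression left;
  private final Expression right;

  And(Expression left, Expression right) {
    this.left = left;
    this.right = right;
  }

  @Override
  public boolean mayMatch(FileColumnStats stats) {
    return left.mayMatch(stats) && right.mayMatch(stats);
  }
}
```

With such an abstraction, each engine would only translate its own expression tree (Spark Catalyst filters, Flink `ResolvedExpression`s) into this form once, while the stats evaluation itself lives in hudi-common.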
## Goals
To make the scope of this RFC clearer, we emphasize the goals here:
* Unify the Avro schema and the internal schema into one, and clean up the related code.
* Abstract out the basic writer/reader functionality based on the data type abstraction, so that each engine only implements the getter/setter logic; this eliminates the unnecessary Ser/De cost on the Avro write/read path.
* Improve the expression abstractions and their integration with the Hudi file index and MDT indices, so that all these optimizations can be shared between engines. Specifically, this should be used for expression indexes.

## Design/Implementation

### Abstraction of new Schema
As elaborated above, the basic data types and `InternalSchema` were introduced to support comprehensive schema evolution, and `InternalSchema#RecordType` can be losslessly converted to/from the Avro schema.

Currently, `InternalSchema` is designed for internal use, and we propose a new `Schema` abstraction to serve as a public API, while retaining the existing capabilities of `InternalSchema`.
```java
/**
 * Definition of the schema for a Hudi table.
 */
@PublicAPIClass
public class Schema implements Serializable {
  private final RecordType record;
  private int maxColumnId;
  private long versionId;
  private List<String> recordKeys;
  private List<String> partitionKeys;

  private transient Map<Integer, Field> idToField = null;
  private transient Map<String, Integer> nameToId = null;
  private transient Map<Integer, String> idToName = null;
  private transient Map<String, Integer> nameToPosition = null;

  public Schema(List<Field> fields, long versionId, int maxColumnId, List<String> recordKeys, List<String> partitionKeys) {}

  // return record type
  public RecordType recordType() {}
  // return all fields
  public List<Field> fields() {}
  // return names of record key fields
  public List<String> recordKeys() {}
  // return names of partition fields
  public List<String> partitionKeys() {}

  public static class Builder {
    private final List<Field> fields;
    ...
    private List<String> partitionKeys;

    public Builder addField(Field field) {..}
    ...
    public Schema build() {
      return new Schema(fields, versionId, maxColumnId, recordKeys, partitionKeys);
    }
  }
}

/**
 * Definition of the type of a record in Hudi.
 */
@PublicAPIClass
public class RecordType implements Type {
  // the name is necessary for lossless conversion b/w Avro and Schema
  private final String name;
  private final Field[] fields;

  public RecordType(List<Field> fields) {}
  public RecordType(String name, List<Field> fields) {}

  public List<Field> fields() {}
  public Field field(String name) {}
  public Field field(int id) {}
}

/**
 * Definition of a column/field in RecordType.
 */
@PublicAPIClass
public class Field implements Serializable {
  private final boolean isOptional;
  private final int id;
  private final String name;
  private final Type type;
  private final String doc;
  private final Object defaultValue;

  private Field(boolean isOptional, int id, String name, Type type, String doc, Object defaultValue) {}

  public static class Builder {
    private final String name;
    ...
    private final Object defaultValue;

    public Builder(String name) {
      this.name = name;
    }

    public Builder withId(int id) {..}
    ...
    public Field build() {
      return new Field(isOptional, id, name, type, doc, defaultValue);
    }
  }
}
```
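To illustrate the intended call pattern, here is a hypothetical usage sketch; it assumes the elided builder setters (e.g. `withType`/`withIsOptional` on `Field.Builder`, `withVersionId`/`withMaxColumnId`/`withRecordKeys` on `Schema.Builder`) exist as the `...` placeholders above imply:

```java
// Illustrative only: relies on the elided builder methods from the sketch above.
// Types.StringType.get() / Types.DoubleType.get() are assumed primitive-type
// accessors, modeled after the existing internal-schema Types utility.
Field uuid = new Field.Builder("uuid")
    .withId(1)
    .withType(Types.StringType.get())
    .build();
Field fare = new Field.Builder("fare")
    .withId(2)
    .withType(Types.DoubleType.get())
    .withIsOptional(true)
    .build();

Schema schema = new Schema.Builder()
    .addField(uuid)
    .addField(fare)
    .withVersionId(1L)
    .withMaxColumnId(2)
    .withRecordKeys(Collections.singletonList("uuid"))  // java.util.Collections
    .build();
```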
For the first stated goal, the following changes are proposed:
* Substitute the Avro schema with `Schema` in commit metadata.
  - Persistence: SCHEMA_KEY: Avro schema string -> `Schema` string
  - Resolving compatibility: first try resolving as `Schema`, and fall back to the Avro schema for compatibility.
* Substitute the Avro schema with `Schema` in the reader/writer paths of all engines.
  - Replace schemas in configuration.
  - Replace schemas in the reader/writer building process.
* Substitute the Avro schema with `Schema` in log blocks.

### Abstraction of Writers/Readers

Review Comment:
   The abstraction of Write and Read is crucial. However, IMO, maybe we need to further unify the abstraction of file formats and integrate the operations (i.e., reading and writing) related to them. Currently, Hudi supports multiple formats such as Avro, Parquet, ORC, and HFile, and more will be added in the future. Here, we can abstract FileFormat, FileFormatFactory, FormatWriter, FormatWriterFactory, etc. in the common layer. Meanwhile, we can abstract a new module `hudi-format` to implement specific Readers and Writers for different formats. We can also consider integrating Arrow in the future.
   <img width="405" alt="Screenshot 2025-02-11 15 12 57" src="https://github.com/user-attachments/assets/a9cf2bfb-b7f0-4a34-9679-e11e1756172b" />
   <img width="400" alt="Screenshot 2025-02-11 15 13 26" src="https://github.com/user-attachments/assets/9ce2e8ef-c738-4747-8a42-f7ffb2d6372c" />
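To make the suggestion concrete, a rough sketch of what such common-layer interfaces might look like; all signatures are illustrative, not a proposal of exact APIs, and `Schema` refers to the abstraction proposed in the RFC:

```java
/** One row-oriented reader over a single file; T is the engine row type. */
interface FormatReader<T> extends AutoCloseable {
  boolean hasNext();
  T next();
}

/** One row-oriented writer for a single file; T is the engine row type. */
interface FormatWriter<T> extends AutoCloseable {
  void write(T row);
  void flush();
}

/** A storage format (parquet, orc, hfile, avro, ...) known to the common layer. */
interface FileFormat {
  String name();
  <T> FormatReader<T> createReader(String filePath, Schema readSchema);
  <T> FormatWriter<T> createWriter(String filePath, Schema writeSchema);
}

/** Resolves FileFormat implementations, e.g. discovered from a `hudi-format` module. */
interface FileFormatFactory {
  FileFormat forName(String formatName);
}
```

Engines would then plug in only the row-type-specific getters/setters, while the format wiring stays shared in the common layer.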
<img width="405" alt="截屏2025-02-11 15 12 57" src="https://github.com/user-attachments/assets/a9cf2bfb-b7f0-4a34-9679-e11e1756172b" /> <img width="400" alt="截屏2025-02-11 15 13 26" src="https://github.com/user-attachments/assets/9ce2e8ef-c738-4747-8a42-f7ffb2d6372c" /> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
