Re: [PR] [HUDI-9730] RFC-99 Hudi Type System [hudi]

via GitHub Wed, 27 Aug 2025 17:29:17 -0700


bvaradar commented on code in PR #13743:
URL: https://github.com/apache/hudi/pull/13743#discussion_r2305721166



##########
rfc/rfc-99/rfc-99.md:
##########
@@ -0,0 +1,219 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+# RFC-99: Hudi Type System
+
+## Proposers
+
+- @bvaradar
+
+## Approvers
+
+- @vinothchandar
+
+
+## Status
+
+Umbrella ticket: [HUDI-9730](https://issues.apache.org/jira/browse/HUDI-9730)
+
+
+## Abstract
+The main goal is to propose a native Hudi type system as the authoritative 
representation for Hudi data types, making the system more extensible and the 
semantics of data types clear and unified. While Hudi currently uses Avro for 
schema representation, introducing a more comprehensive, Arrow-based type 
system will make it easier to provide consistent handling and implementation of 
data types across different engines and improve support for modern data 
paradigms like multi-modal and semi-structured data.
+
+There is an [earlier attempt](https://github.com/apache/hudi/pull/12795/files) to 
define a common schema, but it was geared towards building more general 
abstractions. This RFC revisits the specific need for a type system model 
that makes Hudi more extensible and supports non-traditional 
use cases.
+   
+## Background
+Apache Hudi currently uses Apache Avro as the canonical representation for its 
schema. While this has served the project well, introducing a native, 
engine-agnostic type system offers a strategic opportunity to evolve Hudi's 
core abstractions for the future. The primary motivations for this evolution 
are:
+
+- A common type system allows us to build richer functionality and a common 
interface through which engines and non-JVM clients can interact with Hudi data 
directly and efficiently.
+- A native type system provides a formal framework for introducing new, 
complex data types. This will accelerate Hudi's ability to offer first-class 
support for emerging use cases in AI/ML (vectors, tensors) and semi-structured 
data analysis (VARIANT), keeping Hudi at the forefront of data lakehouse 
technology.
+- By standardizing on an Arrow-compatible in-memory format, Hudi can avoid 
costly serialization and deserialization steps when exchanging data with a 
growing number of Arrow-native tools and engines. This enables zero-copy data 
access, which can improve performance on both read and write paths.
+
+## Design
+
+The canonical in-memory representation for all types will be based on the 
Apache Arrow specification. The main reasons are:
+
+- Apache Arrow provides a standard in-memory format that eliminates the costly 
process of data serialization and deserialization when moving data across 
system boundaries. This enables "zero-copy" data exchange, which radically 
reduces computational overhead and query latency.
+- This makes it easier to achieve seamless data exchange with the ecosystem of 
Arrow-native tools.
+- Query engines have good support for the Arrow type system, which is itself 
multi-modal. This aligns with our goal of providing first-class multi-modal 
type system support.
+
+The proposed type system will be implemented such that the in-memory layout is 
compatible with Apache Arrow to get the performance benefits.
+
+ 
+### **Type Specification**
+
+The sections below define the types that will be supported and, finally, 
how they map to other systems' data types.
+ 
+#### **3.1. Primitive Types**
+
+These are the fundamental scalar types that form the basis of the type system.
+This includes standard signed integers in 8, 16, 32, and 64-bit widths 
(TINYINT, SMALLINT, INTEGER, BIGINT), as well as floating-point numbers like 
FLOAT and DOUBLE. The system also provides types for BOOLEAN, DECIMAL, STRING, 
BINARY, FIXED, and UUID. A notable addition in the new proposal is the explicit 
support for unsigned integer types (UINT8, UINT16, UINT32, UINT64) to enhance 
data fidelity and accommodate a wider range of use cases. A half-precision 
FLOAT16 is also introduced to support AI/ML workloads.
+
+| Logical Type | Description | Parameters |
+| :---- | :---- | :---- |
+| BOOLEAN | A logical boolean value (true/false). | None |
+| TINYINT | An 8-bit signed integer. | None |
+| UINT8 | An 8-bit **unsigned** integer. | None |
+| SMALLINT | A 16-bit signed integer. | None |
+| UINT16 | A 16-bit **unsigned** integer. | None |
+| INTEGER | A 32-bit signed integer. | None |
+| UINT32 | A 32-bit **unsigned** integer. | None |
+| BIGINT | A 64-bit signed integer. | None |
+| UINT64 | A 64-bit **unsigned** integer. | None |
+| FLOAT16 | A 16-bit half-precision floating-point number. | None |
+| FLOAT | A 32-bit single-precision floating-point number. | None |
+| DOUBLE | A 64-bit double-precision floating-point number. | None |
+| DECIMAL(p, s) | An exact numeric with specified precision/scale. | p, s |
+| STRING | A variable-length UTF-8 character string, limited to 2GB per value. | None |
+| LARGE\_STRING | A variable-length UTF-8 character string for values exceeding 2GB. | None |
+| BINARY | A variable-length sequence of bytes, limited to 2GB per value. | None |
+| LARGE\_BINARY | A variable-length sequence of bytes for values exceeding 2GB. | None |
+| FIXED(n) | A fixed-length sequence of n bytes. | n |
+| UUID | A 128-bit universally unique identifier. | None |
+
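As a rough illustration of why the table above lists unsigned integers and FLOAT16 as separate types, the following Python sketch uses only the standard library. The helper `int_range` is hypothetical and not part of Hudi or the RFC; it only shows the value ranges implied by the stated bit widths.

```python
import struct


def int_range(bits, signed):
    """Hypothetical helper: (min, max) for an integer type of the given width."""
    if signed:
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1


bigint_min, bigint_max = int_range(64, signed=True)   # BIGINT
_, uint64_max = int_range(64, signed=False)           # UINT64

# The largest UINT64 value lies outside the BIGINT range.
fits_in_bigint = bigint_min <= uint64_max <= bigint_max

# Reinterpreting the same 8 bytes as a signed integer changes the value,
# which is the kind of fidelity loss an explicit unsigned type avoids.
reinterpreted = struct.unpack("<q", struct.pack("<Q", uint64_max))[0]

# Round-tripping through IEEE 754 half precision (struct format "e")
# shows the reduced precision of a 16-bit float.
half = struct.unpack("<e", struct.pack("<e", 0.1))[0]

print(fits_in_bigint, reinterpreted, half)
```

Without an unsigned type, a writer would have to widen the value or accept a reinterpretation like the one shown here.
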
+#### **3.2. Temporal Types**
+
+These types handle date and time representations with high precision and 
timezone awareness.
+
+| Logical Type | Description | Parameters |
+| :---- | :---- | :---- |
+| DATE | A calendar date (year, month, day). | None |
+| DATE64 | A calendar date stored as milliseconds since the Unix epoch. | None |
+| TIME(precision) | A time of day without a timezone. | s, ms, us, ns |
+| TIMESTAMP(precision) | An instant in time without a timezone. | us or ns |
+| TIMESTAMPTZ(precision) | An instant in time with a timezone, normalized and stored as UTC. | us or ns |
+| DURATION(unit) | An exact physical time duration, independent of calendars. | s, ms, us, ns |
+| INTERVAL | Represents a duration of time (e.g., months, days, milliseconds). | None |
+
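The "normalized and stored as UTC" wording for TIMESTAMPTZ can be sketched in Python as follows. This assumes the value is kept as a signed 64-bit offset from the Unix epoch, as Arrow timestamps are; the excerpt above does not spell out the physical encoding, and `to_utc_micros` is a hypothetical helper, not a Hudi API.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_utc_micros(instant):
    """Hypothetical helper: UTC microseconds since the epoch for an aware datetime."""
    if instant.tzinfo is None:
        raise ValueError("a TIMESTAMPTZ value needs a timezone-aware datetime")
    delta = instant.astimezone(timezone.utc) - EPOCH
    # Dividing two timedeltas with // yields an integer count of microseconds.
    return delta // timedelta(microseconds=1)


# The same instant written in two zones normalizes to one stored value.
utc_noon = datetime(2025, 8, 27, 12, 0, 0, tzinfo=timezone.utc)
pacific = datetime(2025, 8, 27, 5, 0, 0, tzinfo=timezone(timedelta(hours=-7)))
same_instant = to_utc_micros(utc_noon) == to_utc_micros(pacific)

# Under the 64-bit assumption, nanosecond precision covers a few hundred
# years on either side of the epoch, which is the trade-off against "us".
NS_PER_YEAR = 365.25 * 24 * 3600 * 1_000_000_000
ns_range_years = (2**63 - 1) / NS_PER_YEAR

print(same_instant, round(ns_range_years))
```
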
+#### **3.3. Composite Types**
+
+These types allow for the creation of complex, nested data structures.
+
+| Logical Type | Description | Parameters |
+| :---- | :---- | :---- |
+| STRUCT\<name: type, ...\> | An ordered collection of named fields. | Field list |
+| LIST\<element\_type\> | An ordered list of elements of the same type. | Element type |
+| MAP\<key\_type, value\_type\> | A collection of key-value pairs. Keys must be unique. | Key, Value types |
+| UNION\<type1, type2, ...\> | A value that can be one of several specified types. | Type list |

Review Comment:
   Don't think so. Kindly look at the above comment. I am not taking this as an 
objective. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

