hemanthsavasere opened a new pull request, #2288:
URL: https://github.com/apache/fluss/pull/2288
### Purpose
Linked issue: close #2186
### Brief change log
This PR implements comprehensive support for Apache Arrow `ARRAY` data types
in the Lance data lake integration, enabling Fluss to write array columns
(e.g., `ARRAY<INT>`, `ARRAY<STRING>`, nested arrays) to Lance datasets.
Previously, the Lance integration only supported primitive types and
strings. Users couldn't write tables with array columns to Lance, limiting the
types of data that could be stored in the Lance data lake format. This is a
critical gap for users working with complex data structures like:
- Log events with tag arrays
- User profiles with list-type fields
- Time series data with array-valued metrics
- Any schema requiring nested or repeated data
**Key Implementation Notes**
1. **Separate element writers per type:** Following the existing pattern in
`fluss-common`, each primitive type has its own array element writer. This
provides type safety and mirrors the Arrow type system.
2. **Generic `ArrowFieldWriter<InternalArray>`:** Array element writers
extend `ArrowFieldWriter<InternalArray>` to work with array data, distinct from
row-based writers.
3. **Count-based positioning:** Element writers use `getCount()` to track the
current vector position, which the `write()` method increments automatically.
This is critical for correct array element placement.
4. **Nested array support:** `ArrowNestedListWriter` composes element
writers recursively, enabling arbitrary nesting depth.
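
The count-based positioning in note 3 can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: `FieldWriter`, `IntElementWriter`, and the `List<Integer>` standing in for an Arrow vector are hypothetical names chosen for illustration. The only behavior taken from the description is that `write()` advances the count while `doWrite()` does not.

```java
import java.util.ArrayList;
import java.util.List;

public class CountBasedWriterSketch {

    /** Hypothetical stand-in for a field writer; not the actual Fluss class. */
    abstract static class FieldWriter<T> {
        private int count = 0;

        /** Public entry point: writes at the current position, then advances it. */
        final void write(T source, int ordinal) {
            doWrite(source, ordinal);
            count++;
        }

        /** Type-specific write; uses getCount() as the target index and never advances it. */
        abstract void doWrite(T source, int ordinal);

        int getCount() {
            return count;
        }
    }

    /** Writes one int element per call into a list that stands in for an Arrow vector. */
    static final class IntElementWriter extends FieldWriter<int[]> {
        final List<Integer> vector = new ArrayList<>();

        @Override
        void doWrite(int[] array, int ordinal) {
            int index = getCount();
            while (vector.size() <= index) {
                vector.add(null); // grow the stand-in vector up to the target slot
            }
            vector.set(index, array[ordinal]);
        }
    }

    /** Goes through write(), so every element lands in its own slot. */
    static List<Integer> writeAll(int[]... arrays) {
        IntElementWriter writer = new IntElementWriter();
        for (int[] array : arrays) {
            for (int i = 0; i < array.length; i++) {
                writer.write(array, i);
            }
        }
        return writer.vector;
    }

    /** Calls doWrite() directly, so the position stays at 0 and elements overwrite each other. */
    static List<Integer> writeAllBuggy(int[]... arrays) {
        IntElementWriter writer = new IntElementWriter();
        for (int[] array : arrays) {
            for (int i = 0; i < array.length; i++) {
                writer.doWrite(array, i);
            }
        }
        return writer.vector;
    }
}
```

In this sketch, writing `[1, 2]` followed by `[3]` through `write()` fills three consecutive slots, while the `doWrite()` variant leaves a single slot holding only the last element.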
**Type System Integration:**
- Extended `LanceArrowUtils.toArrowField()` to handle `ArrayType` and
generate proper Arrow `List` fields with element type children
- Added `ArrayType` visitor in `DataTypeToArrowTypeConverter` to map to
`ArrowType.List.INSTANCE`
- Updated `createFieldWriter()` to recognize `ListVector` and create
appropriate array writers
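
The recursive shape of the type mapping above can be sketched without the Arrow dependency. In Arrow, a `List` type carries its element type as a single child field; everything else here is an assumption for illustration: `LogicalType`, `FieldSketch`, and the child name `element` are hypothetical and are not the names used by `LanceArrowUtils`.

```java
import java.util.Collections;
import java.util.List;

public class ListFieldMappingSketch {

    /** Hypothetical, simplified logical type: a primitive name or an array of an element type. */
    static final class LogicalType {
        final String name;         // e.g. "INT", "STRING", or "ARRAY"
        final LogicalType element; // non-null only for arrays

        private LogicalType(String name, LogicalType element) {
            this.name = name;
            this.element = element;
        }

        static LogicalType primitive(String name) {
            return new LogicalType(name, null);
        }

        static LogicalType array(LogicalType element) {
            return new LogicalType("ARRAY", element);
        }

        boolean isArray() {
            return element != null;
        }
    }

    /** Hypothetical stand-in for an Arrow field: a name, a type label, and child fields. */
    static final class FieldSketch {
        final String name;
        final String arrowType;
        final List<FieldSketch> children;

        FieldSketch(String name, String arrowType, List<FieldSketch> children) {
            this.name = name;
            this.arrowType = arrowType;
            this.children = children;
        }

        @Override
        public String toString() {
            String self = name + ":" + arrowType;
            return children.isEmpty() ? self : self + children;
        }
    }

    /** Arrays become a List field with exactly one child; recursion handles nesting. */
    static FieldSketch toField(String name, LogicalType type) {
        if (type.isArray()) {
            // "element" is a placeholder child name chosen for this sketch.
            FieldSketch child = toField("element", type.element);
            return new FieldSketch(name, "List", Collections.singletonList(child));
        }
        return new FieldSketch(name, type.name, Collections.<FieldSketch>emptyList());
    }
}
```

Because the array branch recurses on the element type, `ARRAY<ARRAY<STRING>>` yields a `List` field whose child is itself a `List` field, with no special case needed for nesting.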
**Array Field Writers:**
- **`ArrowListWriter`** - Main writer for array fields, handles array
iteration and delegates to element writers
- **`ArrowNestedListWriter`** - Specialized writer for nested arrays (e.g.,
`ARRAY<ARRAY<INT>>`)
- **15 Array Element Writers** in `writers/array/` package supporting all
Fluss primitive types:
- Numeric: `ArrowArrayIntWriter`, `ArrowArrayBigIntWriter`,
`ArrowArraySmallIntWriter`, `ArrowArrayTinyIntWriter`, `ArrowArrayFloatWriter`,
`ArrowArrayDoubleWriter`
- String/Binary: `ArrowArrayVarCharWriter`, `ArrowArrayVarBinaryWriter`,
`ArrowArrayBinaryWriter`
- Temporal: `ArrowArrayDateWriter`, `ArrowArrayTimeWriter`,
`ArrowArrayTimestampLtzWriter`, `ArrowArrayTimestampNtzWriter`
- Other: `ArrowArrayBooleanWriter`, `ArrowArrayDecimalWriter`
**Implementation Details:**
- Array element writers use `getCount()` for vector position tracking
- Proper handling of null arrays, empty arrays, and arrays with null elements
- Support for nested arrays via recursive writer composition
- Fixed critical bug: Element writers must call `write()` (not `doWrite()`)
to properly increment position count
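
To make the null, empty, and null-element cases concrete, the sketch below models the Arrow list layout with plain Java collections: one validity flag per row, an offsets list with one more entry than there are rows, and a flattened child values list. It is a simplified illustration, not the writer code from this PR; the class and method names are hypothetical, and a `null` entry in `values` stands in for the child vector's own validity bit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ListLayoutSketch {
    /** offsets.get(i) .. offsets.get(i + 1) delimits row i inside values. */
    final List<Integer> offsets = new ArrayList<>(Arrays.asList(0));
    /** One flag per row: false means the whole array is null. */
    final List<Boolean> validity = new ArrayList<>();
    /** Flattened elements of all rows; a null entry models a null element. */
    final List<Integer> values = new ArrayList<>();

    /** Appends one row; a null argument models a null array. */
    void append(Integer[] array) {
        if (array != null) {
            values.addAll(Arrays.asList(array));
        }
        validity.add(array != null);
        // A null or empty array adds no values, so its start and end offsets are equal.
        offsets.add(values.size());
    }

    /** Convenience builder for a sequence of rows. */
    static ListLayoutSketch of(Integer[]... rows) {
        ListLayoutSketch layout = new ListLayoutSketch();
        for (Integer[] row : rows) {
            layout.append(row);
        }
        return layout;
    }
}
```

A null array and an empty array both occupy zero child slots, so only the validity flag tells them apart. An array containing a null element is a valid row whose child slot is marked null.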
### Tests
**Integration Tests (5 new tests in `LanceTieringTest`):**
1. `testArrayTypeInt` - Tests `ARRAY<INT>` with 10 rows of varying-length
arrays
2. `testArrayTypeString` - Tests `ARRAY<STRING>` with BinaryString arrays
3. `testArrayTypeNullable` - Tests null arrays, empty arrays, and arrays
containing null elements
4. `testNestedArrayType` - Tests `ARRAY<ARRAY<INT>>` with 2-level nesting:
`[[1,2], [3,4,5], [6]]`
5. `testMultiplePrimitiveArrayTypes` - Tests 5 different array types in a
single schema (INT, BIGINT, DOUBLE, BOOLEAN, DATE)
**Unit Tests (3 new tests in `ArrowListWriterTest`):**
1. `testWriteSimpleArray` - Tests writing `[1, 2, 3]`
2. `testWriteNullArray` - Tests null array handling
3. `testWriteEmptyArray` - Tests empty array `[]` handling
### API and Format
None
### Documentation
Yes. This introduces a major new feature.
Feature: Array Type Support for Lance Data Lake
What's New:
Before this change:
- Users could NOT write tables with ARRAY columns to Lance
- Attempting to use `ARRAY<INT>`, `ARRAY<STRING>`, etc. would throw
`UnsupportedOperationException`
- Only primitive types (INT, STRING, DATE, etc.) were supported
After this change:
- Users CAN write tables with ARRAY columns to Lance
- Support for all array types: `ARRAY<INT>`, `ARRAY<STRING>`, `ARRAY<DATE>`, etc.
- Support for nested arrays: `ARRAY<ARRAY<INT>>`
- Proper handling of null arrays, empty arrays, and arrays with null
elements
Feature Scope:
Supported Array Element Types (15 total):
- Numeric arrays: INT, BIGINT, SMALLINT, TINYINT, FLOAT, DOUBLE
- String/Binary arrays: VARCHAR, VARBINARY, BINARY
- Temporal arrays: DATE, TIME, TIMESTAMP_LTZ, TIMESTAMP_NTZ
- Other: BOOLEAN, DECIMAL
- Nested (in addition to the 15 above): `ARRAY<ARRAY<INT>>`
Example Use Cases Enabled:

    // User event logging with tags
    Schema.newBuilder()
        .column("event_id", DataTypes.INT())
        .column("tags", DataTypes.ARRAY(DataTypes.STRING()))
        .build();

    // Time series with array metrics
    Schema.newBuilder()
        .column("timestamp", DataTypes.TIMESTAMP_LTZ())
        .column("sensor_readings", DataTypes.ARRAY(DataTypes.DOUBLE()))
        .build();

    // Complex nested data
    Schema.newBuilder()
        .column("user_id", DataTypes.INT())
        .column("session_paths",
            DataTypes.ARRAY(DataTypes.ARRAY(DataTypes.STRING())))
        .build();