hemanthsavasere opened a new pull request, #2288:
URL: https://github.com/apache/fluss/pull/2288

   ### Purpose
   
   Linked issue: close #2186 
   
   ### Brief change log
   
   This PR adds support for Fluss `ARRAY` data types in the Lance data lake 
integration by mapping them to Arrow `List` fields, enabling Fluss to write 
array columns (e.g., `ARRAY<INT>`, `ARRAY<STRING>`, nested arrays) to Lance 
datasets.
   
   Previously, the Lance integration only supported primitive types and 
strings. Users couldn't write tables with array columns to Lance, limiting the 
types of data that could be stored in the Lance data lake format. This is a 
critical gap for users working with complex data structures like:
   - Log events with tag arrays
   - User profiles with list-type fields
   - Time series data with array-valued metrics
   - Any schema requiring nested or repeated data
   
   **Key Implementation Notes** 
   1. **Separate element writers per type:** Following the existing pattern in 
`fluss-common`, each primitive type has its own array element writer. This 
provides type safety and mirrors the Arrow type system.
   
   2. **Generic `ArrowFieldWriter<InternalArray>`:** Array element writers 
extend `ArrowFieldWriter<InternalArray>` to work with array data, distinct from 
row-based writers.
   
   3. **Count-based positioning:** Element writers use `getCount()` to track 
the vector position; the count auto-increments on each call to `write()`. This 
is critical for correct array element placement.
   
   4. **Nested array support:** `ArrowNestedListWriter` composes element 
writers recursively, enabling arbitrary nesting depth.
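
The recursive composition in note 4 can be illustrated with a small, self-contained sketch. It does not use the Arrow or Fluss APIs: `ElementWriterSketch`, `IntElementWriterSketch`, and `ListElementWriterSketch` are hypothetical names invented for illustration, and plain Java lists stand in for Arrow buffers. It only assumes the standard Arrow list layout, in which a list column stores `n + 1` offsets, a validity flag per list, and a flat child holding the elements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical element writer: appends one value to some backing storage. */
interface ElementWriterSketch {
    void write(Object value);
}

/** Leaf writer: collects INT elements into a flat child buffer (null marks a null element). */
final class IntElementWriterSketch implements ElementWriterSketch {
    final List<Integer> values = new ArrayList<>();

    @Override
    public void write(Object value) {
        values.add((Integer) value);
    }
}

/** List writer: delegates each element to its child, then records validity and the end offset. */
final class ListElementWriterSketch implements ElementWriterSketch {
    final List<Integer> offsets = new ArrayList<>(List.of(0)); // n + 1 entries for n lists
    final List<Boolean> validity = new ArrayList<>();          // false marks a null list
    private final ElementWriterSketch child;
    private int childCount = 0;

    ListElementWriterSketch(ElementWriterSketch child) {
        this.child = child;
    }

    @Override
    public void write(Object value) {
        if (value != null) {
            for (Object element : (List<?>) value) {
                child.write(element); // the child may itself be a list writer
                childCount++;
            }
        }
        validity.add(value != null);
        offsets.add(childCount); // null and empty lists repeat the previous offset
    }
}

final class NestedListSketch {
    public static void main(String[] args) {
        IntElementWriterSketch leaf = new IntElementWriterSketch();
        ListElementWriterSketch inner = new ListElementWriterSketch(leaf);
        ListElementWriterSketch outer = new ListElementWriterSketch(inner);

        // One ARRAY<ARRAY<INT>> value: [[1, 2], [3, 4, 5], [6]]
        outer.write(Arrays.asList(
                Arrays.asList(1, 2), Arrays.asList(3, 4, 5), Arrays.asList(6)));

        System.out.println("outer offsets = " + outer.offsets); // [0, 3]
        System.out.println("inner offsets = " + inner.offsets); // [0, 2, 5, 6]
        System.out.println("leaf values   = " + leaf.values);   // [1, 2, 3, 4, 5, 6]
    }
}
```

Because a list writer is itself an element writer, wrapping one more `ListElementWriterSketch` around `outer` would handle a third nesting level without any new code. A null array and an empty array both repeat the previous offset and differ only in the validity flag.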
   
   
   **Type System Integration:**
   - Extended `LanceArrowUtils.toArrowField()` to handle `ArrayType` and 
generate proper Arrow `List` fields with element type children
   - Added `ArrayType` visitor in `DataTypeToArrowTypeConverter` to map to 
`ArrowType.List.INSTANCE`
   - Updated `createFieldWriter()` to recognize `ListVector` and create 
appropriate array writers
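
As a rough illustration of the type mapping above, the following self-contained sketch converts a toy data type into a toy field tree, recursing into the element type of an array. `TypeSketch`, `FieldSketch`, and `TypeMappingSketch` are hypothetical stand-ins, not the Fluss or Arrow classes; the child name `element` and the simplified type labels are assumptions made for readability, not what `LanceArrowUtils.toArrowField()` necessarily emits.

```java
import java.util.List;

/** Toy stand-in for a data type: a primitive name, or ARRAY of an element type. */
final class TypeSketch {
    final String name;        // e.g. "INT", "STRING", or "ARRAY"
    final TypeSketch element; // non-null only for ARRAY

    private TypeSketch(String name, TypeSketch element) {
        this.name = name;
        this.element = element;
    }

    static TypeSketch primitive(String name) {
        return new TypeSketch(name, null);
    }

    static TypeSketch array(TypeSketch element) {
        return new TypeSketch("ARRAY", element);
    }
}

/** Toy stand-in for an Arrow field: a name, a type label, and child fields. */
final class FieldSketch {
    final String name;
    final String arrowType;
    final List<FieldSketch> children;

    FieldSketch(String name, String arrowType, List<FieldSketch> children) {
        this.name = name;
        this.arrowType = arrowType;
        this.children = children;
    }

    @Override
    public String toString() {
        return name + ": " + arrowType + (children.isEmpty() ? "" : children.toString());
    }
}

final class TypeMappingSketch {
    /** An array becomes a "List" field with exactly one child describing its elements. */
    static FieldSketch toField(String name, TypeSketch type) {
        if (type.element != null) {
            // Assumed child name; recursion handles ARRAY<ARRAY<...>> at any depth.
            FieldSketch child = toField("element", type.element);
            return new FieldSketch(name, "List", List.of(child));
        }
        return new FieldSketch(name, type.name, List.of());
    }

    public static void main(String[] args) {
        TypeSketch tags = TypeSketch.array(TypeSketch.primitive("STRING"));
        System.out.println(toField("tags", tags)); // tags: List[element: STRING]
    }
}
```

The point of the sketch is only that the list type itself carries no element information; the element type lives in the child field, which is why the conversion has to build children recursively.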
   
   
   **Array Field Writers:**
   - **`ArrowListWriter`** - Main writer for array fields, handles array 
iteration and delegates to element writers
   - **`ArrowNestedListWriter`** - Specialized writer for nested arrays (e.g., 
`ARRAY<ARRAY<INT>>`)
   - **15 Array Element Writers** in the `writers/array/` package supporting 
all Fluss primitive types:
     - Numeric: `ArrowArrayIntWriter`, `ArrowArrayBigIntWriter`, 
`ArrowArraySmallIntWriter`, `ArrowArrayTinyIntWriter`, `ArrowArrayFloatWriter`, 
`ArrowArrayDoubleWriter`
     - String/Binary: `ArrowArrayVarCharWriter`, `ArrowArrayVarBinaryWriter`, 
`ArrowArrayBinaryWriter`
     - Temporal: `ArrowArrayDateWriter`, `ArrowArrayTimeWriter`, 
`ArrowArrayTimestampLtzWriter`, `ArrowArrayTimestampNtzWriter`
     - Other: `ArrowArrayBooleanWriter`, `ArrowArrayDecimalWriter`
   
   **Implementation Details:**
   - Array element writers use `getCount()` for vector position tracking
   - Proper handling of null arrays, empty arrays, and arrays with null elements
   - Support for nested arrays via recursive writer composition
   - Fixed critical bug: Element writers must call `write()` (not `doWrite()`) 
to properly increment position count
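
The `write()` versus `doWrite()` distinction can be shown with a simplified model. This is a sketch under assumptions, not the Fluss `ArrowFieldWriter` itself: the class names are hypothetical, the signatures are reduced to a single `int` argument, and an `int[]` stands in for the Arrow vector.

```java
import java.util.Arrays;

/** Hypothetical base class modelling the write()/doWrite() split. */
abstract class CountingWriterSketch {
    private int count = 0;

    int getCount() {
        return count;
    }

    /** Entry point: writes at the current position, then advances the position. */
    final void write(int value) {
        doWrite(value);
        count++;
    }

    /** Type-specific write; does NOT advance the position on its own. */
    abstract void doWrite(int value);
}

/** Writes each value into the slot selected by the running count. */
final class IntSlotWriterSketch extends CountingWriterSketch {
    final int[] slots;

    IntSlotWriterSketch(int capacity) {
        this.slots = new int[capacity];
    }

    @Override
    void doWrite(int value) {
        slots[getCount()] = value;
    }
}

final class CountPositioningSketch {
    public static void main(String[] args) {
        IntSlotWriterSketch correct = new IntSlotWriterSketch(3);
        for (int v : new int[] {1, 2, 3}) {
            correct.write(v); // count advances 0 -> 1 -> 2 -> 3
        }
        System.out.println(Arrays.toString(correct.slots)); // [1, 2, 3]

        IntSlotWriterSketch buggy = new IntSlotWriterSketch(3);
        for (int v : new int[] {1, 2, 3}) {
            buggy.doWrite(v); // count stays 0, so slot 0 is overwritten each time
        }
        System.out.println(Arrays.toString(buggy.slots)); // [3, 0, 0]
    }
}
```

In this model, bypassing `write()` silently collapses every element onto position 0, which matches the kind of misplacement the fix above is described as preventing.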
   
   
   
   ### Tests
   
   **Integration Tests (5 new tests in `LanceTieringTest`):**
   1. `testArrayTypeInt` - Tests `ARRAY<INT>` with 10 rows of varying-length 
arrays
   2. `testArrayTypeString` - Tests `ARRAY<STRING>` with BinaryString arrays
   3. `testArrayTypeNullable` - Tests null arrays, empty arrays, and arrays 
containing null elements
   4. `testNestedArrayType` - Tests `ARRAY<ARRAY<INT>>` with 2-level nesting: 
`[[1,2], [3,4,5], [6]]`
   5. `testMultiplePrimitiveArrayTypes` - Tests 5 different array types in a 
single schema (INT, BIGINT, DOUBLE, BOOLEAN, DATE)
   
   **Unit Tests (3 new tests in `ArrowListWriterTest`):**
   1. `testWriteSimpleArray` - Tests writing `[1, 2, 3]`
   2. `testWriteNullArray` - Tests null array handling
   3. `testWriteEmptyArray` - Tests empty array `[]` handling
   
   ### API and Format
   
   None
   
   ### Documentation
   
   Yes - this introduces a major new feature.
   
     Feature: Array Type Support for Lance Data Lake
   
     What's New:
   
     Before this change:
     - Users could NOT write tables with ARRAY columns to Lance
     - Attempting to use ARRAY<INT>, ARRAY<STRING>, etc. would throw 
UnsupportedOperationException
     - Only primitive types (INT, STRING, DATE, etc.) were supported
   
     After this change:
     - Users CAN write tables with ARRAY columns to Lance
     - Support for all array types: ARRAY<INT>, ARRAY<STRING>, ARRAY<DATE>, etc.
     - Support for nested arrays: ARRAY<ARRAY<INT>>
     - Proper handling of null arrays, empty arrays, and arrays with null 
elements
   
     Feature Scope:
   
     Supported Array Types (15 total):
     - Numeric arrays: INT, BIGINT, SMALLINT, TINYINT, FLOAT, DOUBLE
     - String/Binary arrays: VARCHAR, VARBINARY, BINARY
     - Temporal arrays: DATE, TIME, TIMESTAMP_LTZ, TIMESTAMP_NTZ
     - Other: BOOLEAN, DECIMAL
     - Nested: ARRAY<ARRAY<...>> (in addition to the 15 element types above)
   
     Example Use Cases Enabled:
     // User event logging with tags
     Schema.newBuilder()
         .column("event_id", DataTypes.INT())
         .column("tags", DataTypes.ARRAY(DataTypes.STRING()))
         .build();
   
     // Time series with array metrics
     Schema.newBuilder()
         .column("timestamp", DataTypes.TIMESTAMP_LTZ())
         .column("sensor_readings", DataTypes.ARRAY(DataTypes.DOUBLE()))
         .build();
   
     // Complex nested data
     Schema.newBuilder()
         .column("user_id", DataTypes.INT())
         .column("session_paths", DataTypes.ARRAY(DataTypes.ARRAY(DataTypes.STRING())))
         .build();

