XuQianJin-Stars opened a new pull request, #2345:
URL: https://github.com/apache/fluss/pull/2345
### Purpose
Linked issue: close #1569
Refactor fluss-lake-lance module to eliminate code duplication and improve
maintainability.
**Issue**: The fluss-lake-lance module previously duplicated 1100+ lines of
Arrow-related code (ArrowWriter and 18 ArrowFieldWriter implementations)
because fluss-common uses shaded Arrow API while Lance library requires
non-shaded Arrow API.
**Solution**: Implement a bridge pattern with LanceArrowWriter to directly
write Fluss InternalRow to non-shaded Arrow vectors, and adopt batch processing
approach (inspired by lance-flink) instead of stream processing.
### Brief change log
**1. Eliminated Code Duplication (1100+ lines removed)**
- Removed duplicate `ArrowWriter` class from fluss-lake-lance
- Removed 18 duplicate `ArrowFieldWriter` implementations:
- ArrowTinyIntWriter, ArrowSmallIntWriter, ArrowIntWriter,
ArrowBigIntWriter
- ArrowBooleanWriter, ArrowFloatWriter, ArrowDoubleWriter
- ArrowVarCharWriter, ArrowVarBinaryWriter, ArrowBinaryWriter
- ArrowDecimalWriter, ArrowDateWriter, ArrowTimeWriter
- ArrowTimestampLtzWriter, ArrowTimestampNtzWriter
- And ArrowFieldWriter base class
**2. Created LanceArrowWriter**
- Implements a bridge to write Fluss InternalRow directly to non-shaded
Arrow VectorSchemaRoot
- Supports all Fluss data types (TinyInt, SmallInt, Int, BigInt, Float,
Double, Boolean, String, Char, Binary, VarBinary, Decimal, Date, Time,
TimestampLtz, TimestampNtz)
- Uses custom FieldWriter implementations that directly operate on
non-shaded Arrow vectors
**3. Refactored LanceLakeWriter (Batch Processing)**
- Changed from stream processing (ArrowReader + semaphores) to simple batch
processing
- Uses ArrayList buffer to collect rows
- Writes buffered data to VectorSchemaRoot when batch size is reached
- Directly calls `Fragment.create(datasetUri, allocator, root, writeParams)`
(similar to lance-flink approach)
**4. Updated LanceArrowUtils**
- Simplified to only handle schema conversion (Fluss RowType to non-shaded
Arrow Schema)
- Removed duplicate field writer creation methods
**5. Updated Dependencies**
- Changed fluss-common dependency scope from `provided` to `compile` in
pom.xml
- Removed explicit non-shaded Arrow dependencies (inherited from lance-core)
**6. Cleanup**
- Removed unused imports and methods from LanceDatasetAdapter
- Fixed deprecated ArrowType.Decimal constructor usage in tests
- Added @SuppressWarnings for unchecked cast in LanceCommittableSerializer
### Tests
**Unit Tests**
- ✅ `LanceTieringTest.testTieringWriteTable` (with/without partitions)
- ✅ `LakeEnabledTableCreateITCase` (table creation with various data types)
**Verification**
- All existing tests pass without modifications
- Performance: Test execution time remains similar (~1.2s for
LanceTieringTest)
- No behavioral changes to external APIs
### API and Format
**API Changes**: None - This is an internal refactoring
- External APIs (`LakeWriter`, `LanceLakeTieringFactory`) remain unchanged
- Lance dataset format and compatibility are preserved
**Internal Changes**:
- `LanceLakeWriter` implementation changed from stream to batch processing
- New `LanceArrowWriter` class introduced (internal use only)
### Documentation
**Documentation Updates**: Not required
- This is an internal code refactoring
- No new features exposed to users
- Existing Lance documentation remains valid
**Code Quality Improvements**:
- Reduced code duplication by 1100+ lines
- Simplified architecture (batch processing is easier to understand than
stream processing)
- Better maintainability (single implementation instead of duplicate writers)
- Follows lance-flink's proven approach
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]