durgaprasadml opened a new pull request, #38750:
URL: https://github.com/apache/beam/pull/38750
Description:
## What does this PR do?
This PR implements the Delta Lake source reader using the Delta Kernel API
and adds performance/integration tests for Delta Lake reads.
The implementation introduces a parallelized read path for Delta tables by
planning scans on the coordinator and distributing Parquet file reads across
Beam workers.
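The plan-then-distribute shape described above can be sketched in plain Java. This is only an illustration, not the code in this PR: `FileDescriptor`, `planScan`, and `readFile` are hypothetical stand-ins, and a thread pool stands in for Beam workers. In a Beam pipeline the descriptors would presumably travel as a `PCollection` into a `ParDo` instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: plan the scan once, then read the planned files in parallel. */
final class PlanThenReadSketch {

  /** Hypothetical descriptor for one Parquet data file selected by the scan. */
  static final class FileDescriptor {
    final String path;
    final long rowCount;

    FileDescriptor(String path, long rowCount) {
      this.path = path;
      this.rowCount = rowCount;
    }
  }

  /** Stand-in for scan planning; a real reader would obtain this list from the table snapshot. */
  static List<FileDescriptor> planScan() {
    List<FileDescriptor> files = new ArrayList<>();
    files.add(new FileDescriptor("part-00000.parquet", 100));
    files.add(new FileDescriptor("part-00001.parquet", 200));
    files.add(new FileDescriptor("part-00002.parquet", 300));
    return files;
  }

  /** Stand-in for reading one file; returns the number of rows it would emit. */
  static long readFile(FileDescriptor fd) {
    return fd.rowCount;
  }

  /** Plans on the calling thread, then fans the file reads out to a pool of workers. */
  static long readAll(int workers) {
    List<FileDescriptor> plan = planScan();
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    try {
      List<Future<Long>> futures = new ArrayList<>();
      for (FileDescriptor fd : plan) {
        Callable<Long> task = () -> readFile(fd);
        futures.add(pool.submit(task));
      }
      long total = 0;
      for (Future<Long> f : futures) {
        total += f.get();
      }
      return total;
    } catch (InterruptedException | ExecutionException e) {
      throw new RuntimeException(e);
    } finally {
      pool.shutdown();
    }
  }
}
```

The point of the split is that planning needs the whole Delta log, while each file read only needs its own descriptor, so the descriptors are the natural unit of parallelism.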
## Changes Included
### DeltaIO Reader Implementation
- Completed DeltaIO.ReadRows
- Added Delta Kernel snapshot loading support
- Added scan planning and file descriptor generation
- Implemented parallel Parquet reads using Beam transforms
- Added Beam Schema inference from Delta schemas
- Added logical Delta Row → Beam Row conversion
- Added support for:
  - primitive types
  - nested structs
  - arrays
  - maps
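As a rough illustration of the recursive schema mapping listed above, the sketch below walks a small type tree and prints the Beam field type each node would correspond to. The `DeltaType` class is a hypothetical stand-in for Delta Kernel's type classes, and the primitive table is an assumed mapping rather than the one this PR implements.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a recursive Delta-to-Beam type mapping, rendered as text. */
final class TypeMappingSketch {

  /** Minimal stand-in type tree; not Delta Kernel's DataType hierarchy. */
  static final class DeltaType {
    final String kind;              // primitive name, or "array", "map", "struct"
    final List<DeltaType> children; // element; key and value; or struct field types
    final List<String> fieldNames;  // struct field names, empty otherwise

    DeltaType(String kind, List<DeltaType> children, List<String> fieldNames) {
      this.kind = kind;
      this.children = children;
      this.fieldNames = fieldNames;
    }
  }

  // Assumed primitive mapping; the PR's actual choices may differ.
  private static final Map<String, String> PRIMITIVES = new HashMap<>();

  static {
    PRIMITIVES.put("boolean", "BOOLEAN");
    PRIMITIVES.put("integer", "INT32");
    PRIMITIVES.put("long", "INT64");
    PRIMITIVES.put("double", "DOUBLE");
    PRIMITIVES.put("string", "STRING");
  }

  static DeltaType primitive(String name) {
    return new DeltaType(name, Collections.<DeltaType>emptyList(), Collections.<String>emptyList());
  }

  static DeltaType array(DeltaType element) {
    return new DeltaType("array", Collections.singletonList(element), Collections.<String>emptyList());
  }

  static DeltaType map(DeltaType key, DeltaType value) {
    return new DeltaType("map", Arrays.asList(key, value), Collections.<String>emptyList());
  }

  static DeltaType struct(List<String> names, List<DeltaType> types) {
    return new DeltaType("struct", types, names);
  }

  /** Recurses into containers and looks primitives up in the table. */
  static String toBeamTypeName(DeltaType t) {
    switch (t.kind) {
      case "array":
        return "ARRAY<" + toBeamTypeName(t.children.get(0)) + ">";
      case "map":
        return "MAP<" + toBeamTypeName(t.children.get(0)) + ", "
            + toBeamTypeName(t.children.get(1)) + ">";
      case "struct": {
        StringBuilder sb = new StringBuilder("ROW<");
        for (int i = 0; i < t.children.size(); i++) {
          if (i > 0) {
            sb.append(", ");
          }
          sb.append(t.fieldNames.get(i)).append(' ').append(toBeamTypeName(t.children.get(i)));
        }
        return sb.append('>').toString();
      }
      default: {
        String beam = PRIMITIVES.get(t.kind);
        if (beam == null) {
          throw new IllegalArgumentException("Unsupported Delta type: " + t.kind);
        }
        return beam;
      }
    }
  }
}
```

Failing loudly on an unknown type, as the `default` branch does here, is one possible design choice; it avoids silently dropping columns the mapping does not cover.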
### Performance / Integration Tests
Added:
- DeltaIOIT
- DeltaIOTestPipelineOptions
Test scenarios:
- testReadSmall
- testReadLarge
- testReadPartitioned
The tests:
- generate Delta tables locally
- create Delta logs dynamically
- validate partitioned reads
- collect throughput and latency metrics
- publish metrics using IOITMetrics
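For readers unfamiliar with what "creating Delta logs dynamically" involves, the sketch below shows two pieces of the Delta transaction log format: the zero-padded commit file name under `_delta_log/` and a single `add` action carrying partition values. It is an assumption about the general approach, not the test code in this PR. The `DeltaLogSketch` class and its methods are hypothetical, JSON escaping is omitted, and a valid first commit would also need `protocol` and `metaData` actions.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper for writing a minimal Delta commit by hand in a test. */
final class DeltaLogSketch {

  /** Commit files are named by table version, zero-padded to 20 digits. */
  static String commitFileName(long version) {
    return String.format(Locale.ROOT, "%020d.json", version);
  }

  /**
   * Renders one "add" action as a single JSON line.
   * Assumes keys, values, and path contain no characters that need JSON escaping.
   */
  static String addAction(
      String path, Map<String, String> partitionValues, long size, long modificationTime) {
    StringBuilder pv = new StringBuilder();
    // TreeMap gives a stable key order so the output is deterministic.
    for (Map.Entry<String, String> e : new TreeMap<>(partitionValues).entrySet()) {
      if (pv.length() > 0) {
        pv.append(',');
      }
      pv.append('"').append(e.getKey()).append("\":\"").append(e.getValue()).append('"');
    }
    return "{\"add\":{\"path\":\"" + path + "\""
        + ",\"partitionValues\":{" + pv + "}"
        + ",\"size\":" + size
        + ",\"modificationTime\":" + modificationTime
        + ",\"dataChange\":true}}";
  }
}
```

Partition values live in the log entry rather than in the Parquet file itself, which is why a partitioned read test has to check that the reader merges them back into each row.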
### Build Updates
Updated sdks/java/io/delta/build.gradle with the integration test
dependencies and the Hadoop runtime dependencies that Delta Kernel needs.
## Verification
Executed:
    ./gradlew :sdks:java:io:delta:compileJava
    ./gradlew :sdks:java:io:delta:compileTestJava
    ./gradlew :sdks:java:io:delta:test --tests org.apache.beam.sdk.io.delta.DeltaIOIT
Fixes #38559