ebyhr opened a new pull request, #16588: URL: https://github.com/apache/iceberg/pull/16588
## Summary

- Fixes #14284

Adds native Parquet file reader and writer implementations that use Iceberg's `InputFile`/`OutputFile` abstractions instead of Hadoop's `FileSystem`, so the Parquet module can be used without a `parquet-hadoop` runtime dependency. This allows systems like Trino (which prohibits Hadoop dependencies) to use Iceberg's Parquet module by setting the JVM property `iceberg.parquet-client=NATIVE`.

## Changes

### New Native I/O Classes

- **`NativeParquetFileReader`**: Reads Parquet files using `InputFile`; parses the footer and pages without `parquet-hadoop`
- **`NativeParquetFileWriter`**: Writes Parquet files using `OutputFile`
- **`ParquetFileReaderFactory`/`ParquetFileWriterFactory`**: Route to the native or Hadoop implementation based on the `iceberg.parquet-client` property

### New Metadata Classes (adapted from Trino)

- **`BlockMetadata`**: Row group metadata record
- **`ColumnChunkMetadata`**: Abstract base with `IntColumnChunkMetadata`/`LongColumnChunkMetadata` variants for memory efficiency
- **`ColumnChunkProperties`**: Column properties record
- **`FileMetadata`**: File-level metadata container
- **`ParquetMetadata`**: Top-level metadata container
- **`ParquetMetadataUtil`**: Converts Iceberg metadata to Hadoop metadata for backward compatibility
- **`StatisticsConverter`**: Converts `format.Statistics` to `column.statistics.Statistics` without a `parquet-hadoop` dependency

### Modified Classes

- **`ParquetUtil.fileMetrics()`**: Routes through the factory to support the native client
- **`Parquet`**: Added the `PARQUET_CLIENT` constant and the `ParquetClient` enum (`NATIVE`, `HADOOP`)

### Key Implementation Details

- Removed the `ParquetMetadataConverter` dependency by implementing `convertEncoding()`, `convertEncodingStats()`, `toFormatEncoding()`, and `toFormatStatistics()` locally
- Native reader/writer use only `parquet-format` and `parquet-column` (no `parquet-hadoop` at runtime)
- `parquet-hadoop` remains a `compileOnly` dependency for interface compatibility (`CompressionCodecName`)
- Factory pattern allows seamless switching via JVM property: `-Diceberg.parquet-client=NATIVE`
- Default remains `HADOOP` for backward compatibility

## Test Plan

- Added `TestNativeParquetFileReader` for basic read validation
- Existing Parquet tests pass with the native client via property override
- Build succeeds: `./gradlew :iceberg-parquet:build -x test -x integrationTest`

## Compatibility

- **Backward compatible**: Defaults to the Hadoop client
- **Trino compatible**: Can run without `parquet-hadoop` on the classpath
- Metadata classes follow Apache 2.0 licensed Trino implementation patterns

🤖 Generated with Claude Code
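
The property-based switching described under "Key Implementation Details" could look roughly like the sketch below. It is an illustration only, not the code in this PR: the class `ParquetClientSelector`, the `FooterReader` interface, and the placeholder reader lambdas are hypothetical names. Only the property name `iceberg.parquet-client`, the enum values `NATIVE`/`HADOOP`, and the `HADOOP` default come from the description. Case-insensitive parsing of the property value is an assumption.

```java
import java.util.Locale;

// Hypothetical sketch of property-driven client selection; names are illustrative.
final class ParquetClientSelector {
  // Property name as stated in the PR description.
  static final String PARQUET_CLIENT = "iceberg.parquet-client";

  // Enum values as stated in the PR description.
  enum ParquetClient { NATIVE, HADOOP }

  // Stand-in for a reader abstraction; the real PR returns actual reader classes.
  interface FooterReader {
    String describe(String location);
  }

  private ParquetClientSelector() {}

  // Reads the JVM property and falls back to HADOOP when it is unset or blank.
  // Assumption: values are matched case-insensitively. An unrecognized value
  // makes Enum.valueOf throw IllegalArgumentException.
  static ParquetClient resolve() {
    String value = System.getProperty(PARQUET_CLIENT);
    if (value == null || value.trim().isEmpty()) {
      return ParquetClient.HADOOP;
    }
    return ParquetClient.valueOf(value.trim().toUpperCase(Locale.ROOT));
  }

  // Routes to a placeholder implementation based on the resolved client.
  static FooterReader newReader() {
    switch (resolve()) {
      case NATIVE:
        return location -> "native reader for " + location;
      case HADOOP:
      default:
        return location -> "hadoop reader for " + location;
    }
  }

  public static void main(String[] args) {
    // Equivalent to launching the JVM with -Diceberg.parquet-client=NATIVE
    System.setProperty(PARQUET_CLIENT, "NATIVE");
    System.out.println(newReader().describe("data/file.parquet"));
  }
}
```

Keeping the lookup in one `resolve()` method means callers such as a metrics utility never branch on the property themselves; they only ask the factory for a reader.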
