venkateshwaracholan opened a new pull request, #16614: URL: https://github.com/apache/iceberg/pull/16614
## Summary

Iceberg currently hardcodes Hadoop vectored I/O on when building Parquet read options by calling `withUseHadoopVectoredIo(true)`. This prevents users from disabling vectored I/O with Parquet’s existing `parquet.hadoop.vectored.io.enabled=false` setting.

This change:

- Removes the hardcoded vectored I/O override.
- Preserves Parquet’s default vectored I/O behavior.
- Propagates `parquet.hadoop.vectored.io.enabled` through Spark read planning so executor-side Parquet readers can honor it.
- Adds Parquet and Spark tests for the config behavior.

## Why

For `S3FileIO`, Spark executor-side reads use Iceberg input files such as `S3InputFile`, not `HadoopInputFile`. That means Hadoop configuration is not automatically available when Parquet read options are built on executors.

By carrying `parquet.hadoop.vectored.io.enabled` from Spark read options, Spark Hadoop configuration, or table properties into the Parquet reader, users can disable vectored I/O for environments where it causes excessive on-heap allocation.

## Testing

- `./gradlew :iceberg-parquet:test --tests org.apache.iceberg.parquet.TestParquet`
- `./gradlew :iceberg-spark:iceberg-spark-4.1_2.13:test --tests org.apache.iceberg.spark.TestSparkReadConf`
- `./gradlew :iceberg-spark:iceberg-spark-4.1_2.13:compileTestJava`
- `./gradlew spotlessCheck`
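For reviewers who want a concrete picture of the propagation described under "Why", here is a minimal sketch of how the setting might be resolved from the three sources the PR names. The class `VectoredIoConfSketch` and its `resolve` method are hypothetical and are not taken from the PR's diff. The precedence shown (read options, then Hadoop configuration, then table properties) is an assumption made for illustration; the PR description lists the sources but does not state their order. An empty result stands for "not set anywhere", in which case Parquet's own default would apply.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Illustrative sketch only: this class and its method are hypothetical, not Iceberg API.
class VectoredIoConfSketch {
  // Parquet's existing setting, as named in the PR description.
  static final String VECTORED_IO_ENABLED = "parquet.hadoop.vectored.io.enabled";

  // Returns the first explicit value found, or empty if no source sets the key.
  static Optional<Boolean> resolve(
      Map<String, String> readOptions,
      Map<String, String> hadoopConf,
      Map<String, String> tableProperties) {
    // ASSUMPTION: sources are checked in this order; the real precedence may differ.
    for (Map<String, String> source : List.of(readOptions, hadoopConf, tableProperties)) {
      String value = source.get(VECTORED_IO_ENABLED);
      if (value != null) {
        return Optional.of(Boolean.parseBoolean(value));
      }
    }
    // Nothing set: the caller should leave Parquet's default untouched.
    return Optional.empty();
  }

  public static void main(String[] args) {
    // Only the Hadoop configuration sets the key in this example.
    Optional<Boolean> resolved =
        resolve(Map.of(), Map.of(VECTORED_IO_ENABLED, "false"), Map.of());
    System.out.println(
        resolved.map(String::valueOf).orElse("unset (Parquet default applies)"));
  }
}
```

Returning an `Optional` rather than a plain boolean keeps "unset" distinct from an explicit `true` or `false`, which matches the PR's goal of preserving Parquet's default behavior instead of overriding it.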
