[PR] Spark, Parquet: honor Hadoop vectored IO read config [iceberg]

via GitHub Fri, 29 May 2026 12:43:27 -0700


venkateshwaracholan opened a new pull request, #16614:
URL: https://github.com/apache/iceberg/pull/16614


   ## Summary
   
   Iceberg currently forces Hadoop vectored I/O on when building Parquet 
read options by always calling `withUseHadoopVectoredIo(true)`. This prevents 
users from disabling vectored I/O with Parquet’s existing 
`parquet.hadoop.vectored.io.enabled=false` setting.
   
   This change:
   - Removes the hardcoded vectored I/O override.
   - Preserves Parquet’s default vectored I/O behavior.
   - Propagates `parquet.hadoop.vectored.io.enabled` through Spark read 
planning so executor-side Parquet readers can honor it.
   - Adds Parquet and Spark tests for the config behavior.
   
   ## Why
   
   For `S3FileIO`, Spark executor-side reads use Iceberg input files such as 
`S3InputFile`, not `HadoopInputFile`. That means Hadoop configuration is not 
automatically available when Parquet read options are built on executors.
   
   By carrying `parquet.hadoop.vectored.io.enabled` from Spark read options, 
Spark Hadoop configuration, or table properties into the Parquet reader, users 
can disable vectored I/O for environments where it causes excessive on-heap 
allocation.
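   
   As a rough illustration of the propagation described above, the sketch 
   below resolves the flag from three sources. It is not the code in this PR: 
   the class and method names are hypothetical, plain maps stand in for Spark 
   read options, Hadoop configuration, and table properties, and the precedence 
   order (read option, then Hadoop configuration, then table property) is an 
   assumption taken from the order in which the sources are listed. The 
   fallback is passed in by the caller rather than hardcoded, so Parquet's own 
   default is preserved when nothing is set.
   
   ```java
   import java.util.Map;
   import java.util.Objects;
   import java.util.stream.Stream;

   // Hypothetical helper, for illustration only; not part of Iceberg or Parquet.
   final class VectoredIoConfigSketch {
     // The Parquet setting named in this PR.
     static final String VECTORED_IO_ENABLED = "parquet.hadoop.vectored.io.enabled";

     private VectoredIoConfigSketch() {}

     // Returns the first explicitly configured value, checking the sources in
     // the assumed precedence order, or parquetDefault when none of them sets it.
     static boolean resolve(
         Map<String, String> readOptions,
         Map<String, String> hadoopConf,
         Map<String, String> tableProperties,
         boolean parquetDefault) {
       return Stream.of(readOptions, hadoopConf, tableProperties)
           .map(source -> source.get(VECTORED_IO_ENABLED))
           .filter(Objects::nonNull)
           .findFirst()
           .map(Boolean::parseBoolean)
           .orElse(parquetDefault);
     }

     public static void main(String[] args) {
       // A table property alone is enough to turn vectored I/O off.
       boolean enabled =
           resolve(Map.of(), Map.of(), Map.of(VECTORED_IO_ENABLED, "false"), true);
       System.out.println("vectored I/O enabled: " + enabled);
     }
   }
   ```
   
   Because the resolved value is a plain boolean, it can be shipped to 
   executors with the rest of the read configuration, which matters when the 
   input file is not a `HadoopInputFile` and no Hadoop configuration is at hand.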
   
   ## Testing
   
   - `./gradlew :iceberg-parquet:test --tests org.apache.iceberg.parquet.TestParquet`
   - `./gradlew :iceberg-spark:iceberg-spark-4.1_2.13:test --tests org.apache.iceberg.spark.TestSparkReadConf`
   - `./gradlew :iceberg-spark:iceberg-spark-4.1_2.13:compileTestJava`
   - `./gradlew spotlessCheck`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

