[
https://issues.apache.org/jira/browse/HADOOP-19863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steve Loughran resolved HADOOP-19863.
-------------------------------------
Fix Version/s: 3.5.1
Resolution: Fixed
> Incorrect Vectored IO metrics from Local Filesystem
> ---------------------------------------------------
>
> Key: HADOOP-19863
> URL: https://issues.apache.org/jira/browse/HADOOP-19863
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs
> Affects Versions: 3.5.0
> Reporter: Peter Toth
> Assignee: Steve Loughran
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.5.1
>
> Attachments: Screenshot 2026-04-16 at 19.02.30.png, Screenshot
> 2026-04-16 at 19.03.51.png
>
>
> As discussed in
> [https://github.com/apache/parquet-java/issues/2703#issuecomment-4260121705]
> we noticed that when vectored IO is enabled the {{BytesRead}} metrics of
> Spark tasks are not correct.
> Spark fetches that metric via {{FileSystem.getAllStatistics}}; see:
> - [https://github.com/apache/spark/blob/5d491f62748b4b9c34bc3b5bd7390f7b5ca75053/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L98-L109]
> - [https://github.com/apache/spark/blob/5d491f62748b4b9c34bc3b5bd7390f7b5ca75053/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L164-L170]
> Repro with latest Spark 4.2.0-SNAPSHOT using Hadoop 3.5.0:
> Vectored IO is enabled by default:
> {code:java}
> ➜ bin/spark-shell
> scala> spark.createDataFrame((0 until 5000).map(i => (i, s"left_$i"))).repartition(1).write.parquet("/tmp/t2")
> scala> spark.read.parquet("/tmp/t2").createOrReplaceTempView("t2")
> scala> sql("SELECT * FROM t2").collect()
> {code}
> !Screenshot 2026-04-16 at 19.02.30.png|width=85%!
> Vectored IO is disabled explicitly:
> {code:java}
> ➜ bin/spark-shell --conf spark.hadoop.parquet.hadoop.vectored.io.enabled=false
> scala> spark.read.parquet("/tmp/t2").createOrReplaceTempView("t2")
> scala> sql("SELECT * FROM t2").collect()
> {code}
> !Screenshot 2026-04-16 at 19.03.51.png|width=85%!
> In my case the generated test file size was ~45KB:
> {code:java}
> ➜ ls -ll /tmp/t2
> total 88
> -rw-r--r--@ 1 ptoth wheel 0 Apr 16 18:57 _SUCCESS
> -rw-r--r--@ 1 ptoth wheel 44944 Apr 16 18:57 part-00000-cf825cf6-2fa5-46a2-b897-dbb9dc9828a7-c000.snappy.parquet
> {code}
> I believe reading the parquet footers doesn't go through vectored IO, so the
> reduced 1680B figure probably corresponds to that.
> There is no data pruning in the query so the metric value should be around
> the file size.
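
The expectation stated at the end of the description (no pruning, so {{BytesRead}} should be close to the data file size) can be checked mechanically. The following is a minimal sketch and not part of the fix: the class name BytesReadCheck, the 10% tolerance, and the rule of skipping file names that start with "_" or "." (marker and checksum files) are assumptions made for illustration, and the reported value has to be copied by hand from the Spark UI.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper: compares a reported BytesRead value with the
// on-disk size of the data files in an output directory.
public class BytesReadCheck {

    /** Sums the sizes of the regular files directly under dir, skipping "_" and "." names. */
    static long dataBytes(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files
                .filter(Files::isRegularFile)
                .filter(p -> {
                    String name = p.getFileName().toString();
                    // Assumption: "_SUCCESS" and ".crc"-style files are not data.
                    return !name.startsWith("_") && !name.startsWith(".");
                })
                .mapToLong(p -> {
                    try {
                        return Files.size(p);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                })
                .sum();
        }
    }

    /** True when reported lies within the given fraction of expected. */
    static boolean plausible(long reported, long expected, double tolerance) {
        return Math.abs(reported - expected) <= tolerance * expected;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: BytesReadCheck <output-dir> <reported-bytes-read>");
            return;
        }
        long expected = dataBytes(Paths.get(args[0]));
        long reported = Long.parseLong(args[1]);
        System.out.printf("data bytes on disk: %d, reported: %d, plausible: %b%n",
            expected, reported, plausible(reported, expected, 0.10));
    }
}
```

With Java 11 or later it can be run as a single source file, for example: java BytesReadCheck.java /tmp/t2 <value shown in the Spark UI>. The tolerance is deliberately loose, since footer reads and format overhead mean the metric need not equal the file size exactly.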