[jira] [Commented] (HADOOP-19863) Incorrect Vectored IO metrics from Local Filesystem

ASF GitHub Bot (Jira) Wed, 13 May 2026 15:08:05 -0700


    [ 
https://issues.apache.org/jira/browse/HADOOP-19863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18080781#comment-18080781
 ]


ASF GitHub Bot commented on HADOOP-19863:
-----------------------------------------

hadoop-yetus commented on PR #8496:
URL: https://github.com/apache/hadoop/pull/8496#issuecomment-4445570417

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   1m  1s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 2 new or modified test files.  |
   |||| _ branch-3.5 Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  42m 58s |  |  branch-3.5 passed  |
   | +1 :green_heart: |  compile  |  15m 55s |  |  branch-3.5 passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04  |
   | +1 :green_heart: |  compile  |  16m 14s |  |  branch-3.5 passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1  |
   | +1 :green_heart: |  checkstyle  |   1m 30s |  |  branch-3.5 passed  |
   | +1 :green_heart: |  mvnsite  |   1m 56s |  |  branch-3.5 passed  |
   | +1 :green_heart: |  javadoc  |   1m 30s |  |  branch-3.5 passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04  |
   | +1 :green_heart: |  javadoc  |   1m 26s |  |  branch-3.5 passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1  |
   | +1 :green_heart: |  spotbugs  |   3m 10s |  |  branch-3.5 passed  |
   | +1 :green_heart: |  shadedclient  |  30m 49s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 11s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  15m 19s |  |  the patch passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04  |
   | +1 :green_heart: |  javac  |  15m 19s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  16m 25s |  |  the patch passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1  |
   | +1 :green_heart: |  javac  |  16m 25s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   1m 26s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   1m 55s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 28s |  |  the patch passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04  |
   | +1 :green_heart: |  javadoc  |   1m 28s |  |  the patch passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1  |
   | +1 :green_heart: |  spotbugs  |   3m 22s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  30m 51s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |  22m 58s |  |  hadoop-common in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   1m 16s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 215m 24s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.54 ServerAPI=1.54 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8496/1/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/8496 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 775a870acedd 5.15.0-173-generic #183-Ubuntu SMP Fri Mar 6 13:29:34 UTC 2026 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.5 / aa7c7f73a70306d1a52a1cce4142921992dc0758 |
   | Default Java | Ubuntu-17.0.18+8-Ubuntu-124.04.1 |
   | Multi-JDK versions | /usr/lib/jvm/java-21-openjdk-amd64:Ubuntu-21.0.10+7-Ubuntu-124.04 /usr/lib/jvm/java-17-openjdk-amd64:Ubuntu-17.0.18+8-Ubuntu-124.04.1 |
   | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8496/1/testReport/ |
   | Max. process+thread count | 1285 (vs. ulimit of 5500) |
   | modules | C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8496/1/console |
   | versions | git=2.43.0 maven=3.9.15 spotbugs=4.9.7 |
   | Powered by | Apache Yetus 0.14.1 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




> Incorrect Vectored IO metrics from Local Filesystem
> ---------------------------------------------------
>
>                 Key: HADOOP-19863
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19863
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 3.5.0
>            Reporter: Peter Toth
>            Assignee: Steve Loughran
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.6.0
>
>         Attachments: Screenshot 2026-04-16 at 19.02.30.png, Screenshot 
> 2026-04-16 at 19.03.51.png
>
>
> As discussed in 
> [https://github.com/apache/parquet-java/issues/2703#issuecomment-4260121705], 
> we noticed that when vectored IO is enabled the {{BytesRead}} metric of 
> Spark tasks is not correct.
> Spark fetches that metric via {{FileSystem.getAllStatistics}}; see
>  - 
> [https://github.com/apache/spark/blob/5d491f62748b4b9c34bc3b5bd7390f7b5ca75053/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L98-L109]
>  and
>  - 
> [https://github.com/apache/spark/blob/5d491f62748b4b9c34bc3b5bd7390f7b5ca75053/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L164-L170]
> Repro with latest Spark 4.2.0-SNAPSHOT using Hadoop 3.5.0:
> Vectored IO is enabled by default:
> {code:java}
> ➜  bin/spark-shell
> scala> spark.createDataFrame((0 until 5000).map(i => (i, 
> s"left_$i"))).repartition(1).write.parquet("/tmp/t2")
> scala> spark.read.parquet("/tmp/t2").createOrReplaceTempView("t2")
> scala> sql("SELECT * FROM t2").collect()
> {code}
> !Screenshot 2026-04-16 at 19.02.30.png|width=85%!
> Vectored IO is disabled explicitly:
> {code:java}
> ➜  bin/spark-shell --conf 
> spark.hadoop.parquet.hadoop.vectored.io.enabled=false
> scala> spark.read.parquet("/tmp/t2").createOrReplaceTempView("t2")
> scala> sql("SELECT * FROM t2").collect()
> {code}
> !Screenshot 2026-04-16 at 19.03.51.png|width=85%!
> In my case the generated test file size was ~45KB:
> {code:java}
> ➜  ls -ll /tmp/t2
> total 88
> -rw-r--r--@ 1 ptoth  wheel      0 Apr 16 18:57 _SUCCESS
> -rw-r--r--@ 1 ptoth  wheel  44944 Apr 16 18:57 
> part-00000-cf825cf6-2fa5-46a2-b897-dbb9dc9828a7-c000.snappy.parquet{code}
> I believe reading the Parquet footers doesn't go through vectored IO, so the 
> decreased 1680B probably belongs to that.
> There is no data pruning in the query, so the metric value should be around 
> the file size.
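>
> A minimal sketch of one way to cross-check the metric from the same 
> {{spark-shell}} session, assuming local mode (tasks run in the shell's JVM) 
> and the {{/tmp/t2}} data written above. The helper name {{localBytesRead}} is 
> made up for illustration, and it sums the JVM-wide counters for the {{file}} 
> scheme, which may differ from how the linked Spark code aggregates them:
> {code:java}
> scala> import org.apache.hadoop.fs.FileSystem
> scala> import scala.jdk.CollectionConverters._
> scala> // Hypothetical helper: bytes read so far through the local ("file") scheme in this JVM.
> scala> def localBytesRead(): Long = FileSystem.getAllStatistics.asScala.filter(_.getScheme == "file").map(_.getBytesRead).sum
> scala> val before = localBytesRead()
> scala> sql("SELECT * FROM t2").collect()
> scala> // Delta attributable to the query above (plus any other local reads made meanwhile).
> scala> println(s"bytes read by the query: ${localBytesRead() - before}")
> {code}
> If the counters are accurate, the printed delta should be close to the 
> Parquet file size, since the query does no pruning.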



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

