Github user sryza commented on a diff in the pull request:
https://github.com/apache/spark/pull/3120#discussion_r20262822
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -252,7 +258,7 @@ class HadoopRDD[K, V](
&& bytesReadCallback.isDefined) {
recordsSinceMetricsUpdate = 0
val bytesReadFn = bytesReadCallback.get
- inputMetrics.bytesRead = bytesReadFn()
+ inputMetrics.bytesRead = bytesReadFn() + bytesReadAtStart
--- End diff ---
The issue is that the Hadoop APIs we're relying on don't allow
bytesReadFn to return incremental bytes read; all they do is report the
total bytes read by the thread. If reads on behalf of different RDDs are
interleaved on the same thread, the only way I can think of to determine the
bytes read by each RDD is to measure the bytes read so far before and after
every read operation, which is prohibitively expensive.
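To make the limitation concrete, here is a minimal sketch (all names are hypothetical, standing in for Hadoop's `FileSystem.Statistics` thread counter): the counter only exposes a running per-thread total, so attributing bytes to a single read requires snapshotting the total before and after that read.

```java
// Sketch with hypothetical names: mimics Hadoop's per-thread byte counter,
// which only exposes a running total for the whole thread.
public class BytesReadSketch {
    // Stand-in for the thread-level statistics Hadoop exposes: a single
    // running total of bytes read on this thread, across all RDDs.
    private static final ThreadLocal<long[]> THREAD_BYTES =
        ThreadLocal.withInitial(() -> new long[] {0L});

    static void simulateRead(long n) {
        THREAD_BYTES.get()[0] += n;
    }

    static long totalBytesRead() {
        return THREAD_BYTES.get()[0];
    }

    // Attributing bytes to one read means bracketing it with two snapshots
    // of the thread total -- the per-read bookkeeping that becomes
    // prohibitively expensive when done around every record.
    static long attributedRead(long n) {
        long before = totalBytesRead();
        simulateRead(n);
        return totalBytesRead() - before;
    }

    public static void main(String[] args) {
        simulateRead(100L); // a different RDD's read on the same thread
        long mine = attributedRead(40L);
        System.out.println("thread total = " + totalBytesRead()
            + ", this read = " + mine);
    }
}
```

The snapshot delta correctly isolates one read, but only because the counter is sampled immediately before and after it; sampling once per metrics-update interval instead (as the patch does) is cheap but can only be correct if no other RDD's reads are interleaved in between.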