Github user sryza commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3120#discussion_r20262822
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -252,7 +258,7 @@ class HadoopRDD[K, V](
                 && bytesReadCallback.isDefined) {
               recordsSinceMetricsUpdate = 0
               val bytesReadFn = bytesReadCallback.get
    -          inputMetrics.bytesRead = bytesReadFn()
    +          inputMetrics.bytesRead = bytesReadFn() + bytesReadAtStart
    --- End diff --
    
    The issue is that the Hadoop APIs we're relying on don't allow 
bytesReadFn to return incremental bytes read; all they report is the 
total bytes read by the thread.  If reads on behalf of different RDDs are 
interleaved on the same thread, the only way I can think of to determine the 
bytes read by each RDD is to snapshot the bytes read so far before and after 
every read operation, which is prohibitively expensive.
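
    A minimal sketch of the problem, assuming a hypothetical `threadBytesRead` 
counter standing in for Hadoop's cumulative per-thread bytes-read statistic 
(the names and the simulated reads are illustrative, not Spark's actual code):

    ```scala
    // Hedged sketch: `threadBytesRead` stands in for Hadoop's cumulative
    // per-thread counter; `readForRdd` simulates a read on this thread.
    object InterleavedReadsSketch {
      private var counter = 0L               // cumulative bytes read by thread
      def threadBytesRead(): Long = counter  // all the API exposes: a running total
      def readForRdd(bytes: Long): Unit = { counter += bytes }

      def main(args: Array[String]): Unit = {
        // To attribute bytes to one RDD when reads are interleaved, we must
        // snapshot the counter before and after each of that RDD's reads.
        var rddABytes = 0L
        var before = threadBytesRead()
        readForRdd(128)                      // read on behalf of RDD A
        rddABytes += threadBytesRead() - before

        readForRdd(512)                      // interleaved read for RDD B

        before = threadBytesRead()
        readForRdd(64)                       // another read for RDD A
        rddABytes += threadBytesRead() - before

        println(rddABytes)                   // prints 192, not the thread total 704
      }
    }
    ```

    The per-read before/after snapshots are exactly the bookkeeping the comment 
calls prohibitively expensive when done around every read operation.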
    



