Github user sryza commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2087#discussion_r18832502
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala ---
    @@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
         UserGroupInformation.loginUserFromKeytab(principalName, keytabFilename)
       }
     
    +  /**
    +   * Returns a function that can be called to find the number of Hadoop FileSystem bytes read by
    +   * this thread so far. Reflection is required because thread-level FileSystem statistics are only
    +   * available as of Hadoop 2.5 (see HADOOP-10688). Returns None if the required method can't be
    +   * found.
    +   */
    +  def getInputBytesReadCallback(path: Path, conf: Configuration): Option[() => Long] = {
    --- End diff --
    
    getInputBytesReadCallback only gets called once per task, to look up the 
    method reflectively and return a callback.  Are you worried about that 
    per-task overhead?  We still do have a single reflective call to actually 
    invoke it whenever we want to populate the metric.  The internet has a few 
    different opinions on the cost of that: most likely it's only about twice 
    the overhead of a direct method call, but I've also seen threads claiming 
    it's much more.  Either way, that was the root of my earlier wariness about 
    having this call on the read path as opposed to doing it asynchronously in 
    a separate thread.
    
    On the catch-all exception, you're right - will rework that part.

