Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/2087#issuecomment-58960161
  
    Hey Sandy, had a couple of questions about behavior and assumptions from
Hadoop. The current approach performs a lot of reflection every time we invoke
this statistics function, which is very expensive. There is also some
"catch-all" exception handling that could bite us. Two things that would help
here:
    
    1. Determine once, up front, whether we will try to compute these
advanced statistics:
    ```scala
      import scala.util.Try

      private val statsClass =
        "org.apache.hadoop.fs.FileSystem$Statistics$StatisticsData"
      private val statsFunction = "getThreadStatistics"

      /** Whether to attempt accessing per-thread statistics from Hadoop */
      private[spark] val hasAdvancedStatistics: Boolean =
        Try(Class.forName(statsClass).getDeclaredMethod(statsFunction)).isSuccess
    ```
    
    Then remove the exception blocks elsewhere. That is, if we detect that
advanced statistics are available but then fail to retrieve them, we should
throw an exception rather than swallow it.
    
    2. Perform as much reflection as possible off the critical path. That is,
the Hadoop RDD should look up the reflective method once for the computing
thread at the beginning, then only invoke that cached handle inside the hot
loop.
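
    A minimal sketch of the pattern in point 2, using a standard JDK class
as a stand-in for the Hadoop statistics API (the class and method names here
are illustrative, not the actual Hadoop ones): resolve the
`java.lang.reflect.Method` once at construction time, so the hot loop only
pays for `Method.invoke`, never for `Class.forName`/`getMethod`:

    ```scala
    import java.lang.reflect.Method
    import scala.util.Try

    object ReflectionOffHotPath {
      // One-time lookup, off the critical path. A Failure here means the
      // API is unavailable, and we record that once rather than catching
      // exceptions on every call.
      private val lengthMethod: Option[Method] =
        Try(classOf[String].getMethod("length")).toOption

      // Hot path: no reflective lookups, just invoking the cached handle.
      def invokeLength(s: String): Option[Int] =
        lengthMethod.map(_.invoke(s).asInstanceOf[Int])
    }
    ```

    The same shape applies to the statistics call: the RDD would hold the
resolved `Method` in a field and call `invoke` per record or per partition.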



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
