Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-58960161
Hey Sandy, I had a couple of questions about behavior and assumptions from
Hadoop. The current approach does a lot of reflection every time we invoke
this statistics function, which is very expensive. There is also some
"catch-all" exception handling that could bite us. Two changes would help:
1. Determine once, up front, whether we will try to compute these
advanced statistics:
```
private val statsClass =
"org.apache.hadoop.fs.FileSystem$Statistics$StatisticsData"
private val statsFunction = "getThreadStatistics"
/** Whether to attempt accessing per-thread statistics from Hadoop */
private[spark] val hasAdvancedStatistics =
Try(Class.forName(statsClass).getDeclaredMethod(statsFunction)).isSuccess
```
Then remove the catch-all exception blocks elsewhere. That is, if we detect
that advanced statistics are available but retrieving them later fails, we
should throw an exception rather than swallow it.
2. Perform as much reflection as possible off the critical path. That is,
the Hadoop RDD should look up the statistics method for the computing thread
once at the beginning, and then only invoke that cached method inside the
hot loop.
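To illustrate point 2, here is a minimal Java sketch of the pattern (class and method names are illustrative stand-ins, not Hadoop's actual statistics API): the expensive `getMethod` lookup happens once, and the hot loop only calls `invoke` on the cached `Method` handle.

```java
import java.lang.reflect.Method;

public class CachedReflection {
    // Hypothetical target class standing in for the Hadoop statistics class.
    public static class Stats {
        public long bytesRead() { return 42L; }
    }

    public static void main(String[] args) throws Exception {
        // Expensive reflective lookup, performed once, off the critical path.
        Method bytesRead = Stats.class.getMethod("bytesRead");

        Stats stats = new Stats();
        long total = 0;
        // Hot loop: only Method.invoke, no Class.forName/getMethod per call.
        for (int i = 0; i < 1000; i++) {
            total += (Long) bytesRead.invoke(stats);
        }
        System.out.println(total); // prints 42000
    }
}
```

If the lookup in step 1 failed, the cached `Method` would simply never be populated, so the hot path can skip the call entirely without any per-iteration exception handling.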