Github user sryza commented on a diff in the pull request:
https://github.com/apache/spark/pull/2087#discussion_r18832502
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
---
@@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
UserGroupInformation.loginUserFromKeytab(principalName, keytabFilename)
}
+  /**
+   * Returns a function that can be called to find the number of Hadoop FileSystem bytes read by
+   * this thread so far. Reflection is required because thread-level FileSystem statistics are only
+   * available as of Hadoop 2.5 (see HADOOP-10688). Returns None if the required method can't be
+   * found.
+   */
+  def getInputBytesReadCallback(path: Path, conf: Configuration): Option[() => Long] = {
--- End diff ---
getInputBytesReadCallback only gets called once per task - to find the
function and return it. Are you worried about that per-task overhead? We
still do have a single reflective call to actually invoke it when we want to
populate the metric. The internet has a few different opinions on the overhead
of this. The most likely answer is that it's only about twice the overhead of
a direct function call, but I've also seen threads that say it's much more.
Either way, this was the root of my earlier wariness about having this call on
the read path as opposed to doing it asynchronously in a separate thread.
On the catch-all exception, you're right - will rework that part.
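To make the "look up once, invoke many times" point concrete, here is a minimal self-contained sketch of the pattern. The `ThreadStats` class and its `getBytesRead` method are stand-ins for Hadoop's thread-level FileSystem statistics (which require Hadoop 2.5+ on the classpath), not the actual Spark or Hadoop API; only the reflection pattern itself is the point:

```scala
import java.lang.reflect.Method

object ReflectionCallbackSketch {
  // Stand-in for the Hadoop 2.5+ thread-level statistics object (hypothetical).
  class ThreadStats {
    def getBytesRead: Long = 42L
  }

  /**
   * Pays the reflective method lookup once, then returns a closure that
   * reuses the cached Method on every call. Returns None if the method
   * isn't available (analogous to running on a pre-2.5 Hadoop).
   */
  def makeBytesReadCallback(stats: AnyRef): Option[() => Long] =
    try {
      val m: Method = stats.getClass.getMethod("getBytesRead") // lookup happens once
      Some(() => m.invoke(stats).asInstanceOf[Long])           // per-call: one Method.invoke
    } catch {
      case _: NoSuchMethodException => None                    // method missing: give up cleanly
    }

  def main(args: Array[String]): Unit = {
    val cb = makeBytesReadCallback(new ThreadStats)
    cb.foreach(f => println(f())) // prints 42
  }
}
```

So the per-task cost is one `getMethod` call, and the per-invocation cost is one `Method.invoke` on a cached `Method` - the roughly-2x-a-direct-call overhead discussed above applies only to the latter.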
---