GitHub user sryza commented on the pull request:
https://github.com/apache/spark/pull/7918#issuecomment-138687768
Sorry for the delay here, have been on PTO.
IIUC, the change here makes Spark work with some exotic InputFormats that
it previously did not work with because of thread-safety problems, at the
possible expense of performance. Users can revert to the old behavior with a
config.
There's no associated JIRA, but a6eeb5ffd54956667ec4e793149fdab90041ad6c is
the hash of the change that appears to have introduced the input format cache.
Unfortunately, I don't see any performance numbers there justifying its
addition.
It seems like the main overhead we're trying to avoid is the reflective
calls. What about caching the constructor so that we don't need to look it up
for each task?
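
To make the suggestion concrete, here is a minimal sketch of what I mean by
caching the constructor. It is not code from this PR or from Spark; the
class and method names (ConstructorCache, newInstance, cachedCount) are
made up for illustration, and it assumes the class has a no-arg constructor.
Only the reflective lookup is shared across tasks; every call still gets a
fresh instance, so nothing stateful is shared between threads.

```java
import java.lang.reflect.Constructor;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical helper: caches no-arg constructors, never instances. */
public final class ConstructorCache {
    // One Constructor per class, shared by all threads. The Constructor
    // object itself carries no per-instance state.
    private static final ConcurrentMap<Class<?>, Constructor<?>> CACHE =
        new ConcurrentHashMap<>();

    private ConstructorCache() {}

    /** Returns a new instance of cls, looking the constructor up at most once. */
    public static <T> T newInstance(Class<T> cls) {
        try {
            Constructor<?> ctor = CACHE.computeIfAbsent(cls, ConstructorCache::lookup);
            return cls.cast(ctor.newInstance());
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + cls.getName(), e);
        }
    }

    // The reflective lookup we want to pay for only once per class.
    private static Constructor<?> lookup(Class<?> cls) {
        try {
            Constructor<?> ctor = cls.getDeclaredConstructor();
            ctor.setAccessible(true);
            return ctor;
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException("No no-arg constructor on " + cls.getName(), e);
        }
    }

    /** Number of classes whose constructor has been cached (for inspection). */
    public static int cachedCount() {
        return CACHE.size();
    }
}
```

A task would then call something like ConstructorCache.newInstance(cls) for
its InputFormat class and configure the returned object itself. Whether this
recovers enough of the cost to matter would still need measuring.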
Lastly, is an equivalent change needed for NewHadoopRDD?