Github user steveloughran commented on the pull request:
https://github.com/apache/spark/pull/4491#issuecomment-83615286
1. The `InterfaceAudience.Private` tags in Hadoop are a "please don't
use" hint, although if you look at YARN AMs, they end up importing & using
stuff which is tagged that way; you can't currently write an AM without it. What
it does mean is: they may be unintentionally changed, including signatures and
semantics, and if they break your code, it's your responsibility to find that
out and complain before the next Hadoop release ships. Summary: test against
Hadoop trunk or at least beta releases.
2. The crypto code is still encountering a few stabilisation problems
related to multithreading, stuff that doesn't show up in the unit tests. The
code in 2.6 has already been supplanted by the code in branch-2/trunk. Forking
off your own code means tracking those changes and keeping in sync; keeping
the code in Java would aid diffing and cherry-picking there. Even without
trying to handle the quirks of the extended Hadoop streams, concurrency issues
like [HADOOP-11710](https://issues.apache.org/jira/browse/HADOOP-11710) may
matter.
3. There's also the problem that encryption performance comes from native
binaries, which means for YARN deployments: either bind to the hadoop.so/.dll
on the PATH, or push up a new version & extend PATH in container launch
contexts. On other deployments you have to come up with new solutions. If you
can stick to JCE routines (as this patch does) life may be simpler.
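
To make the "stick to JCE routines" option concrete, here is a minimal sketch
of wrapping byte streams with the standard `javax.crypto` classes only, so no
native library has to be on the PATH. The class name `JceStreamSketch`, its
method names, and the choice of `AES/CTR/NoPadding` are my assumptions for
illustration; this is not the code in the patch, and key/IV distribution is
left out entirely.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical helper: pure-JCE stream encryption, no native binaries. */
final class JceStreamSketch {
    // Assumed transformation; CTR mode keeps ciphertext the same length as plaintext.
    private static final String TRANSFORM = "AES/CTR/NoPadding";

    private JceStreamSketch() {}

    /** Builds a fresh Cipher; one instance per stream, never shared across threads. */
    static Cipher newCipher(int mode, byte[] key, byte[] iv) {
        try {
            Cipher cipher = Cipher.getInstance(TRANSFORM);
            cipher.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            return cipher;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("JCE cipher setup failed", e);
        }
    }

    /** Encrypts by writing through a CipherOutputStream. */
    static byte[] encrypt(byte[] plain, byte[] key, byte[] iv) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out =
                 new CipherOutputStream(sink, newCipher(Cipher.ENCRYPT_MODE, key, iv))) {
            out.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Decrypts by reading through a CipherInputStream until end of stream. */
    static byte[] decrypt(byte[] cipherText, byte[] key, byte[] iv) {
        ByteArrayOutputStream plain = new ByteArrayOutputStream();
        try (InputStream in = new CipherInputStream(
                 new ByteArrayInputStream(cipherText),
                 newCipher(Cipher.DECRYPT_MODE, key, iv))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                plain.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return plain.toByteArray();
    }

    public static void main(String[] args) {
        SecureRandom random = new SecureRandom();
        byte[] key = new byte[16]; // 128-bit AES key, generated here only for the demo
        byte[] iv = new byte[16];  // an IV must never be reused with the same key in CTR mode
        random.nextBytes(key);
        random.nextBytes(iv);

        byte[] plain = "shuffle block".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        byte[] roundTrip = decrypt(encrypt(plain, key, iv), key, iv);
        System.out.println("round trip ok: " + Arrays.equals(plain, roundTrip));
    }
}
```

A `Cipher` is stateful and not safe to share between threads, so the sketch
creates one per stream; that is the cheap way to stay clear of the kind of
concurrency problems mentioned in point 2.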
A standalone security JAR + library would be better, with code shared by
both Hadoop & other apps. You could talk to the Hadoop project about isolating
it in Hadoop itself, though that will imply a separate native build & lib, etc.
The other tactic is to make the shuffle mechanism more pluggable, and on
YARN clusters switch to an encrypted shuffle provided by a separate library, or
use the YARN NM via whatever extension points need to be added. The latter
tactic will avoid any native library path setup issues, and will allow
alternative deployments (standalone, mesos) to switch to an encrypted shuffle
later.
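
As a rough illustration of the "more pluggable shuffle" tactic, the sketch
below shows one possible shape for such an extension point: a small interface
that wraps the shuffle's raw streams, plus a registry that picks an
implementation by name. Everything here (`ShuffleStreamCodec`,
`PlainShuffleCodec`, `ShuffleCodecRegistry`, the `"none"` key) is hypothetical
and invented for this example; none of it is an existing Spark, Hadoop or YARN
API.

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical extension point: wraps the raw shuffle streams. */
interface ShuffleStreamCodec {
    OutputStream wrapForWrite(OutputStream raw);
    InputStream wrapForRead(InputStream raw);
}

/** Default codec: passes shuffle bytes through unchanged. */
final class PlainShuffleCodec implements ShuffleStreamCodec {
    @Override
    public OutputStream wrapForWrite(OutputStream raw) {
        return raw;
    }

    @Override
    public InputStream wrapForRead(InputStream raw) {
        return raw;
    }
}

/** Hypothetical registry: a deployment selects a codec by configured name. */
final class ShuffleCodecRegistry {
    private static final Map<String, Supplier<ShuffleStreamCodec>> CODECS =
        new ConcurrentHashMap<>();

    static {
        // Built-in default; an encrypting codec would be registered by a separate library.
        CODECS.put("none", PlainShuffleCodec::new);
    }

    private ShuffleCodecRegistry() {}

    /** Lets an external library plug in its own codec under a name. */
    static void register(String name, Supplier<ShuffleStreamCodec> factory) {
        CODECS.put(name, factory);
    }

    /** Looks up a codec by name and fails loudly on an unknown name. */
    static ShuffleStreamCodec forName(String name) {
        Supplier<ShuffleStreamCodec> factory = CODECS.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("No shuffle codec registered as '" + name + "'");
        }
        return factory.get();
    }

    public static void main(String[] args) throws Exception {
        ShuffleStreamCodec codec = forName(args.length > 0 ? args[0] : "none");
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = codec.wrapForWrite(sink)) {
            out.write(new byte[] {1, 2, 3});
        }
        System.out.println("wrote " + sink.size() + " bytes via "
            + codec.getClass().getSimpleName());
    }
}
```

With something of this shape, a YARN deployment could register an encrypting
codec shipped in its own JAR, while standalone and Mesos deployments keep the
pass-through default until they have an equivalent.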