[GitHub] spark issue #17725: [SPARK-20435][CORE] More thorough redaction of sensitive...

markgrover Fri, 21 Apr 2017 21:32:40 -0700

Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/17725

Few decisions that I made here:
* I considered if just `sun.java.command` property should be checked and
redacted but that seemed to specific and likely a bandaid to the current
problem, not a long-term solution, so decided against doing it.
* Redaction for the `SparkListenerEnvironmentUpdate` event was solely being
done on `Spark Properties`, while `sun.java.command` is a part of `System
Properties`. I considered doing redaction for `System Properties` in addition
to `Spark Properties` (that would have gone somewhere around
[here](https://github.com/apache/spark/pull/17725/files#diff-e4a5a68c15eed95d038acfed84b0b66aL258))
but decided against it because that would have even more hardcoding and I
didn't see why these 2 special kinds of properties are special enough to be
redacted but the rest of them. So, decided to redact information from all kinds
of properties.
* One way to redact the property value would have been to redact the
minimum possible set from the value while keeping the rest of the value intact.
For example, if the following were the unredacted case:
`"sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf
spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password --conf
spark.other.property=2"`
One option for the redacted output could have been:
`"sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf
spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=*********(redacted) --conf
spark.other.property=2"`
However, such a redaction is very hard to maintain. For example, we would
had to take the current regex (which is `(?i)secret|password` by default and
add matchers to it like so `(?i)secret|password` like
`"("+SECRET_REDACTION_DEFAULT+"[^ ]*=)[^ ]*"`. That would allow us to squeeze
out and replaced just the matched portion. But this all seemed very fragile and
even worse when the user supplies a non-default regex so I decided it was
easiest to simply replace the entire value, even though only a small part of it
contained `secret` or `password` in it.
* One thing which I didn't explicitly check was the performance
implications of this change. The reason I bring this up is because, previously
we were comparing keys with a regex, now if the key
doesn't match, we match the value with the regex. So, in the worst case, we
are twice as many regex matches as before. Also, before we were simply doing
regex matching on `Spark Properties`, now we do them on all properties - `Spark
Properties`, `System Properties`, `JVM Properties` and `Classpath Properties`.
I don't think this should have a big performance impact so I didn't invest time
in it, mentioning here in interest of full disclosure.

Thanks in advance for reviewing.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark issue #17725: [SPARK-20435][CORE] More thorough redaction of sensitive...

Reply via email to