Github user markgrover commented on the issue:

    https://github.com/apache/spark/pull/17725
  
    Few decisions that I made here:
    * I considered if just `sun.java.command` property should be checked and 
redacted but that seemed to specific and likely a bandaid to the current 
problem, not a long-term solution, so decided against doing it.
    * Redaction for the `SparkListenerEnvironmentUpdate` event was solely being 
done on `Spark Properties`, while `sun.java.command` is a part of `System 
Properties`. I considered doing redaction for `System Properties` in addition 
to `Spark Properties` (that would have gone somewhere around 
[here](https://github.com/apache/spark/pull/17725/files#diff-e4a5a68c15eed95d038acfed84b0b66aL258))
 but decided against it because that would have even more hardcoding and I 
didn't see why these 2 special kinds of properties are special enough to be 
redacted but the rest of them. So, decided to redact information from all kinds 
of properties.
    * One way to redact the property value would have been to redact the 
minimum possible set from the value while keeping the rest of the value intact. 
For example, if the following were the unredacted case:
    `"sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf 
spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password --conf 
spark.other.property=2"`
    One option for the redacted output could have been:
    `"sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf 
spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=*********(redacted) --conf 
spark.other.property=2"`
    However, such a redaction is very hard to maintain. For example, we would 
had to take the current regex (which is `(?i)secret|password` by default and 
add matchers to it like so `(?i)secret|password` like 
`"("+SECRET_REDACTION_DEFAULT+"[^ ]*=)[^ ]*"`. That would allow us to squeeze 
out and replaced just the matched portion. But this all seemed very fragile and 
even worse when the user supplies a non-default regex so I decided it was 
easiest to simply replace the entire value, even though only a small part of it 
contained `secret` or `password` in it.
    * One thing which I didn't explicitly check was the performance 
implications of this change. The reason I bring this up is because, previously 
we were comparing keys with a regex, now if the key 
    doesn't match, we match the value with the regex. So, in the worst case, we 
are twice as many regex matches as before. Also, before we were simply doing 
regex matching on `Spark Properties`, now we do them on all properties - `Spark 
Properties`, `System Properties`, `JVM Properties` and `Classpath Properties`. 
I don't think this should have a big performance impact so I didn't invest time 
in it, mentioning here in interest of full disclosure.
    
    Thanks in advance for reviewing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to