I consider the context info to be more important than just logging; at the Hadoop level we use it to attach things like task/job IDs, Kerberos principals, etc. to all store requests. https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/auditing.html
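To make concrete the kind of per-thread context attachment I mean, here is a minimal sketch using SLF4J's MDC as the carrier and copying the map across a thread-pool submission so the correlation survives the thread hop. The key names ("jobId", "principal") and the class are purely illustrative, not the actual Spark or Hadoop audit keys:

```scala
import java.util.concurrent.Executors
import org.slf4j.{LoggerFactory, MDC}

object MdcPropagationSketch {
  private val log = LoggerFactory.getLogger(getClass)
  private val pool = Executors.newFixedThreadPool(4)

  def main(args: Array[String]): Unit = {
    // Hypothetical context keys attached on the submitting thread.
    MDC.put("jobId", "job_0001")
    MDC.put("principal", "alice@EXAMPLE")

    // Snapshot the context before handing work to another thread.
    val captured = MDC.getCopyOfContextMap

    pool.submit(new Runnable {
      override def run(): Unit = {
        val previous = MDC.getCopyOfContextMap
        if (captured != null) MDC.setContextMap(captured) // restore on the worker thread
        try {
          // Anything logged here now carries jobId/principal in its MDC fields,
          // so a structured layout can emit them with every event.
          log.info("issuing store request")
        } finally {
          if (previous != null) MDC.setContextMap(previous) else MDC.clear()
        }
      }
    })
    pool.shutdown()
  }
}
```

The Hadoop audit spans do more than this (they propagate through submitter/task pools automatically and flow into the store requests themselves), but the same "capture on submit, restore on execute" pattern is the core of it.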
So worrying about how to pass and manage that at the thread level matters. Hadoop does various things like creating audit spans and propagating them, along with the IOStatistics context, through submitter/task pools, so as to better correlate all activity. After all, if something deleted all the files in a table, you want to know who and what, just as you want to understand what is overloading your store with too many GET requests.

You also need to think about cluster-wide context info, such as the specific SQL query triggering all the work. This is context info which the Spark driver has to generate and which is then propagated down to the worker threads across the cluster. I'd like that in the Hadoop auditing stuff too. If you are targeting Hadoop 3.3.2+ only, I can provide some guidance, as having it go further down the stack can only be advantageous.

I actually played with generating Log4j output as JSON lines a long, long time ago.
https://hadoop.apache.org/docs/r2.8.2/hadoop-project-dist/hadoop-common/api/org/apache/hadoop/log/Log4Json.html
The files get really big fast, so I'd recommend considering Avro as an option from the outset. It's also really good to still be able to use ripgrep to scan a few GB of gzipped output *and get all messages a few seconds either side*.

One interesting place for the structured output is actually test reports. Everyone is a bit crippled here because they are stuck using an XML test result format which was written purely to match our XSL understanding at the time (2000!) rather than thinking of the need to correlate integration test output. It might be that scalatest can support better structure here than a simple model of "tests and stack traces" + stdout/stderr. While rethinking test output formats is quite a radical undertaking, and not one I would recommend to people, leaving the option open would be good.

Finally, if you plan for the structured output to be stable over time, you're going to have to have a policy in place about making those logs part of the Spark public API, because they will implicitly become that. The HDFS audit log is one such log; I think there may actually even be some regression tests.

On Mon, 11 Mar 2024 at 01:36, Gengliang Wang <ltn...@gmail.com> wrote:

> Hi all,
>
> I'd like to start the vote for SPIP: Structured Logging Framework for
> Apache Spark
>
> References:
>
> - JIRA ticket <https://issues.apache.org/jira/browse/SPARK-47240>
> - SPIP doc
> <https://docs.google.com/document/d/1rATVGmFLNVLmtxSpWrEceYm7d-ocgu8ofhryVs4g3XU/edit?usp=sharing>
> - Discussion thread
> <https://lists.apache.org/thread/gocslhbfv1r84kbcq3xt04nx827ljpxq>
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don't think this is a good idea because …
>
> Thanks!
> Gengliang Wang
>