I consider the context info more important than the logging itself; at the
Hadoop level we use it to attach things like task/job IDs, kerberos
principals etc. to all store requests.
https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/auditing.html

So worrying about how to pass and manage that at the thread level matters.
Hadoop does various things like creating audit spans and propagating them,
along with the IOStatistics context, through submitter/task thread pools,
so as to better correlate all activity. After all, if something deleted all
the files in a table, you want to know who and what, just as you want to
understand what is overloading your store with too many GET requests.
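
To make the thread-level part concrete, here is a minimal sketch (my own
illustration, not the Hadoop auditing code) of attaching per-request
context to the current thread via SLF4J's MDC, so everything logged while
the work runs carries the same attribution:

    import org.slf4j.MDC

    object ContextScope {
      // run `body` with the key/values attached to this thread's MDC,
      // restoring whatever was there before once it finishes
      def withContext[T](kvs: (String, String)*)(body: => T): T = {
        val previous = kvs.map { case (k, _) => k -> Option(MDC.get(k)) }
        kvs.foreach { case (k, v) => MDC.put(k, v) }
        try body
        finally previous.foreach {
          case (k, Some(old)) => MDC.put(k, old)
          case (k, None)      => MDC.remove(k)
        }
      }
    }

    // usage: every log line inside the block picks up the ids
    // ContextScope.withContext("jobId" -> "42", "principal" -> "alice") {
    //   ... issue store requests, log as usual ...
    // }

The real auditing code also has to hand this context across thread-pool
and executor boundaries, which is where most of the work is.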

You also need to think about cluster-wide context info, such as the
specific SQL query triggering all the work. This is context info which the
Spark driver has to generate and which is then propagated down to the
worker threads across the cluster. I'd like that in the Hadoop auditing
stuff too. If you are targeting Hadoop 3.3.2+ only, I can provide some
guidance, as having it go further down the stack can only be advantageous.
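
As a rough sketch of what that propagation can look like with existing
Spark APIs (local properties set on the driver travel with each task); the
property name "my.query.tag" is invented purely for illustration:

    import org.apache.spark.{SparkContext, TaskContext}

    // driver side: local properties are shipped with every task
    def tagQuery(sc: SparkContext, tag: String): Unit =
      sc.setLocalProperty("my.query.tag", tag)

    // executor side, inside a running task: read the same property back,
    // e.g. to push it into the logging context for that worker thread
    def currentQueryTag(): Option[String] =
      Option(TaskContext.get())
        .flatMap(tc => Option(tc.getLocalProperty("my.query.tag")))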

I actually played with generating Log4j output as JSON lines a long, long time ago.
https://hadoop.apache.org/docs/r2.8.2/hadoop-project-dist/hadoop-common/api/org/apache/hadoop/log/Log4Json.html
The files get really big fast. I'd recommend considering Avro as an option
from the outset. It's also really good to still be able to use ripgrep to
scan a few GB of gzipped output *and get all messages a few seconds either
side*.
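
(ripgrep's -z/--search-zip flag is what makes the gzipped case work from
the shell.) As a rough illustration of why plain line-oriented output
stays so convenient, a few lines of Scala are enough to scan a gzipped
JSON-lines log with no special tooling at all:

    import java.io.{BufferedReader, FileInputStream, InputStreamReader}
    import java.util.zip.GZIPInputStream
    import scala.util.Using

    // print every line in a gzipped JSON-lines log containing `needle`
    def grepGz(path: String, needle: String): Unit =
      Using.resource(new BufferedReader(new InputStreamReader(
          new GZIPInputStream(new FileInputStream(path)), "UTF-8"))) { in =>
        Iterator.continually(in.readLine()).takeWhile(_ != null)
          .filter(_.contains(needle))
          .foreach(println)
      }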

One interesting place for the structured output is actually test reports.
Everyone is a bit crippled here because they are stuck using an XML test
result format which was written purely to match our XSL understanding at
the time (2000!) rather than thinking of the need to correlate integration
test output. It might be that scalatest can support better structure here
than a simple model of "tests and stack traces" + stdout/stderr. While
rethinking test output formats is quite a radical undertaking and not one I
would recommend to people, leaving the option open would be good.
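
For what it's worth, scalatest already exposes a hook that could carry
richer structure than the XML: a custom Reporter receives typed events
rather than a flattened report. A very rough sketch (the JSON shape is
invented here, and escaping/error handling is skipped):

    import org.scalatest.Reporter
    import org.scalatest.events.{Event, TestFailed, TestSucceeded}

    class JsonLineReporter extends Reporter {
      // emit one JSON line per test outcome; other events are ignored here
      def apply(event: Event): Unit = event match {
        case e: TestSucceeded =>
          println(s"""{"status":"ok","suite":"${e.suiteName}","test":"${e.testName}"}""")
        case e: TestFailed =>
          println(s"""{"status":"fail","suite":"${e.suiteName}","test":"${e.testName}","msg":"${e.message}"}""")
        case _ =>
      }
    }

I believe this gets wired in with the runner's -C <classname> option.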

Finally, if you plan for the structured output to be stable over time,
you're going to have to have a policy in place about making those logs
part of the Spark public API, because they will implicitly become part of
it. The HDFS audit log is one such log; I think there may actually even be
some regression tests.
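
Something as small as the sketch below is the sort of regression test I
have in mind, pinning the field names consumers will grow to depend on
(the "ts"/"level"/"msg" names are invented for illustration, not from the
SPIP):

    import org.scalatest.funsuite.AnyFunSuite

    class LogSchemaSuite extends AnyFunSuite {
      test("structured log line keeps its stable keys") {
        // in a real test this line would come from the logging layer itself
        val line = """{"ts":"2024-03-11T00:00:00Z","level":"INFO","msg":"hello"}"""
        for (key <- Seq("\"ts\"", "\"level\"", "\"msg\"")) {
          assert(line.contains(key), s"missing stable field $key")
        }
      }
    }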

On Mon, 11 Mar 2024 at 01:36, Gengliang Wang <ltn...@gmail.com> wrote:

> Hi all,
>
> I'd like to start the vote for SPIP: Structured Logging Framework for
> Apache Spark
>
> References:
>
>    - JIRA ticket <https://issues.apache.org/jira/browse/SPARK-47240>
>    - SPIP doc
>    
> <https://docs.google.com/document/d/1rATVGmFLNVLmtxSpWrEceYm7d-ocgu8ofhryVs4g3XU/edit?usp=sharing>
>    - Discussion thread
>    <https://lists.apache.org/thread/gocslhbfv1r84kbcq3xt04nx827ljpxq>
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks!
> Gengliang Wang
>
