[DISCUSS] The way we log

Štefan Miklošovič Wed, 29 May 2024 14:25:22 -0700

There is a ticket I have a mental block to merge (1). The way it happened
is that in CASSANDRA-19429 a user identified that there is a logging
statement which acquires a lock which slows down query execution. When
reworked, under CASSANDRA-19429, a reporter saw ridiculous performance
gains on machines like r8g.24xlarge and r7i.24xlarge (numbers in the
ticket). The diff was around 3x speed up which is quite hard to believe and
myself with Jon Haddad tried to replicate that but we were not able to do
it reliably because our machines were way smaller and we have not had
resources to verify that.


Anyway, that got me thinking ... If there are, potentially, so low-hanging
performance gains just by rewriting how we log, are not there any other
cases like that where we could see the performance gains too? Heck ... how
do we even log anyway? That got me rabbit hole-ing, I went through the
codebase and looked at what our logging on TRACE looks like and it resulted
in (1).

The way we log is pretty inconsistent and it seems to me that various
people approach the logging by some "beliefs" they try to follow and I am
not completely sure where that comes from and whether other people look at
it the same way.

I saw a lot of cases like this:

if (logger.isTraceEnabled())
    logger.trace("a log message");

and sometimes just:

    logger.trace("a log message");

Why do we do it once like that and another time the other way?

When we do "if (logger.isTraceEnabled())" ..., if we are _not_ on TRACE, we
check it at most once. But if we are on TRACE, we actually check that
_twice_. The first time in "if" and the second time in "logger.trace()".
But that is not necessary, because logger.trace() is checking that anyway,
we do not need to wrap it. People probably think that if they just wrap it
in "if", it will be faster, but I do not see any meaningful differences
when I benchmark it and the documentation for slf4j tells the same. The
difference is negligible and in the realm of nanoseconds (!!!).

On the other hand, logger.isTraceEnabled() is handy to use if there is some
heavy-weight object to log, like this

if (logger.isTraceEnabled()
    logger.trace("a log message {}",
object.takesALongTimeAndMemoryToConstructString());

If it takes a considerable amount of time/memory to log, more than the
evaluation of isTraceEnabled itself, then it indeed makes sense to log it
like that.

It does not make sense to wrap it in "if" when we do this:

if (logger.isTraceEnabled())
    logger.trace("a log message {}", object);

The reason why is that logger.trace() itself evaluates if it is on TRACE or
not and "object.toString()" will be called _after_ we are 100% sure we are
indeed going to log it. So wrapping it is not necessary again and
redundant.

For cases like this, this is not necessary to wrap either:

if (logger.isTraceEnabled())
    logger.trace("a log message {}, {}", object1, object2);

However, when there is a log with more than 2 arguments, like below,
according to the documentation, there is a hidden cost of constructing an
array of three objects. For one or two objects, no array, for 3 and more,
hidden object creation, even short-lived.

if (logger.isTraceEnabled())
    logger.trace("a log message {}, {}, {}", object1, object2, object3);

This is wrong too:

logger.trace("a log message " + object1 + " " + object2 + " " + object3);

Because there is a cost of concatenating the strings and implicitly calling
.toString() on each object. It has to be evaluated every single time, even
though we are not going to log it on TRACE, because trace() needs to have a
string parameter regardless of whether we are logging or not.

Based on these findings, I went through the code and I have incorporated
these rules and I rewrote it like this:

1) no wrapping in "if" if we are not logging more than 2 parameters.
2) rewritten log messages to not contain any string concatenation but
moving it all to placeholders ({}).
3) wrap it in "if" if we need to execute a method(s) on parameter(s) which
is resource-consuming.

I realized I was not able to merge CASSANDRA-19632 because I just invented
these rules on my own without consulting that with anybody else and I do
not want to merge it and then people come after me that it is wrong or they
would question that.

Do these rules make sense to everybody and do you think this is something
we can stick to? This is quite hard to check-style so I am more for a
written consensus on a ML.

(1) https://issues.apache.org/jira/browse/CASSANDRA-19632

[DISCUSS] The way we log

Reply via email to