leventov opened a new issue #7362: Principles for choosing logging severity
URL: https://github.com/apache/incubator-druid/issues/7362
 
 
   ### Motivation
   
   There are no guidelines for choosing logging severity (fatal, error, 
warning, info, debug, and trace) for developers. Logging is inconsistent in 
different classes.
   
   ### Proposed principles
   
   #### Assertion errors
   Currently, Druid's `Logger` class has `wtf` methods which make error-level 
statements with "WTF?! :" prefix. First of all, I suggest to rename it to 
`assertionError`. They have "ASSERTION_ERROR" prefix instead.
   
   These are for the situations when coding assumptions are broken, something 
that is considered impossible happens. We would otherwise throw 
`AssertionError` if Druid didn't want to be more resilient.
   
   Note: something like catching `IOException` from network or disk I/O methods 
shouldn't log on this level, because such methods can throw `IOException`. This 
should be reserved for situations that are considered truly impossible on "Java 
language" level.
   
   Note: sometimes, InterruptedExceptions should be logged on this level 
(rather than ERROR), when we truly don't except that anybody would interrupt 
our call. 
   
   Action for Druid cluster operators:
    - Track "ASSERTION_ERROR"
    - Always file an issue to Druid if "ASSERTION_ERROR" appears in logs.
   
   #### Error
   
    - Failed query (due to insufficient resources, exceptions, etc.)
    - Failed network or disk I/O operation
    - Failed to load a segment on a node
    - Failed task
    - Timeouts
    - Usually, code like `catch (Exception e) {}` should log on this level.
   
   Note: in retry code, each failed our timed out attempt should log on this 
level, not the warning level.
   
   Action for Druid cluster operators:
    - Track "ERROR".
    - *Consider* filing an issue to Druid if "ERROR" (without "ASSERTION_ERROR" 
following) appears in logs. Not all such errors are problems with Druid, e. g. 
problem with network in the cluster may result in a spike in I/O action and 
query errors.
   
   #### Warning
   
    - Calling to an endpoint that was renamed, but the former name is still 
accepted for backward compatibility (e. g. see #7361)
    - Druid detects something that may be a sign of misconfiguration of a 
query, or a node, or the whole cluster (e. g. too much time spend in balancing 
segments)
    - Druid detects that some cluster resource is about to exhaust (e. g. 
capacity of a tier)
   
   Action for Druid cluster operators:
    - Track "WARNING".
    - Consider taking some action: call into new endpoint name, reconfigure the 
cluster, add historical nodes, etc. if "WARNING" appears in logs.
   
   If there are no bugs in Druid code, "WARNING" should never be something 
Druid cluster operators file an issue about directly. This is another criterion 
for choosing between the error and the warning levels for any logging statement.
   
   #### Info
   
   Some "interesting" information that may be helpful for introspection of 
Druid node or cluster operation, but doesn't qualify to be a warning.
   
   Action for Druid cluster operators: none, unless during an investigation of 
some bug or error condition.
   
   Note: information that is too mundane to be useful even for bug or error 
investigation should not be logged as info (and not logged at all, as proposed 
below)
   
   #### Debug, Trace
   
   These levels are not used in production code. Logging statements on these 
levels can be added during testing to inspect something (if debugging doesn't 
work or help) but should be removed before committing code.
   
   Rationale: if a logging statement doesn't deserve to be "info", it's net 
value by contaminating the code is negative.
   
   ### Future work
   
   Fixing all existing logging statements in the codebase is out of the scope 
of this proposal. It should be a gradual community effort after the principles 
are worked out.
   
   After we remove all debug and trace logging from production code (maybe 
making some of them "info" instead), we can add static analysis checks to 
prohibit adding new such statements in production code.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to