Csaba Ringhofer created IMPALA-7178:
---------------------------------------
Summary: Reduce logging for common data errors
Key: IMPALA-7178
URL: https://issues.apache.org/jira/browse/IMPALA-7178
Project: IMPALA
Issue Type: Improvement
Components: Backend
Reporter: Csaba Ringhofer
Assignee: Csaba Ringhofer
Some data errors (for example, out-of-range Parquet timestamps) can dominate
the logs if a table contains a large number of rows with invalid data. If an
error has its own error code (see common/thrift/generate_error_codes.py), then
its occurrences are already aggregated for the user (RuntimeState::LogError())
in every query, but the logs will contain a new line for every occurrence. This
is not very useful most of the time, as the log lines repeat the same
information (the corrupt data itself is not logged, as it can be sensitive
information).
The best approach would be to reduce logging without losing information:
- the first occurrence of an error should be logged (per
query/fragment/table/file/column) to help investigation of cases where the data
error leads to other errors and to avoid breaking log analyzer tools that
search for the current format
- other occurrences can be aggregated, like "in query Q table T column C XY
error occurred N times"
An extra goal is to avoid calling RuntimeState::LogError() for occurrences
other than the first one, as RuntimeState::LogError() uses a lock.
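
The idea above could be sketched roughly as follows. This is only an
illustration, not existing Impala code: ErrorSite, ErrorAggregator, Record()
and Summaries() are hypothetical names, and the sketch assumes one aggregator
instance per scanner thread so that counting needs no lock. The caller would
write the usual full log line (and call RuntimeState::LogError()) only when
Record() returns true, and would emit the summary lines once at the end.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical key describing where an error occurred (not an Impala type).
struct ErrorSite {
  std::string query_id;
  std::string table;
  std::string column;
  int error_code;

  // Ordering so that ErrorSite can be used as a std::map key.
  bool operator<(const ErrorSite& o) const {
    return std::tie(query_id, table, column, error_code) <
           std::tie(o.query_id, o.table, o.column, o.error_code);
  }
};

// Hypothetical aggregator, assumed to be owned by a single thread,
// so it counts occurrences without taking any lock.
class ErrorAggregator {
 public:
  // Returns true only for the first occurrence at this site; the caller
  // logs the full message in that case and stays silent otherwise.
  bool Record(const ErrorSite& site) { return ++counts_[site] == 1; }

  // Builds one summary line per site that saw repeated occurrences.
  // Sites with a single occurrence are skipped: they were already logged.
  std::vector<std::string> Summaries() const {
    std::vector<std::string> lines;
    for (const auto& entry : counts_) {
      const ErrorSite& site = entry.first;
      const int64_t n = entry.second;
      if (n <= 1) continue;
      lines.push_back("in query " + site.query_id + " table " + site.table +
                      " column " + site.column + " error " +
                      std::to_string(site.error_code) + " occurred " +
                      std::to_string(n) + " times");
    }
    return lines;
  }

 private:
  std::map<ErrorSite, int64_t> counts_;
};
```

With this shape, the locked path is reached once per distinct site rather
than once per bad row; whether per-file or per-fragment granularity is the
right key is left open here, as in the description above.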
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)