[jira] [Updated] (IMPALA-7178) Reduce logging for common data errors

Csaba Ringhofer (JIRA) Fri, 15 Jun 2018 07:05:32 -0700


     [ 
https://issues.apache.org/jira/browse/IMPALA-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Csaba Ringhofer updated IMPALA-7178:
------------------------------------
    Description: 
Some data errors (for example out-of-range Parquet timestamps) can dominate 
logs if a table contains a large number of rows with invalid data. If an error 
has its own error code (see common/thrift/generate_error_codes.py), then its 
occurrences are already aggregated for the user (RuntimeState::LogError()) in 
every query, but the logs will contain a new line for every occurrence. This is 
not very useful most of the time, as the log lines repeat the same information 
(the corrupt data itself is not logged as it can be sensitive information).

The best approach would be to reduce logging without losing information:
- the first occurrence of an error should be logged (per 
query/fragment/table/file/column) to help the investigation of cases where the 
data error leads to other errors and to avoid breaking log analyzer tools that 
search for the current format
- other occurrences can be aggregated, like "in query Q table T column C XY 
error occurred N times"
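
The two points above could be sketched roughly as follows. This is only an 
illustration, not existing Impala code: ErrorKey, ErrorLogAggregator, Record() 
and Summaries() are hypothetical names, and the key here uses only 
query/table/column/error name, while the real granularity (fragment, file) 
is still open.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical aggregation key; the fields are assumptions for illustration.
struct ErrorKey {
  std::string query_id;
  std::string table;
  std::string column;
  std::string error_name;

  bool operator<(const ErrorKey& o) const {
    return std::tie(query_id, table, column, error_name) <
           std::tie(o.query_id, o.table, o.column, o.error_name);
  }
};

// Hypothetical helper: counts occurrences per key and tells the caller
// whether the full log line should still be written.
class ErrorLogAggregator {
 public:
  // Returns true only for the first occurrence of 'key', i.e. when the
  // caller should emit the log line in the current (unchanged) format.
  bool Record(const ErrorKey& key) { return ++counts_[key] == 1; }

  // Builds one summary line for every key that occurred more than once.
  std::vector<std::string> Summaries() const {
    std::vector<std::string> lines;
    for (const auto& entry : counts_) {
      assert(entry.second > 0);  // Every stored key was recorded at least once.
      if (entry.second < 2) continue;
      const ErrorKey& k = entry.first;
      lines.push_back("in query " + k.query_id + " table " + k.table +
                      " column " + k.column + " " + k.error_name +
                      " error occurred " + std::to_string(entry.second) +
                      " times");
    }
    return lines;
  }

 private:
  std::map<ErrorKey, int64_t> counts_;
};
```

A caller would write the usual log line only when Record() returns true and 
write the lines from Summaries() once, for example when the query finishes.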

An extra goal is to avoid calling RuntimeState::LogError() for occurrences 
other than the first one, as RuntimeState::LogError() uses a (per 
fragment) lock.
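
One possible way to keep the lock off the hot path is an atomic counter that 
only falls through to the locked call for the first occurrence. The sketch 
below is an assumption about how this could look, not the planned 
implementation: LockedErrorLog is a stand-in that merely mimics a 
lock-protected LogError(), and ErrorCounter is a hypothetical name.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <string>
#include <vector>

// Stand-in for a lock-protected, per-fragment error log. This is NOT
// Impala's RuntimeState; it only models "LogError() takes a lock".
class LockedErrorLog {
 public:
  void LogError(const std::string& msg) {
    std::lock_guard<std::mutex> l(lock_);
    messages_.push_back(msg);
  }

  // Number of messages stored, which equals the number of locked calls.
  size_t num_messages() {
    std::lock_guard<std::mutex> l(lock_);
    return messages_.size();
  }

 private:
  std::mutex lock_;
  std::vector<std::string> messages_;
};

// Hypothetical counter for one kind of error. Only the first occurrence
// reaches the locked log; later ones are a single atomic increment.
class ErrorCounter {
 public:
  explicit ErrorCounter(LockedErrorLog* log) : log_(log) {}

  void OnError(const std::string& msg) {
    // fetch_add returns the previous value, so exactly one caller sees 0
    // and takes the lock, even if several threads report concurrently.
    if (count_.fetch_add(1, std::memory_order_relaxed) == 0) {
      log_->LogError(msg);
    }
  }

  int64_t count() const { return count_.load(std::memory_order_relaxed); }

 private:
  LockedErrorLog* log_;
  std::atomic<int64_t> count_{0};
};
```

The final count() could then be reported once in an aggregated line, so the 
number of occurrences is still not lost.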


  was:
Some data errors (for example out-of-range parquet timestamps) can dominate 
logs if a table contains a large number of rows with invalid data. If an error 
has its own error code (see common/thrift/generate_error_codes.py), then these 
errors are already aggregated to the user (RuntimeState::LogError()) for every 
query, but the logs will contain a new line for every occurrence. This not too 
useful most of times, as the log lines will repeat  the same information (the 
corrupt data itself is not logged as it can be sensitive information).

The best would to reduce logging without loosing information:
- the first occurrence of an error should be logged (per 
query/fragment/table/file/column) to help investigation of cases where the data 
error leads to other errors and to avoid breaking log analyzer tools that 
search for the current format
- other occurrences can be aggregated, like "in query Q table T column C XY 
error occurred N times"

An extra goal is to avoid calling RuntimeState::LogError() for other 
occurrences than the first one, as RuntimeState::LogError() uses a lock.



> Reduce logging for common data errors
> -------------------------------------
>
>                 Key: IMPALA-7178
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7178
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Csaba Ringhofer
>            Assignee: Csaba Ringhofer
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
