[ 
https://issues.apache.org/jira/browse/IMPALA-7756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer updated IMPALA-7756:
------------------------------------
    Description: 
Errors in data files can be handled in several ways:
a. the error can lead to a non-ok STATUS returned to the user
b. some errors are ignored intentionally to support files written by buggy 
writers from the past or to avoid the performance cost of checking / complexity 
of bubbling up the status
c. some errors lead to hitting a DCHECK - this means that it will be ignored in 
release builds, but will cause crash in debug builds

c. is problematic because fuzz tests can generate such corrupt files, leading 
to crashed (and flaky) tests. On the other side, ignoring errors can mean that 
bugs in writers are not detected by Impala's tests.

c. could be replaced with DCHECK like macros  that are switched off if a flag 
(e.g. 'strict_parsers') is false. The flag would have no effect in release 
builds, and would be true by default during tests, while some specific tests 
(fuzz tests + parser tests for corrupt files) would start impalads with the 
flag turned off.

The drawback would be that related tests would have to be custom cluster tests, 
making the test suite slower. The main benefit is that it is easy to add such 
checks to the code and they can provide guardrails during file writer 
development.

An alternative approach would be to implement this as a query option, which 
would make tests simpler and faster, but would make the backend changes less 
trivial.

There are already macros that seem to have the same purpose (e.g. 
FILE_CHECK_GE),, but they are implemented as simple DCHECKs (see 
https://github.com/apache/impala/blob/15e8ce4f273945ce548fe677ee0140dea8068e6d/be/src/util/debug-util.h#L153I)

  was:
Errors in data files can be handled in several ways:
a. the error can lead to a non-ok STATUS returned to the user
b. some errors are ignored intentionally to support files written by buggy 
writers from the past or to avoid the performance cost of checking / complexity 
of bubbling up the status
c. some errors lead to hitting a DCHECK - this means that it will be ignored in 
release builds, but will cause crash in debug builds

c. is problematic because fuzz tests can generate such corrupt files, leading 
to crashed (and flaky) tests. On the other side, ignoring errors can mean that 
bugs in writers are not detected by Impala's tests.

c. could be replaced with DCHECK like macros  that are switched off if a flag 
(e.g. 'strict_parsers') is false. The flag would have no effect in release 
builds, and would be true by default during tests, while some specific tests 
(fuzz tests + parser tests for corrupt files) would start impalads with the 
flag turned off.

The drawback would be that related tests would have to be custom cluster tests, 
making the test suite slower. The main benefit is that it is easy to add such 
checks to the code and they can provide guardrails during file writer 
development.

An alternative approach would be implement this as a query option, which would 
make tests simpler and faster, but would make the backend changes less trivial.

There are already macros that seem to have the same purpose (e.g. 
FILE_CHECK_GE),, but they are implemented as simple DCHECKs (see 
https://github.com/apache/impala/blob/15e8ce4f273945ce548fe677ee0140dea8068e6d/be/src/util/debug-util.h#L153I)


> Create a "strict mode" for file parsing in debug builds
> -------------------------------------------------------
>
>                 Key: IMPALA-7756
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7756
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend, Infrastructure
>            Reporter: Csaba Ringhofer
>            Priority: Minor
>              Labels: parquet
>
> Errors in data files can be handled in several ways:
> a. the error can lead to a non-ok STATUS returned to the user
> b. some errors are ignored intentionally to support files written by buggy 
> writers from the past or to avoid the performance cost of checking / 
> complexity of bubbling up the status
> c. some errors lead to hitting a DCHECK - this means that it will be ignored 
> in release builds, but will cause crash in debug builds
> c. is problematic because fuzz tests can generate such corrupt files, leading 
> to crashed (and flaky) tests. On the other side, ignoring errors can mean 
> that bugs in writers are not detected by Impala's tests.
> c. could be replaced with DCHECK like macros  that are switched off if a flag 
> (e.g. 'strict_parsers') is false. The flag would have no effect in release 
> builds, and would be true by default during tests, while some specific tests 
> (fuzz tests + parser tests for corrupt files) would start impalads with the 
> flag turned off.
> The drawback would be that related tests would have to be custom cluster 
> tests, making the test suite slower. The main benefit is that it is easy to 
> add such checks to the code and they can provide guardrails during file 
> writer development.
> An alternative approach would be to implement this as a query option, which 
> would make tests simpler and faster, but would make the backend changes less 
> trivial.
> There are already macros that seem to have the same purpose (e.g. 
> FILE_CHECK_GE),, but they are implemented as simple DCHECKs (see 
> https://github.com/apache/impala/blob/15e8ce4f273945ce548fe677ee0140dea8068e6d/be/src/util/debug-util.h#L153I)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to