[
https://issues.apache.org/jira/browse/IMPALA-12254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Smith updated IMPALA-12254:
-----------------------------------
Component/s: Backend
> Inconsistent empty string handling in CSV parser
> ------------------------------------------------
>
> Key: IMPALA-12254
> URL: https://issues.apache.org/jira/browse/IMPALA-12254
> Project: IMPALA
> Issue Type: Bug
> Components: Backend, be
> Reporter: Felix Bier
> Priority: Minor
>
> h3. Steps to reproduce:
> CSV table:
> {code:java}
> ,
> ,, {code}
> DDL:
>
>
> {code:java}
> CREATE EXTERNAL TABLE tab1
> (
> col_1 VARCHAR(256),
> col_2 VARCHAR(256),
> col_3 VARCHAR(256)
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> LOCATION '/tmp/tab1'; {code}
> h3. Expected behavior:
> Querying tab1 should consistently either give all nulls or all empty strings.
> h3. Actual behavior:
> Some fields are NULL and some fields are empty string.
>
> {code:java}
> [localhost.localdomain:21050] default> select col1 is NULL, col2 is NULL,
> col3 is NULL from tab1;
> +---------------+---------------+---------------+
> | col_1 is null | col_2 is null | col_3 is null |
> +---------------+---------------+---------------+
> | false | true | false |
> | false | false | true |
> | false | false | true |
> +---------------+---------------+---------------+
> {code}
> h3. Analysis:
> Problem 1:
>
> DelimitedTextParser has function FillColumns for filling missing
> columns. This function creates a local variable for representing the fill
> value:
> {code:java}
> char* dummy = NULL;
> if (last_column == NULL) last_column = &dummy; {code}
> The last_column pointer is passed to AddColumns, which will derefence the
> pointer and perform increment. As last_column points to dummy, the increment
> is applied to dummy, so after the call to AddColumns, dummy is NULL +1. Each
> further iteration in the loop in FillColumns will further increment dummy.
> This results in non-null pointers being set as the column value, which
> results in WriteSlot not entering the NULL case and writing an empty string
> instead.
>
> Problem 2:
> {color:#268bd2}ParseFieldLocations{color} makes calls to AddColumn with
> len=0, but non-null
> column pointers. These are interpreted as empty string instead of SQL-NULL
> in the result.
>
> h3. Suggested fix:
>
> Problem 1: The pointer increment in AddColumn could be guarded so that it is
> only performed on non-NULL pointers:
> {code:java}
> diff --git a/be/src/exec/delimited-text-parser.inline.h
> b/be/src/exec/delimited-text-parser.inline.h
> index 266fa06dd..868f4e9bc 100644
> --- a/be/src/exec/delimited-text-parser.inline.h
> +++ b/be/src/exec/delimited-text-parser.inline.h
> @@ -70,7 +70,9 @@ inline Status
> DelimitedTextParser<DELIMITED_TUPLES>::AddColumn(int64_t len,
> ++(*num_fields);
> }
> if (PROCESS_ESCAPES) current_column_has_escape_ = false;
> - *next_column_start += len + 1;
> + if (*next_column_start != NULL) {
> + *next_column_start += len + 1;
> + }
> ++column_idx_;
> return Status::OK();
> }
> {code}
> With this fix, querying the example tables gives:
> {code:java}
> [localhost.localdomain:21050] default> select col_1 is NULL, col_2 is NULL,
> col_3 is NULL from tab1;
> +---------------+---------------+---------------+
> | col_1 is null | col_2 is null | col_3 is null |
> +---------------+---------------+---------------+
> | false | true | true |
> | false | false | true |
> | false | false | true |
> +---------------+---------------+---------------+
> {code}
> Problem 2: Here it is not clear to what degree that is expected behavior. If
> this is not expected behavior, I would expect that some calls to AddColumn
> would have to be changed such that they pass a nullptr for the current column
> position if len=0.
> If the behavior is intended, the rule that is applied is not clear to me. For
> example, if the rule is that each empty token should be represented as empty
> string and each delimiter starts a new token, then the first two lines in the
> output (after applying fix for problem 1) make sense. First line in CSV file
> is empty, so it is interpreted as one empty token and then two NULL values
> are filled in. Second line contains "<empty token> <delimiter> <empty
> token>", so one NULL value gets filled in. But then the third line is not
> consistent, as it contains "<empty token> <delimiter> <empty token>
> <delimiter> <empty token> <eof>", so I would expect all three values to be
> non-NULL if this rule is followed.
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]