[jira] [Updated] (IMPALA-12254) Inconsistent empty string handling in CSV parser

Michael Smith (Jira) Tue, 16 Jul 2024 16:58:04 -0700


     [ 
https://issues.apache.org/jira/browse/IMPALA-12254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Michael Smith updated IMPALA-12254:
-----------------------------------
    Component/s: Backend

> Inconsistent empty string handling in CSV parser
> ------------------------------------------------
>
>                 Key: IMPALA-12254
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12254
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend, be
>            Reporter: Felix Bier
>            Priority: Minor
>
> h3. Steps to reproduce:
> CSV table:
> {code:java}
> ,
> ,, {code}
> DDL:
>  
>  
> {code:java}
> CREATE EXTERNAL TABLE tab1
> (
>    col_1 VARCHAR(256),
>    col_2 VARCHAR(256),
>    col_3 VARCHAR(256)
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> LOCATION '/tmp/tab1'; {code}
> h3. Expected behavior:
> Querying tab1 should consistently either give all nulls or all empty strings.
> h3. Actual behavior:
> Some fields are NULL and some fields are empty string.
>  
> {code:java}
> [localhost.localdomain:21050] default> select col1 is NULL, col2 is NULL, 
> col3 is NULL from tab1;
> +---------------+---------------+---------------+
> | col_1 is null | col_2 is null | col_3 is null |
> +---------------+---------------+---------------+
> | false         | true          | false         |
> | false         | false         | true          |
> | false         | false         | true          |
> +---------------+---------------+---------------+
>  {code}
> h3. Analysis:
> Problem 1:
>  
> DelimitedTextParser has function FillColumns for filling missing
> columns. This function creates a local variable for representing the fill 
> value:
> {code:java}
>   char* dummy = NULL;
>   if (last_column == NULL) last_column = &dummy; {code}
> The last_column pointer is passed to AddColumns, which will derefence the 
> pointer and perform increment. As last_column points to dummy, the increment 
> is applied to dummy, so after the call to AddColumns, dummy is NULL +1. Each 
> further iteration in the loop in FillColumns will further increment dummy. 
> This results in non-null pointers being set as the column value, which 
> results in WriteSlot not entering the NULL case and writing an empty string 
> instead.
>  
> Problem 2:
> {color:#268bd2}ParseFieldLocations{color} makes calls to AddColumn with 
> len=0, but non-null
> column pointers. These are interpreted as empty string instead of SQL-NULL
> in the result.
>  
> h3. Suggested fix:
>  
> Problem 1:  The pointer increment in AddColumn could be guarded so that it is 
> only performed on non-NULL pointers:
> {code:java}
> diff --git a/be/src/exec/delimited-text-parser.inline.h 
> b/be/src/exec/delimited-text-parser.inline.h
> index 266fa06dd..868f4e9bc 100644
> --- a/be/src/exec/delimited-text-parser.inline.h
> +++ b/be/src/exec/delimited-text-parser.inline.h
> @@ -70,7 +70,9 @@ inline Status 
> DelimitedTextParser<DELIMITED_TUPLES>::AddColumn(int64_t len,
>      ++(*num_fields);
>    }
>    if (PROCESS_ESCAPES) current_column_has_escape_ = false;
> -  *next_column_start += len + 1;
> +  if (*next_column_start != NULL) {
> +    *next_column_start += len + 1;
> +  }
>    ++column_idx_;
>    return Status::OK();
>  }
> {code}
> With this fix, querying the example tables gives:
> {code:java}
> [localhost.localdomain:21050] default> select col_1 is NULL, col_2 is NULL, 
> col_3 is NULL from tab1;
> +---------------+---------------+---------------+
> | col_1 is null | col_2 is null | col_3 is null |
> +---------------+---------------+---------------+
> | false         | true          | true          |
> | false         | false         | true          |
> | false         | false         | true          |
> +---------------+---------------+---------------+
>  {code}
> Problem 2: Here it is not clear to what degree that is expected behavior. If 
> this is not expected behavior, I would expect that some calls to AddColumn 
> would have to be changed such that they pass a nullptr for the current column 
> position if len=0.
> If the behavior is intended, the rule that is applied is not clear to me. For 
> example, if the rule is that each empty token should be represented as empty 
> string and each delimiter starts a new token, then the first two lines in the 
> output (after applying fix for problem 1) make sense. First line in CSV file 
> is empty, so it is interpreted as one empty token and then two NULL values 
> are filled in. Second line contains "<empty token> <delimiter> <empty 
> token>", so one NULL value gets filled in. But then the third line is not 
> consistent, as it contains "<empty token> <delimiter> <empty token> 
> <delimiter> <empty token> <eof>", so I would expect all three values to be 
> non-NULL if this rule is followed.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (IMPALA-12254) Inconsistent empty string handling in CSV parser

Reply via email to