[ 
https://issues.apache.org/jira/browse/IMPALA-14116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18064262#comment-18064262
 ] 

Fang-Yu Rao commented on IMPALA-14116:
--------------------------------------

Thanks for the detailed explanation [~stigahuang]!

Now I understand the issue a bit more (I hope).
 * In {{HdfsOrcScanner::PrepareInListPredicate()}} 
([https://github.com/apache/impala/blob/be4778a/be/src/exec/orc/hdfs-orc-scanner.cc#L1282-L1296]),
 we construct {{in_list}} using {{GetSearchArgumentLiteral()}}.

{code:java}
bool HdfsOrcScanner::PrepareInListPredicate(uint64_t orc_column_id,
    const ColumnType& type, ScalarExprEvaluator* eval,
    orc::SearchArgumentBuilder* sarg) {
  std::vector<orc::Literal> in_list;
  // Initialize 'predicate_type' to avoid clang-tidy warning.
  orc::PredicateDataType predicate_type = orc::PredicateDataType::BOOLEAN;
  for (int i = 1; i < eval->root().children().size(); ++i) {
    // ORC reader only supports pushing down predicates whose constant parts
    // are literal.
    // FE shouldn't push down any non-literal expr here.
    DCHECK(eval->root().GetChild(i)->IsLiteral())
        << "Non-literal constant expr cannot be used";
    in_list.emplace_back(
        GetSearchArgumentLiteral(eval, i, type, &predicate_type));
  }
  return PrepareInListPredicate(orc_column_id, type, in_list, sarg);
}
{code}
 
 * When given a NULL literal, {{GetSearchArgumentLiteral()}} returns 
{{orc::Literal(predicate_type)}}. As a result, the {{orc::Literal(bool)}} 
constructor at 
[https://github.com/apache/orc/blob/v1.7.9/c%2B%2B/src/sargs/Literal.cc#L58C1-L66C4]
 ends up being called during the execution of 
{{PrepareInListPredicate(orc_column_id, type, in_list, sarg)}}. This can be 
seen in the resolved stack trace 
("{{impalad!orc::Literal::Literal(bool) [Literal.cc : 65 + 0x5]}}").

{code:java}
  Literal::Literal(bool val) {
    mType = PredicateDataType::BOOLEAN;
    mValue.BooleanVal = val;
    mSize = sizeof(val);
    mIsNull = false;
    mPrecision = 0;
    mScale = 0;
    mHashCode = hashCode();
  }
{code}
 
 * Eventually, we will hit the check at 
[https://github.com/apache/orc/blob/v1.7.9/c%2B%2B/src/sargs/PredicateLeaf.cc#L146]
 as you pointed out.
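
To illustrate the second bullet, here is a minimal, self-contained sketch of how a call spelled {{orc::Literal(predicate_type)}} could land in the {{bool}} constructor. It assumes {{predicate_type}} is still a pointer at that point (the caller above passes {{&predicate_type}}); I have not confirmed that this is exactly what happens in {{GetSearchArgumentLiteral()}}. The {{Literal}} and {{PredicateDataType}} types below are simplified stand-ins, not the real ORC classes, and both helper functions are hypothetical.

```cpp
// Simplified stand-in for orc::PredicateDataType (assumption, not the real enum).
enum class PredicateDataType { LONG, STRING, BOOLEAN };

// Simplified stand-in for orc::Literal with only the two constructors that
// matter for this illustration.
class Literal {
 public:
  // Builds a typed NULL literal.
  Literal(PredicateDataType type)
      : type_(type), is_null_(true), bool_val_(false) {}
  // Builds a non-NULL BOOLEAN literal.
  Literal(bool val)
      : type_(PredicateDataType::BOOLEAN), is_null_(false), bool_val_(val) {}

  PredicateDataType type() const { return type_; }
  bool is_null() const { return is_null_; }
  bool bool_val() const { return bool_val_; }

 private:
  PredicateDataType type_;
  bool is_null_;
  bool bool_val_;
};

// Hypothetical helper: the pointer itself is passed to the constructor.
// A pointer cannot convert to the enum, but it does convert to bool, so
// overload resolution picks Literal(bool) with a value of true.
Literal MakeNullLiteralFromPointer(PredicateDataType* predicate_type) {
  *predicate_type = PredicateDataType::STRING;
  return Literal(predicate_type);
}

// Hypothetical helper: dereferencing first selects Literal(PredicateDataType),
// which yields the typed NULL literal that was presumably intended.
Literal MakeNullLiteralFromValue(PredicateDataType* predicate_type) {
  *predicate_type = PredicateDataType::STRING;
  return Literal(*predicate_type);
}
```

Under these assumptions the first helper produces a non-NULL BOOLEAN literal even though the column is a STRING, which would be consistent with both the stack trace and the type check in {{PredicateLeaf.cc}}.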

> Consider erroring out earlier if NULL is on the IN-list of a table scan 
> against an ORC table
> --------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-14116
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14116
>             Project: IMPALA
>          Issue Type: Improvement
>            Reporter: Fang-Yu Rao
>            Assignee: Fang-Yu Rao
>            Priority: Major
>         Attachments: resolved_crashed_thread.txt
>
>
> We found that if NULL is included in the IN-list of a table scan against an 
> ORC table, Impala daemons can crash. This can be reproduced with the 
> following steps.
> # Create the database and an ORC table under the database in impala-shell.
> {code}
> create database test_db_04;
> CREATE EXTERNAL TABLE test_db_04.test_tbl_01 (customer_id STRING) 
> PARTITIONED BY (ingest_date STRING) 
> WITH SERDEPROPERTIES ('serialization.format'='1') 
> STORED AS ORC;
> {code}
> # Insert a row into the ORC table just created via beeline.
> {code}
> INSERT INTO test_db_04.test_tbl_01 partition (ingest_date='2025-05-29') 
> values ('CUST001');
> {code}
> # Execute the following query via impala-shell.
> {code}
> SELECT ingest_date, customer_id
> FROM test_db_04.test_tbl_01 WHERE ingest_date > DATE '2024-09-30' AND 
> customer_id IN ('', NULL)
> GROUP BY 1, 2;
> {code}
> An Impala daemon crashes during the execution of the ORC table scan. The 
> stack trace of the crashed thread in the resolved minidump is provided in 
> [^resolved_crashed_thread.txt].
> We should consider erroring out earlier, possibly during query analysis, if 
> NULL is on the IN-list of a table scan against an ORC table, so that no 
> Impala daemon crashes.
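
The early check proposed in the issue description could look roughly like the sketch below. It is only an illustration of the idea: {{InListItem}}, {{ContainsNullLiteral()}} and {{ValidateInListForOrcPushdown()}} are hypothetical names that do not exist in Impala, and an empty {{std::optional}} is used here to model a NULL literal.

```cpp
#include <algorithm>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical model of an IN-list item: std::nullopt stands for a NULL literal.
using InListItem = std::optional<std::string>;

// Returns true if any item of the IN-list is a NULL literal.
bool ContainsNullLiteral(const std::vector<InListItem>& in_list) {
  return std::any_of(in_list.begin(), in_list.end(),
      [](const InListItem& item) { return !item.has_value(); });
}

// Rejects the IN-list up front, before any search argument would be built,
// so the query fails with an error instead of reaching the ORC reader.
void ValidateInListForOrcPushdown(const std::vector<InListItem>& in_list) {
  if (ContainsNullLiteral(in_list)) {
    throw std::invalid_argument(
        "NULL in IN-list is not supported for ORC predicate pushdown");
  }
}
```

For the query in the reproduction steps, the IN-list {{('', NULL)}} would be modelled as one empty string plus one {{std::nullopt}}, and the validation would throw. Whether to raise an error or simply skip the pushdown for such a predicate is a separate design choice.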


