[
https://issues.apache.org/jira/browse/IMPALA-14116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18064262#comment-18064262
]
Fang-Yu Rao commented on IMPALA-14116:
--------------------------------------
Thanks for the detailed explanation [~stigahuang]!
Now I understand the issue a bit more (I hope).
* In {{HdfsOrcScanner::PrepareInListPredicate()}}
([https://github.com/apache/impala/blob/be4778a/be/src/exec/orc/hdfs-orc-scanner.cc#L1282-L1296]),
we construct {{in_list}} using {{GetSearchArgumentLiteral()}}.
{code:cpp}
bool HdfsOrcScanner::PrepareInListPredicate(uint64_t orc_column_id,
    const ColumnType& type, ScalarExprEvaluator* eval,
    orc::SearchArgumentBuilder* sarg) {
  std::vector<orc::Literal> in_list;
  // Initialize 'predicate_type' to avoid clang-tidy warning.
  orc::PredicateDataType predicate_type = orc::PredicateDataType::BOOLEAN;
  for (int i = 1; i < eval->root().children().size(); ++i) {
    // ORC reader only supports pushing down predicates whose constant parts
    // are literal.
    // FE shouldn't push down any non-literal expr here.
    DCHECK(eval->root().GetChild(i)->IsLiteral())
        << "Non-literal constant expr cannot be used";
    in_list.emplace_back(GetSearchArgumentLiteral(eval, i, type,
        &predicate_type));
  }
  return PrepareInListPredicate(orc_column_id, type, in_list, sarg);
}
{code}
* When given a NULL literal, {{GetSearchArgumentLiteral()}} returns
{{orc::Literal(predicate_type)}}. As a result, the {{orc::Literal(bool)}}
constructor at
[https://github.com/apache/orc/blob/v1.7.9/c%2B%2B/src/sargs/Literal.cc#L58C1-L66C4]
is called during the execution of
{{PrepareInListPredicate(orc_column_id, type, in_list, sarg)}}. This can be
seen in the resolved stack trace
("{{impalad!orc::Literal::Literal(bool) [Literal.cc : 65 + 0x5]}}").
{code:cpp}
Literal::Literal(bool val) {
  mType = PredicateDataType::BOOLEAN;
  mValue.BooleanVal = val;
  mSize = sizeof(val);
  mIsNull = false;
  mPrecision = 0;
  mScale = 0;
  mHashCode = hashCode();
}
{code}
* Eventually, we will hit the check at
[https://github.com/apache/orc/blob/v1.7.9/c%2B%2B/src/sargs/PredicateLeaf.cc#L146]
as you pointed out.
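To make the idea of an earlier check concrete, here is a minimal, standalone sketch of a guard that inspects the IN-list before any ORC search argument is built. It is only an illustration under stated assumptions: {{InListValue}}, {{InListHasNull()}} and {{CanPushDownInList()}} are hypothetical names, the types are stand-ins rather than Impala's or ORC's real classes, and it is not necessarily how the fix should be implemented.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for one IN-list element; this is NOT an Impala or ORC
// type. std::nullopt models a SQL NULL literal.
using InListValue = std::optional<std::string>;

// Returns true if any element of the IN-list is a NULL literal.
bool InListHasNull(const std::vector<InListValue>& in_list) {
  for (const InListValue& value : in_list) {
    if (!value.has_value()) return true;
  }
  return false;
}

// Hypothetical guard: returns true only if the IN-list is non-empty and
// contains no NULL literal. A caller could skip the pushdown (or report an
// error) when this returns false, instead of building the predicate.
bool CanPushDownInList(const std::vector<InListValue>& in_list) {
  return !in_list.empty() && !InListHasNull(in_list);
}
```

With the IN-list from the reproduction, {{('', NULL)}}, {{CanPushDownInList()}} returns false, so in this sketch the predicate would never reach the ORC library. Whether to skip the pushdown or to raise an error at that point is a design choice that is left open here.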
> Consider erroring out earlier if NULL is on the IN-list of a table scan
> against an ORC table
> --------------------------------------------------------------------------------------------
>
> Key: IMPALA-14116
> URL: https://issues.apache.org/jira/browse/IMPALA-14116
> Project: IMPALA
> Issue Type: Improvement
> Reporter: Fang-Yu Rao
> Assignee: Fang-Yu Rao
> Priority: Major
> Attachments: resolved_crashed_thread.txt
>
>
> We found that if NULL is included in the IN-list of a table scan
> against an ORC table, Impala daemons could crash. This can be reproduced as
> follows.
> # Create the database and an ORC table under the database in impala-shell.
> {code}
> create database test_db_04;
> CREATE EXTERNAL TABLE test_db_04.test_tbl_01 (customer_id STRING)
> PARTITIONED BY (ingest_date STRING)
> WITH SERDEPROPERTIES ('serialization.format'='1')
> STORED AS ORC;
> {code}
> # Insert a row into the ORC table just created via beeline.
> {code}
> INSERT INTO test_db_04.test_tbl_01 partition (ingest_date='2025-05-29')
> values ('CUST001');
> {code}
> # Execute the following query via impala-shell.
> {code}
> SELECT ingest_date, customer_id
> FROM test_db_04.test_tbl_01 WHERE ingest_date > DATE '2024-09-30' AND
> customer_id IN ('', NULL)
> GROUP BY 1, 2;
> {code}
> An Impala daemon would crash during the execution of the ORC table scan. The
> stack trace of the crashed thread in the resolved minidump is also provided
> in [^resolved_crashed_thread.txt].
> We should consider erroring out earlier if NULL is on the IN-list of a table
> scan against an ORC table to prevent any Impala daemon from crashing, maybe
> during the query analysis.