[ 
https://issues.apache.org/jira/browse/IMPALA-14116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18063966#comment-18063966
 ] 

Quanlong Huang edited comment on IMPALA-14116 at 3/9/26 6:13 AM:
-----------------------------------------------------------------

Looking deeper into this issue, I found another cause. The crash is due to the
ORC lib throwing an exception that complains about a wrong literal type:
{code:cpp}
      case Operator::IN:
        ...
        for (auto literal : mLiterals) {
          if (static_cast<int>(literal.getType()) != static_cast<int>(mType)) {
            throw std::invalid_argument("leaf and literal types do not match!");
          }
        }
        break;{code}
[https://github.com/apache/orc/blob/v1.7.9/c%2B%2B/src/sargs/PredicateLeaf.cc#L146]

Checked the code where we create the NULL orc::Literal:
{code:cpp}
orc::Literal HdfsOrcScanner::GetSearchArgumentLiteral(ScalarExprEvaluator* eval,
    int child_idx, const ColumnType& dst_type,
    orc::PredicateDataType* predicate_type) {
    ...
    case TYPE_STRING: {
      // should use *predicate_type here!
      if (UNLIKELY(!val)) return orc::Literal(predicate_type);
      const StringValue* sv = reinterpret_cast<StringValue*>(val);
      return orc::Literal(sv->Ptr(), sv->Len());{code}
[https://github.com/apache/impala/blob/caeacdf331136b25669e08a7b1cc8ce9e4c1122d/be/src/exec/orc/hdfs-orc-scanner.cc#L1195]

We are actually using a wrong constructor of orc::Literal. The correct one 
expects a PredicateDataType, but we are passing a pointer to it.
{code:cpp}
    /**
     * Create a literal of null value for a specific type
     */
    Literal(PredicateDataType type);{code}
[https://github.com/apache/orc/blob/95c927e3f9e015f2bc66b0220a5a797edc682ab4/c%2B%2B/include/orc/sargs/Literal.hh#L75]

Passing the pointer makes the call resolve to the boolean constructor of
orc::Literal instead, so the literal's type does not match the leaf's type,
which leads to the exception.
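
To illustrate the overload resolution involved, here is a minimal standalone
sketch. MockPredicateDataType, MockLiteral, MakeNullLiteralBuggy,
MakeNullLiteralFixed and TypesMatch are hypothetical stand-ins written for this
example; they are not the real ORC or Impala code and only mirror the two
constructors and the type comparison quoted above.

```cpp
#include <cassert>

// Hypothetical stand-in for orc::PredicateDataType (values are illustrative).
enum class MockPredicateDataType { LONG, FLOAT, STRING, BOOLEAN };

// Hypothetical stand-in for orc::Literal with only the two constructors
// that matter for this example.
class MockLiteral {
 public:
  // Null literal of the given type (the constructor we intend to call).
  MockLiteral(MockPredicateDataType type) : type_(type), is_null_(true) {}
  // Non-null boolean literal.
  MockLiteral(bool /*val*/)
      : type_(MockPredicateDataType::BOOLEAN), is_null_(false) {}

  MockPredicateDataType type() const { return type_; }
  bool is_null() const { return is_null_; }

 private:
  MockPredicateDataType type_;
  bool is_null_;
};

// Buggy variant: a pointer cannot convert to the scoped enum, but it does
// convert implicitly to bool, so the bool constructor is selected.
MockLiteral MakeNullLiteralBuggy(MockPredicateDataType* predicate_type) {
  *predicate_type = MockPredicateDataType::STRING;
  return MockLiteral(predicate_type);
}

// Fixed variant: dereferencing selects the null-literal constructor.
MockLiteral MakeNullLiteralFixed(MockPredicateDataType* predicate_type) {
  *predicate_type = MockPredicateDataType::STRING;
  return MockLiteral(*predicate_type);
}

// Same comparison as the IN-list validation quoted above.
bool TypesMatch(const MockLiteral& literal, MockPredicateDataType leaf_type) {
  return static_cast<int>(literal.type()) == static_cast<int>(leaf_type);
}
```

In this sketch the buggy variant yields a non-null BOOLEAN literal, so
TypesMatch against a STRING leaf returns false, while the fixed variant yields
a null STRING literal that passes the check.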


was (Author: stiga-huang):
Looking deeper into this issue, I found another cause. The crash is due to the
ORC lib throwing an exception that complains about a wrong literal type:
{code:cpp}
      case Operator::IN:
        validateColumn();
        if (mLiterals.size() < 2) {
          throw std::invalid_argument("At least two literals are required!");
        }
        for (auto literal : mLiterals) {
          if (static_cast<int>(literal.getType()) != static_cast<int>(mType)) {
            throw std::invalid_argument("leaf and literal types do not match!");
          }
        }
        break;{code}
[https://github.com/apache/orc/blob/v1.7.9/c%2B%2B/src/sargs/PredicateLeaf.cc#L146]

Checked the code where we create the NULL orc::Literal:
{code:cpp}
    case TYPE_STRING: {
      // should use *predicate_type here!
      if (UNLIKELY(!val)) return orc::Literal(predicate_type);
      const StringValue* sv = reinterpret_cast<StringValue*>(val);
      return orc::Literal(sv->Ptr(), sv->Len());{code}
[https://github.com/apache/impala/blob/caeacdf331136b25669e08a7b1cc8ce9e4c1122d/be/src/exec/orc/hdfs-orc-scanner.cc#L1195]

We are actually using a wrong constructor of orc::Literal. The correct one 
expects a PredicateDataType, but we are passing a pointer to it.
{code:cpp}
    /**
     * Create a literal of null value for a specific type
     */
    Literal(PredicateDataType type);{code}
[https://github.com/apache/orc/blob/95c927e3f9e015f2bc66b0220a5a797edc682ab4/c%2B%2B/include/orc/sargs/Literal.hh#L75]

Passing the pointer makes the call resolve to the boolean constructor of
orc::Literal instead, so the literal's type does not match the leaf's type,
which leads to the exception.

> Consider erroring out earlier if NULL is on the IN-list of a table scan 
> against an ORC table
> --------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-14116
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14116
>             Project: IMPALA
>          Issue Type: Improvement
>            Reporter: Fang-Yu Rao
>            Assignee: Fang-Yu Rao
>            Priority: Major
>         Attachments: resolved_crashed_thread.txt
>
>
> We found that if NULL is included in the IN-list of a table scan against an
> ORC table, Impala daemons can crash. This can be reproduced with the
> following steps.
> # Create the database and an ORC table under the database in impala-shell.
> {code}
> create database test_db_04;
> CREATE EXTERNAL TABLE test_db_04.test_tbl_01 (customer_id STRING) 
> PARTITIONED BY (ingest_date STRING) 
> WITH SERDEPROPERTIES ('serialization.format'='1') 
> STORED AS ORC;
> {code}
> # Insert a row into the ORC table just created via beeline.
> {code}
> INSERT INTO test_db_04.test_tbl_01 partition (ingest_date='2025-05-29') 
> values ('CUST001');
> {code}
> # Execute the following query via impala-shell.
> {code}
> SELECT ingest_date, customer_id
> FROM test_db_04.test_tbl_01 WHERE ingest_date > DATE '2024-09-30' AND 
> customer_id IN ('', NULL)
> GROUP BY 1, 2;
> {code}
> An Impala daemon would crash during the execution of the ORC table scan. The 
> stack trace of the crashed thread in the resolved minidump is also provided 
> in [^resolved_crashed_thread.txt].
> We should consider erroring out earlier if NULL is on the IN-list of a table 
> scan against an ORC table to prevent any Impala daemon from crashing, maybe 
> during the query analysis.



