pengdou1990 created IMPALA-10809:
------------------------------------

             Summary: improve the performance of unnest operation
                 Key: IMPALA-10809
                 URL: https://issues.apache.org/jira/browse/IMPALA-10809
             Project: IMPALA
          Issue Type: Improvement
            Reporter: pengdou1990


h2. current situation

Impala's support for complex data types is not particularly friendly.

For example, if you need to expand rows containing Array type fields, you need 
to unnest the array fields first, and then do a nested loop join.

If you need to expand multiple array fields, you need to do multiple unnests, 
And perform multiple unest and nested loop joins, which puts a lot of 
computational pressure on the executor. 

DDL:
CREATE TABLE rawdata.users2 (                                     
  day INT,                                                        
  sampling_group INT,                                             
  user_id BIGINT,                                                 
  time TIMESTAMP,                                                 
  _offset BIGINT,                                                 
  event_id INT,                                                   
  month_id INT,                                                   
  week_id INT,                                                    
  distinct_id STRING,                                             
  event_bucket INT,                                               
  adresses_list_string ARRAY<STRING>,                             
  count_list_bigint ARRAY<BIGINT>                                 
)                                                                 
WITH SERDEPROPERTIES ('serialization.format'='1')                 
STORED AS PARQUET                                                 
LOCATION 'hdfs://localhost:20500/test-warehouse/rawdata.db/users2'
Query SQL:
SELECT
    `day`,
    list`.item,
    list1.item 
FROM
    rawdata.users2,
    rawdata.users2.adresses_list_string list1,
    rawdata.users2.count_list_bigint list2
Simplified Plan:
F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
|
07:EXCHANGE [UNPARTITIONED]
|
01:SUBPLAN
|
|--06:NESTED LOOP JOIN [CROSS JOIN]
|  |
|  |--04:UNNEST [users2.count_list_bigint clist]
|  |
|  05:NESTED LOOP JOIN [CROSS JOIN]
|  |
|  |--02:SINGULAR ROW SRC
|  |
|  03:UNNEST [users2.adresses_list_string list]
|
00:SCAN HDFS [rawdata.users2, RANDOM]
h2. Improve Solution

In actual use, I found that if some changes are made to the calculation logic 
of unnest, the calculation performance will be greatly improved:

At first, in FE construct and new plan type, named explode node, it and it's 
child node construct a pipeline operation

then, in BE, the raw was explode locally, and the fileds layout as childnode

the query sql and Plan greatly simplified:

Query SQL:
SELECT
    `day`,
    explode(adresses_list_string),
    explode(count_list_bigint) 
from
    rawdata.users2
the simplified Plan as this:
F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
|
02:EXCHANGE [UNPARTITIONED]
|
01:EXPLODE NODE [UNPARTITIONED] 
|
00:SCAN HDFS [rawdata.users2, RANDOM]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to