lgbo-ustc opened a new issue, #8675:
URL: https://github.com/apache/incubator-gluten/issues/8675

   ### Description
   
   Following query needs to scan the table `nation` multiple times.
   ```sql
   select n_regionkey , n_name, '0' as t, avg(n_nationkey) + count(n_name) as s 
from  nation where n_nationkey % 3 = 0 group by n_name,  n_regionkey 
   union all
   select n_regionkey , n_name, '1' as t, avg(n_nationkey) + count(n_name) as s 
from nation where n_nationkey % 2 = 1 group by n_name, n_regionkey 
   ```
   
   Following is the physical plan
   ```
   AdaptiveSparkPlan isFinalPlan=false
   +- Union
      :- HashAggregate(keys=[n_name#5, _groupingexpression#24L], 
functions=[avg(cast(n_nationkey#4L as double)), count(n_name#5)], 
output=[(n_regionkey + 1)#18L, n_name#5, t#0, s#1])
      :  +- Exchange hashpartitioning(n_name#5, _groupingexpression#24L, 5), 
ENSURE_REQUIREMENTS, [plan_id=38]
      :     +- HashAggregate(keys=[n_name#5, _groupingexpression#24L], 
functions=[partial_avg(cast(n_nationkey#4L as double)), 
partial_count(n_name#5)], output=[n_name#5, _groupingexpression#24L, sum#29, 
count#30L, count#31L])
      :        +- Project [n_nationkey#4L, n_name#5, (n_regionkey#6L + cast(1 
as bigint)) AS _groupingexpression#24L]
      :           +- Filter (isnotnull(n_nationkey#4L) AND ((n_nationkey#4L % 
cast(2 as bigint)) = cast(0 as bigint)))
      :              +- FileScan parquet 
tpch_pq.nation[n_nationkey#4L,n_name#5,n_regionkey#6L]
      +- HashAggregate(keys=[n_name#9, _groupingexpression#25L], 
functions=[avg(cast(n_nationkey#8L as double)), count(n_name#9)], 
output=[(n_regionkey + 1)#19L, n_name#9, t#2, s#3])
         +- Exchange hashpartitioning(n_name#9, _groupingexpression#25L, 5), 
ENSURE_REQUIREMENTS, [plan_id=40]
            +- HashAggregate(keys=[n_name#9, _groupingexpression#25L], 
functions=[partial_avg(cast(n_nationkey#8L as double)), 
partial_count(n_name#9)], output=[n_name#9, _groupingexpression#25L, sum#35, 
count#36L, count#37L])
               +- Project [n_nationkey#8L, n_name#9, (n_regionkey#10L + cast(1 
as bigint)) AS _groupingexpression#25L]
                  +- Filter (isnotnull(n_nationkey#8L) AND ((n_nationkey#8L % 
cast(2 as bigint)) = cast(1 as bigint)))
                     +- FileScan parquet 
tpch_pq.nation[n_nationkey#8L,n_name#9,n_regionkey#10L]
   ```
   
   When the table is too large, the `scan` takes up most of the execution time.
   
   We could rewrite the query as following, to make `nation` be scanned only 
once.
   ```sql
   
   select n_regionkey, n_name, t, avg(n_nationkey) + count(n_name) from (
     select st.n_regionkey as n_regionkey, st.n_name as n_name, st.t as t, 
st.n_nationkey as n_nationkey, st.gid as gid from (
       select st from (
           select explode(a) as st from (
             select array(
               if( n_nationkey % 3 = 0, named_struct('n_regionkey', 
n_regionkey, 'n_nationkey', n_nationkey, 'n_name', n_name, 't', '0', 'gid', 0), 
null), 
               if(n_nationkey % 2, named_struct('n_regionkey', n_regionkey, 
'n_nationkey', n_nationkey, 'n_name', n_name, 't', '1', 'gid', 1), null)) as a 
from (
               select * from nation where n_nationkey % 3 = 0 or n_nationkey % 2
             )
           )
       ) where st is not null
     )
   ) group by n_regionkey, n_nationkey, t, gid
   
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to