lgbo-ustc opened a new issue, #7905:
URL: https://github.com/apache/incubator-gluten/issues/7905

   ### Description
   
   Finding the top k rows of `row_number` is a common case. We compare 
following two ways
   
   #### use aggregation
   ```sql
   -- CH
   SELECT
       x,
       y[1]
   FROM
   (
       SELECT
           x,
           groupArray(1)(y) AS y
       FROM
       (
           SELECT
               rand() % 100000 AS x,
               rand() AS y
           FROM numbers(10000000)
       )
       GROUP BY x
   )
   FORMAT `Null`
   
   Query id: 46a5e846-5324-4da5-bd7e-51d873a5f6e2
   
   Ok.
   
   0 rows in set. Elapsed: 0.561 sec. Processed 10.00 million rows, 80.00 MB 
(17.82 million rows/s., 142.54 MB/s.)
   Peak memory usage: 13.01 MiB.
   ```
   
   #### use window function
   ```sql
   SELECT *
   FROM
   (
       SELECT
           x,
           y,
           row_number() OVER (PARTITION BY x ORDER BY y ASC) AS n
       FROM
       (
           SELECT
               rand() % 100000 AS x,
               rand() AS y
           FROM numbers(10000000)
       )
   )
   WHERE n <= 1
   FORMAT `Null`
   
   Query id: fd2ff438-2ab9-4644-ada4-3de48dec0eb4
   
   Ok.
   
   0 rows in set. Elapsed: 3.872 sec. Processed 10.00 million rows, 80.00 MB 
(2.58 million rows/s., 20.66 MB/s.)
   Peak memory usage: 79.58 MiB.
   ```
   
   The aggregation implement is more efficient .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to