alamb commented on issue #7460:
URL: https://github.com/apache/arrow-rs/issues/7460#issuecomment-2843506718

   Here is the `test.parquet` file being created by the benchmark: 
[test.zip](https://github.com/user-attachments/files/19985142/test.zip)
   
   The equivalent numbers are:
   
   * Selectivity is: `80147` / `100000` = `0.8`
   * Number of RowSelections = `32010` (rows where the filter result differs from the previous row)
   * Average run length of each RowSelection: `100000 / 32010` = `3.1`
   
   So in other words, I think the filter benchmark doesn't quite match what is in the ClickBench file.
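
For reference, here is a minimal Python sketch of how these three numbers relate to a boolean filter mask. It is not part of the benchmark or of arrow-rs: `selection_stats` and the toy `mask` are hypothetical names made up for illustration, and the mask is assumed to be non-empty. Note that it counts maximal runs, which is one more than the number of rows whose filter result differs from the previous row (the `false` count in the query below); the difference is negligible at this scale.

```python
from itertools import groupby


def selection_stats(mask):
    """Summarize a non-empty boolean filter mask (hypothetical helper)."""
    total = len(mask)
    selected = sum(mask)  # True counts as 1
    # Each maximal run of identical values is one select-or-skip run.
    run_lengths = [sum(1 for _ in group) for _, group in groupby(mask)]
    return {
        "selectivity": selected / total,
        "num_runs": len(run_lengths),
        "avg_run_length": total / len(run_lengths),
    }


# Toy mask with four runs: [T, T, T], [F], [T, T], [F, F]
mask = [True, True, True, False, True, True, False, False]
stats = selection_stats(mask)
print(stats)
```

A high selectivity combined with a short average run length, as in the numbers above, means the selection alternates between selecting and skipping every few rows.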
   
   
   <details><summary>Details</summary>
   <p>
   
   
   ```sql
   > select count(*)  from '/tmp/test.parquet' where "utf8View" <> '';
   +----------+
   | count(*) |
   +----------+
   | 80147    |
   +----------+
   1 row(s) fetched.
   Elapsed 0.015 seconds.
   
   > select count(*)  from '/tmp/test.parquet';
   +----------+
   | count(*) |
   +----------+
   | 100000   |
   +----------+
   1 row(s) fetched.
   Elapsed 0.004 seconds.
   
   >
   WITH
   hits as (
     SELECT
       "utf8View",
       row_number() OVER () as rn
     FROM
       '/tmp/test.parquet'
   )
   ,results as (
     SELECT
       rn,
       "utf8View",
       "utf8View" <> '',
       ("utf8View" <> '') = (LAG("utf8View" <> '', 1) OVER ()) as "filter_same_as_previous"
     FROM
      hits
   )
   SELECT
     filter_same_as_previous, COUNT(*)
   FROM results
   GROUP BY
     filter_same_as_previous
   --LIMIT 10
   ;
   
   +-------------------------+----------+
   | filter_same_as_previous | count(*) |
   +-------------------------+----------+
   | NULL                    | 1        |
   | true                    | 67989    |
   | false                   | 32010    |
   +-------------------------+----------+
   3 row(s) fetched.
   Elapsed 0.017 seconds.
   ```
   
   
   
   </p>
   </details> 

