Re: [I] arrow_reader_row_filter benchmark doesn't capture page cache improvements [arrow-rs]

via GitHub Wed, 30 Apr 2025 13:02:49 -0700


alamb commented on issue #7460:
URL: https://github.com/apache/arrow-rs/issues/7460#issuecomment-2843142141


   I did some analysis on `hits.parquet`:
   
   * Selectivity is: `13172392` / `99997497` = `0.132`
   * Average run length of each `RowSelection`: `99997497` / `14054784` = 
`7.114`
   
   
   <details><summary>Worked Details</summary>
   <p>
   
   ```sql
   > SELECT count("SearchPhrase") FROM hits_partitioned;
   +--------------------------------------+
   | count(hits_partitioned.SearchPhrase) |
   +--------------------------------------+
   | 99997497                             |
   +--------------------------------------+
   
   > SELECT count("SearchPhrase") FROM hits_partitioned WHERE "SearchPhrase" <> 
'';
   +--------------------------------------+
   | count(hits_partitioned.SearchPhrase) |
   +--------------------------------------+
   | 13172392                             |
   +--------------------------------------+
   ```
   
   So selectivity is: 13172392 / 99997497 = 0.1317272171
   
   To figure out the pattern of passing rows, we want to know how many 
contiguous rows there are in the matched filter
   
   ```
   (t)
   (t)
   (t)
   (f)
   (f)
   ```
   
   Would have two selections (3 true, 2 false)
   
   Here is the sql:
   
   ```sql
   SET datafusion.execution.target_partitions = 1;
   
   WITH
   hits as (
     SELECT
       "SearchPhrase",
       row_number() OVER () as rn
     FROM
       'hits.parquet'
   )
   ,results as (
     SELECT
       rn,
       "SearchPhrase",
       "SearchPhrase" <> '',
       ("SearchPhrase" <> '') = (LAG("SearchPhrase" <> '', 1) OVER ()) as 
"filter_same_as_previous"
     FROM
      hits
   )
   SELECT
     filter_same_as_previous, COUNT(*)
   FROM results
   GROUP BY
     filter_same_as_previous
   --LIMIT 10
   ;
   
   +-------------------------+----------+
   | filter_same_as_previous | count(*) |
   +-------------------------+----------+
   | NULL                    | 1        |
   | true                    | 85942712 |
   | false                   | 14054784 |
   +-------------------------+----------+
   ```
   
   </p>
   </details> 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [I] arrow_reader_row_filter benchmark doesn't capture page cache improvements [arrow-rs]

Reply via email to