alamb commented on issue #7460:
URL: https://github.com/apache/arrow-rs/issues/7460#issuecomment-2843142141

   I did some analysis on `hits.parquet`:
   
   * Selectivity is: `13172392` / `99997497` = `0.132`
   * Average run length of each `RowSelection`: `99997497` / `14054784` = 
`7.114`
   
   
   <details><summary>Worked Details</summary>
   <p>
   
   ```sql
   > SELECT count("SearchPhrase") FROM hits_partitioned;
   +--------------------------------------+
   | count(hits_partitioned.SearchPhrase) |
   +--------------------------------------+
   | 99997497                             |
   +--------------------------------------+
   
   > SELECT count("SearchPhrase") FROM hits_partitioned WHERE "SearchPhrase" <> 
'';
   +--------------------------------------+
   | count(hits_partitioned.SearchPhrase) |
   +--------------------------------------+
   | 13172392                             |
   +--------------------------------------+
   ```
   
   So selectivity is: 13172392 / 99997497 = 0.1317272171
   
   To figure out the pattern of passing rows, we want to know how many 
contiguous rows there are in the matched filter
   
   ```
   (t)
   (t)
   (t)
   (f)
   (f)
   ```
   
   Would have two selections (3 true, 2 false)
   
   Here is the sql:
   
   ```sql
   SET datafusion.execution.target_partitions = 1;
   
   WITH
   hits as (
     SELECT
       "SearchPhrase",
       row_number() OVER () as rn
     FROM
       'hits.parquet'
   )
   ,results as (
     SELECT
       rn,
       "SearchPhrase",
       "SearchPhrase" <> '',
       ("SearchPhrase" <> '') = (LAG("SearchPhrase" <> '', 1) OVER ()) as 
"filter_same_as_previous"
     FROM
      hits
   )
   SELECT
     filter_same_as_previous, COUNT(*)
   FROM results
   GROUP BY
     filter_same_as_previous
   --LIMIT 10
   ;
   
   +-------------------------+----------+
   | filter_same_as_previous | count(*) |
   +-------------------------+----------+
   | NULL                    | 1        |
   | true                    | 85942712 |
   | false                   | 14054784 |
   +-------------------------+----------+
   ```
   
   </p>
   </details> 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to