alamb commented on issue #7460: URL: https://github.com/apache/arrow-rs/issues/7460#issuecomment-2843142141
I did some analysis on `hits.parquet`: * Selectivity is: `13172392` / `99997497` = `0.132` * Average run length of each `RowSelection`: `99997497` / `14054784` = `7.114` <details><summary>Worked Details</summary> <p> ```sql > SELECT count("SearchPhrase") FROM hits_partitioned; +--------------------------------------+ | count(hits_partitioned.SearchPhrase) | +--------------------------------------+ | 99997497 | +--------------------------------------+ > SELECT count("SearchPhrase") FROM hits_partitioned WHERE "SearchPhrase" <> ''; +--------------------------------------+ | count(hits_partitioned.SearchPhrase) | +--------------------------------------+ | 13172392 | +--------------------------------------+ ``` So selectivity is: 13172392 / 99997497 = 0.1317272171 To figure out the pattern of passing rows, we want to know how many contiguous rows there are in the matched filter ``` (t) (t) (t) (f) (f) ``` Would have two selections (3 true, 2 false) Here is the sql: ```sql SET datafusion.execution.target_partitions = 1; WITH hits as ( SELECT "SearchPhrase", row_number() OVER () as rn FROM 'hits.parquet' ) ,results as ( SELECT rn, "SearchPhrase", "SearchPhrase" <> '', ("SearchPhrase" <> '') = (LAG("SearchPhrase" <> '', 1) OVER ()) as "filter_same_as_previous" FROM hits ) SELECT filter_same_as_previous, COUNT(*) FROM results GROUP BY filter_same_as_previous --LIMIT 10 ; +-------------------------+----------+ | filter_same_as_previous | count(*) | +-------------------------+----------+ | NULL | 1 | | true | 85942712 | | false | 14054784 | +-------------------------+----------+ ``` </p> </details> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org