alamb commented on issue #7460: URL: https://github.com/apache/arrow-rs/issues/7460#issuecomment-2843506718
Here is the `test.parquet` file being created by the benchmark: [test.zip](https://github.com/user-attachments/files/19985142/test.zip)

The equivalent numbers are:
* Selectivity is: `80147` / `100000` = `0.8`
* Number of RowSelections = `32010` (the rows where the filter result differs from the previous row)
* Average run length of each RowSelection: `100000 / 32010` = `3.1`

So in other words, I think the filter benchmark doesn't quite match what is in the ClickBench file.

<details><summary>Details</summary>
<p>

```sql
> select count(*) from '/tmp/test.parquet' where "utf8View" <> '';
+----------+
| count(*) |
+----------+
| 80147    |
+----------+
1 row(s) fetched.
Elapsed 0.015 seconds.

> select count(*) from '/tmp/test.parquet';
+----------+
| count(*) |
+----------+
| 100000   |
+----------+
1 row(s) fetched.
Elapsed 0.004 seconds.

> WITH hits as (
  SELECT "utf8View", row_number() OVER () as rn
  FROM '/tmp/test.parquet'
)
,results as (
  SELECT
    rn,
    "utf8View",
    "utf8View" <> '',
    ("utf8View" <> '') = (LAG("utf8View" <> '', 1) OVER ()) as "filter_same_as_previous"
  FROM hits
)
SELECT filter_same_as_previous, COUNT(*)
FROM results
GROUP BY filter_same_as_previous
--LIMIT 10
;
+-------------------------+----------+
| filter_same_as_previous | count(*) |
+-------------------------+----------+
| NULL                    | 1        |
| true                    | 67989    |
| false                   | 32010    |
+-------------------------+----------+
3 row(s) fetched.
Elapsed 0.017 seconds.
```

</p>
</details>
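For anyone who wants to reproduce these figures without SQL, here is a minimal sketch in plain Rust of the same arithmetic over an in-memory boolean filter mask. `RunStats` and `run_stats` are hypothetical names made up for illustration and are not part of arrow-rs or the benchmark. The sketch assumes that every maximal run of equal filter results (selected or skipped) counts as one run, so the number of runs is the number of value changes plus one.

```rust
/// Summary of how a boolean filter mask breaks into runs.
/// (Hypothetical helper type, not an arrow-rs API.)
#[derive(Debug, PartialEq)]
pub struct RunStats {
    /// Fraction of rows for which the filter is true.
    pub selectivity: f64,
    /// Number of maximal runs of identical filter results.
    pub runs: usize,
    /// Total rows divided by the number of runs.
    pub avg_run_length: f64,
}

/// Computes selectivity and run statistics for `mask`.
/// Returns `None` for an empty mask, where the ratios are undefined.
pub fn run_stats(mask: &[bool]) -> Option<RunStats> {
    if mask.is_empty() {
        return None;
    }
    // Rows that pass the filter.
    let selected = mask.iter().filter(|&&keep| keep).count();
    // Adjacent pairs whose filter result differs; this is the
    // `filter_same_as_previous = false` count in the SQL query.
    let transitions = mask.windows(2).filter(|w| w[0] != w[1]).count();
    // Each transition starts a new run, and the first row starts one too.
    let runs = transitions + 1;
    Some(RunStats {
        selectivity: selected as f64 / mask.len() as f64,
        runs,
        avg_run_length: mask.len() as f64 / runs as f64,
    })
}
```

Applied to the query output above, 32010 changes give 32011 runs over 100000 rows, which is still an average run length of about 3.1.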