alamb commented on issue #7460:
URL: https://github.com/apache/arrow-rs/issues/7460#issuecomment-2843142141
I did some analysis on `hits.parquet`:
* Selectivity is: `13172392` / `99997497` = `0.132`
* Average run length of each `RowSelection`: `99997497` / `14054784` =
`7.114`
<details><summary>Worked Details</summary>
<p>
```sql
> SELECT count("SearchPhrase") FROM hits_partitioned;
+--------------------------------------+
| count(hits_partitioned.SearchPhrase) |
+--------------------------------------+
| 99997497 |
+--------------------------------------+
> SELECT count("SearchPhrase") FROM hits_partitioned WHERE "SearchPhrase" <>
'';
+--------------------------------------+
| count(hits_partitioned.SearchPhrase) |
+--------------------------------------+
| 13172392 |
+--------------------------------------+
```
So selectivity is: 13172392 / 99997497 = 0.1317272171
To figure out the pattern of passing rows, we want to know how many
contiguous rows there are in the matched filter
```
(t)
(t)
(t)
(f)
(f)
```
Would have two selections (3 true, 2 false)
Here is the sql:
```sql
SET datafusion.execution.target_partitions = 1;
WITH
hits as (
SELECT
"SearchPhrase",
row_number() OVER () as rn
FROM
'hits.parquet'
)
,results as (
SELECT
rn,
"SearchPhrase",
"SearchPhrase" <> '',
("SearchPhrase" <> '') = (LAG("SearchPhrase" <> '', 1) OVER ()) as
"filter_same_as_previous"
FROM
hits
)
SELECT
filter_same_as_previous, COUNT(*)
FROM results
GROUP BY
filter_same_as_previous
--LIMIT 10
;
+-------------------------+----------+
| filter_same_as_previous | count(*) |
+-------------------------+----------+
| NULL | 1 |
| true | 85942712 |
| false | 14054784 |
+-------------------------+----------+
```
</p>
</details>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]