alamb commented on issue #20325:
URL: https://github.com/apache/datafusion/issues/20325#issuecomment-3892947875
Another thing I did was look at the actual pattern of selected rows.
You can see the overall predicate is quite selective, and each selected run is
often only a few rows (2 or 3) but is sometimes much longer (e.g. 44 or 34).
- Selected runs are short (p50=3, p95=33); skipped runs are long (p50=233,
p95=1358).
Here is a visualization of the Q10 filter pattern:
```
Predicate: "MobilePhoneModel" <> ''
Source sample: first 1,000,000 rows of hits_0.parquet
Rows: total=1000000 selected=19643 (1.964%) skipped=980357 (98.036%)
selected_runs=2438 avg_len=8.06 min=1 max=236
skipped_runs=2439 avg_len=401.95 min=1 max=5771
selected_run_len quantiles (rows): p50=3 p90=19 p95=33 p99=77
skipped_run_len quantiles (rows): p50=233 p90=980 p95=1358 p99=2297
Legend: S=selected run, .=skipped run
Scale: bar width ~= run_len/50 rows (min 1, cap 60)
First 120 runs:
1 skipped len=1065 rows=[1..1065] .....................
2 selected len=2 rows=[1066..1067] S
3 skipped len=10 rows=[1068..1077] .
4 selected len=44 rows=[1078..1121] S
5 skipped len=133 rows=[1122..1254] ..
6 selected len=34 rows=[1255..1288] S
7 skipped len=8 rows=[1289..1296] .
8 selected len=2 rows=[1297..1298] S
9 skipped len=152 rows=[1299..1450] ...
10 selected len=6 rows=[1451..1456] S
11 skipped len=400 rows=[1457..1856] ........
12 selected len=2 rows=[1857..1858] S
13 skipped len=1145 rows=[1859..3003] ......................
14 selected len=26 rows=[3004..3029] S
15 skipped len=971 rows=[3030..4000] ...................
16 selected len=6 rows=[4001..4006] S
17 skipped len=32 rows=[4007..4038] .
18 selected len=4 rows=[4039..4042] S
19 skipped len=271 rows=[4043..4313] .....
20 selected len=8 rows=[4314..4321] S
21 skipped len=20 rows=[4322..4341] .
22 selected len=2 rows=[4342..4343] S
23 skipped len=166 rows=[4344..4509] ...
24 selected len=24 rows=[4510..4533] S
25 skipped len=1512 rows=[4534..6045] ..............................
26 selected len=8 rows=[6046..6053] S
27 skipped len=1283 rows=[6054..7336] .........................
28 selected len=4 rows=[7337..7340] S
29 skipped len=398 rows=[7341..7738] .......
30 selected len=8 rows=[7739..7746] S
31 skipped len=147 rows=[7747..7893] ..
32 selected len=1 rows=[7894..7894] S
33 skipped len=375 rows=[7895..8269] .......
34 selected len=36 rows=[8270..8305] S
35 skipped len=148 rows=[8306..8453] ..
36 selected len=2 rows=[8454..8455] S
37 skipped len=26 rows=[8456..8481] .
38 selected len=2 rows=[8482..8483] S
39 skipped len=303 rows=[8484..8786] ......
40 selected len=38 rows=[8787..8824] S
41 skipped len=428 rows=[8825..9252] ........
42 selected len=20 rows=[9253..9272] S
43 skipped len=42 rows=[9273..9314] .
44 selected len=6 rows=[9315..9320] S
45 skipped len=1010 rows=[9321..10330] ....................
...
85 skipped len=803 rows=[17475..18277] ................
86 selected len=6 rows=[18278..18283] S
87 skipped len=132 rows=[18284..18415] ..
88 selected len=26 rows=[18416..18441] S
89 skipped len=170 rows=[18442..18611] ...
90 selected len=4 rows=[18612..18615] S
91 skipped len=34 rows=[18616..18649] .
92 selected len=3 rows=[18650..18652] S
93 skipped len=36 rows=[18653..18688] .
94 selected len=2 rows=[18689..18690] S
95 skipped len=766 rows=[18691..19456] ...............
96 selected len=6 rows=[19457..19462] S
97 skipped len=301 rows=[19463..19763] ......
98 selected len=4 rows=[19764..19767] S
99 skipped len=517 rows=[19768..20284] ..........
100 selected len=6 rows=[20285..20290] S
101 skipped len=25 rows=[20291..20315] .
102 selected len=2 rows=[20316..20317] S
103 skipped len=749 rows=[20318..21066] ..............
104 selected len=20 rows=[21067..21086] S
105 skipped len=688 rows=[21087..21774] .............
106 selected len=6 rows=[21775..21780] S
107 skipped len=376 rows=[21781..22156] .......
108 selected len=2 rows=[22157..22158] S
109 skipped len=309 rows=[22159..22467] ......
110 selected len=2 rows=[22468..22469] S
111 skipped len=211 rows=[22470..22680] ....
112 selected len=20 rows=[22681..22700] S
113 skipped len=902 rows=[22701..23602] ..................
114 selected len=4 rows=[23603..23606] S
115 skipped len=38 rows=[23607..23644] .
116 selected len=4 rows=[23645..23648] S
117 skipped len=60 rows=[23649..23708] .
118 selected len=4 rows=[23709..23712] S
119 skipped len=1264 rows=[23713..24976] .........................
120 selected len=10 rows=[24977..24986] S
```
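For anyone who wants to reproduce this kind of summary, here is a minimal sketch of how the run statistics could be derived from a boolean selection mask. It is not the script that produced the output above: the function names (`run_lengths`, `nearest_rank_quantile`, `summarize`) are hypothetical, and the nearest-rank quantile method is an assumption, since the report does not say how its quantiles were computed. The sample mask is rebuilt from the first six runs in the listing.

```python
import math
from itertools import groupby


def run_lengths(mask):
    """Collapse a boolean selection mask into (selected, run_length) pairs."""
    return [(bool(flag), sum(1 for _ in group)) for flag, group in groupby(mask)]


def nearest_rank_quantile(values, q):
    """Nearest-rank quantile (an assumed method, not necessarily the report's)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("cannot take a quantile of an empty list")
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def summarize(mask):
    """Summarize selected and skipped runs for one selection mask."""
    runs = run_lengths(mask)
    selected = [n for flag, n in runs if flag]
    skipped = [n for flag, n in runs if not flag]
    total = len(mask)
    return {
        "total": total,
        "selected_rows": sum(selected),
        "selected_pct": 100.0 * sum(selected) / total if total else 0.0,
        "selected_runs": len(selected),
        "skipped_runs": len(skipped),
        "selected_p50": nearest_rank_quantile(selected, 0.5) if selected else None,
        "skipped_p50": nearest_rank_quantile(skipped, 0.5) if skipped else None,
    }


# Mask rebuilt from runs 1..6 of the listing (rows 1..1288).
mask = [False] * 1065 + [True] * 2 + [False] * 10 + [True] * 44 + [False] * 133 + [True] * 34
summary = summarize(mask)
print(summary)
```

In practice the mask would come from evaluating the predicate over the sampled rows; only the run-length bookkeeping is shown here.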
- Filter pattern (from prior analysis files):
  - Sample in `hits_0.parquet` first 1,000,000 rows: selected `1.964%`,
    skipped `98.036%`.
  - Selected runs are short (p50=3, p95=33); skipped runs are long (p50=233,
    p95=1358).
  - Across `hits_1..hits_99`, pattern is not uniform:
    - selected_pct mean `5.599708%`, sd `2.290834`, min `0.2681%`
      (`hits_48`), max `13.0079%` (`hits_17`).
    - Conclusion: substantial cross-file variation; static thresholds are
      likely brittle.
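To make the brittleness point concrete, here is a rough sketch of how per-file selectivity could be summarized and compared against a fixed cutoff. Everything here is illustrative: the helper names and the 5% cutoff are hypothetical, and only the `hits_17` and `hits_48` values come from the numbers above. The `hits_a` and `hits_b` entries are made-up placeholders. The sketch uses the sample standard deviation, which may not be the variant behind the `sd` figure quoted above.

```python
import statistics


def summarize_selectivity(selected_pct_by_file):
    """Mean, sample sd, and extreme files for per-file selected percentages."""
    values = list(selected_pct_by_file.values())
    lo = min(selected_pct_by_file, key=selected_pct_by_file.get)
    hi = max(selected_pct_by_file, key=selected_pct_by_file.get)
    return {
        "mean": statistics.fmean(values),
        "sd": statistics.stdev(values),  # sample sd; an assumption
        "min": (lo, selected_pct_by_file[lo]),
        "max": (hi, selected_pct_by_file[hi]),
    }


def split_by_threshold(selected_pct_by_file, threshold_pct):
    """Partition files by a fixed (hypothetical) selectivity cutoff."""
    below = sorted(f for f, pct in selected_pct_by_file.items() if pct < threshold_pct)
    above = sorted(f for f, pct in selected_pct_by_file.items() if pct >= threshold_pct)
    return below, above


# hits_17 and hits_48 are from the comment; hits_a and hits_b are placeholders.
sample = {"hits_17": 13.0079, "hits_48": 0.2681, "hits_a": 4.0, "hits_b": 6.5}
stats = summarize_selectivity(sample)
below, above = split_by_threshold(sample, threshold_pct=5.0)
print(stats)
print("below cutoff:", below, "at or above cutoff:", above)
```

With a spread like this, a single fixed cutoff puts files from the same dataset on opposite sides, which is one way a static threshold can end up brittle.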