alamb commented on issue #20325:
URL: https://github.com/apache/datafusion/issues/20325#issuecomment-3892947875

   Another thing I did was look at the actual pattern of selected rows.
   
   The overall predicate is quite selective (about 2% of rows pass in this sample). Each selected run is often just a few rows (2, 3) but is sometimes much longer (e.g. 44, 34).
   
   
   Here is a visualization of the Q10 filter pattern:
   ```
   Predicate: "MobilePhoneModel" <> ''
   Source sample: first 1,000,000 rows of hits_0.parquet
   
   Rows: total=1000000 selected=19643 (1.964%) skipped=980357 (98.036%)
   selected_runs=2438 avg_len=8.06 min=1 max=236
   skipped_runs=2439 avg_len=401.95 min=1 max=5771
   
   selected_run_len quantiles (rows): p50=3 p90=19 p95=33 p99=77
   skipped_run_len quantiles (rows): p50=233 p90=980 p95=1358 p99=2297
   
   Legend: S=selected run, .=skipped run
   Scale: bar width ~= run_len/50 rows (min 1, cap 60)
   
   First 120 runs:
      1 skipped  len=1065  rows=[1..1065] .....................
      2 selected len=2     rows=[1066..1067] S
      3 skipped  len=10    rows=[1068..1077] .
      4 selected len=44    rows=[1078..1121] S
      5 skipped  len=133   rows=[1122..1254] ..
      6 selected len=34    rows=[1255..1288] S
      7 skipped  len=8     rows=[1289..1296] .
      8 selected len=2     rows=[1297..1298] S
      9 skipped  len=152   rows=[1299..1450] ...
     10 selected len=6     rows=[1451..1456] S
     11 skipped  len=400   rows=[1457..1856] ........
     12 selected len=2     rows=[1857..1858] S
     13 skipped  len=1145  rows=[1859..3003] ......................
     14 selected len=26    rows=[3004..3029] S
     15 skipped  len=971   rows=[3030..4000] ...................
     16 selected len=6     rows=[4001..4006] S
     17 skipped  len=32    rows=[4007..4038] .
     18 selected len=4     rows=[4039..4042] S
     19 skipped  len=271   rows=[4043..4313] .....
     20 selected len=8     rows=[4314..4321] S
     21 skipped  len=20    rows=[4322..4341] .
     22 selected len=2     rows=[4342..4343] S
     23 skipped  len=166   rows=[4344..4509] ...
     24 selected len=24    rows=[4510..4533] S
     25 skipped  len=1512  rows=[4534..6045] ..............................
     26 selected len=8     rows=[6046..6053] S
     27 skipped  len=1283  rows=[6054..7336] .........................
     28 selected len=4     rows=[7337..7340] S
     29 skipped  len=398   rows=[7341..7738] .......
     30 selected len=8     rows=[7739..7746] S
     31 skipped  len=147   rows=[7747..7893] ..
     32 selected len=1     rows=[7894..7894] S
     33 skipped  len=375   rows=[7895..8269] .......
     34 selected len=36    rows=[8270..8305] S
     35 skipped  len=148   rows=[8306..8453] ..
     36 selected len=2     rows=[8454..8455] S
     37 skipped  len=26    rows=[8456..8481] .
     38 selected len=2     rows=[8482..8483] S
     39 skipped  len=303   rows=[8484..8786] ......
     40 selected len=38    rows=[8787..8824] S
     41 skipped  len=428   rows=[8825..9252] ........
     42 selected len=20    rows=[9253..9272] S
     43 skipped  len=42    rows=[9273..9314] .
     44 selected len=6     rows=[9315..9320] S
     45 skipped  len=1010  rows=[9321..10330] ....................
   ...
     85 skipped  len=803   rows=[17475..18277] ................
     86 selected len=6     rows=[18278..18283] S
     87 skipped  len=132   rows=[18284..18415] ..
     88 selected len=26    rows=[18416..18441] S
     89 skipped  len=170   rows=[18442..18611] ...
     90 selected len=4     rows=[18612..18615] S
     91 skipped  len=34    rows=[18616..18649] .
     92 selected len=3     rows=[18650..18652] S
     93 skipped  len=36    rows=[18653..18688] .
     94 selected len=2     rows=[18689..18690] S
     95 skipped  len=766   rows=[18691..19456] ...............
     96 selected len=6     rows=[19457..19462] S
     97 skipped  len=301   rows=[19463..19763] ......
     98 selected len=4     rows=[19764..19767] S
     99 skipped  len=517   rows=[19768..20284] ..........
    100 selected len=6     rows=[20285..20290] S
    101 skipped  len=25    rows=[20291..20315] .
    102 selected len=2     rows=[20316..20317] S
    103 skipped  len=749   rows=[20318..21066] ..............
    104 selected len=20    rows=[21067..21086] S
    105 skipped  len=688   rows=[21087..21774] .............
    106 selected len=6     rows=[21775..21780] S
    107 skipped  len=376   rows=[21781..22156] .......
    108 selected len=2     rows=[22157..22158] S
    109 skipped  len=309   rows=[22159..22467] ......
    110 selected len=2     rows=[22468..22469] S
    111 skipped  len=211   rows=[22470..22680] ....
    112 selected len=20    rows=[22681..22700] S
    113 skipped  len=902   rows=[22701..23602] ..................
    114 selected len=4     rows=[23603..23606] S
    115 skipped  len=38    rows=[23607..23644] .
    116 selected len=4     rows=[23645..23648] S
    117 skipped  len=60    rows=[23649..23708] .
    118 selected len=4     rows=[23709..23712] S
    119 skipped  len=1264  rows=[23713..24976] .........................
    120 selected len=10    rows=[24977..24986] S
   ```
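
For anyone who wants to reproduce a listing like the one above, here is a minimal sketch of how it could be derived from a boolean selection mask. It is not the script that produced the output; `runs_from_mask`, `bar`, and `summarize` are hypothetical names. The floor rounding in `bar` is inferred from the listing (for example, `len=398` draws 7 characters), and the quantile lines are omitted because the quantile method used is not stated.

```python
from itertools import groupby


def runs_from_mask(mask):
    """Return (selected, length, first_row, last_row) per run; rows are 1-based, inclusive."""
    runs = []
    row = 1
    for selected, group in groupby(mask, key=bool):
        length = sum(1 for _ in group)
        runs.append((selected, length, row, row + length - 1))
        row += length
    return runs


def bar(selected, length, rows_per_char=50, cap=60):
    """Bar of width ~ length / rows_per_char, clamped to [1, cap] (floor rounding is an assumption)."""
    width = max(1, min(cap, length // rows_per_char))
    return ("S" if selected else ".") * width


def summarize(runs, selected):
    """Count, average, min, and max run length for selected (True) or skipped (False) runs."""
    lengths = [n for s, n, _, _ in runs if s == selected]
    if not lengths:
        return {"runs": 0}
    return {
        "runs": len(lengths),
        "avg_len": sum(lengths) / len(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }


# The first four runs from the listing above, rebuilt as a boolean mask.
mask = [False] * 1065 + [True] * 2 + [False] * 10 + [True] * 44
for i, (selected, length, first, last) in enumerate(runs_from_mask(mask), start=1):
    kind = "selected" if selected else "skipped"
    print(f"{i:4d} {kind:<8} len={length:<5} rows=[{first}..{last}] {bar(selected, length)}")
```

In practice the mask would come from evaluating the predicate on the column; the hard-coded mask here only mirrors the first four runs shown above.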
   
   - Filter pattern (from prior analysis files):
     - Sample in `hits_0.parquet` first 1,000,000 rows: selected `1.964%`, skipped `98.036%`.
     - Selected runs are short (p50=3, p95=33); skipped runs are long (p50=233, p95=1358).
     - Across `hits_1..hits_99`, pattern is not uniform:
       - selected_pct mean `5.599708%`, sd `2.290834`, min `0.2681%` (`hits_48`), max `13.0079%` (`hits_17`).
       - Conclusion: substantial cross-file variation; static thresholds are likely brittle.
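
The cross-file summary above could be computed along these lines. This is only a sketch: `selected_pct_spread` is a hypothetical helper, and using the sample standard deviation (`statistics.stdev`) is an assumption, since the comment does not say whether the sample or population form was reported.

```python
from statistics import mean, stdev


def selected_pct_spread(pct_by_file):
    """Summarize per-file selected percentages: mean, sample sd, and the min/max files."""
    values = list(pct_by_file.values())
    lo = min(pct_by_file, key=pct_by_file.get)
    hi = max(pct_by_file, key=pct_by_file.get)
    return {
        "mean": mean(values),
        "sd": stdev(values) if len(values) > 1 else 0.0,
        "min": (lo, pct_by_file[lo]),
        "max": (hi, pct_by_file[hi]),
    }


# Only the three percentages quoted in this comment, so the result will not
# reproduce the mean/sd reported over hits_1..hits_99.
sample = {"hits_0": 1.964, "hits_48": 0.2681, "hits_17": 13.0079}
print(selected_pct_spread(sample))
```

Feeding in the per-file percentages for all files would give the full picture; the spread between the min and max files is what makes a single static threshold look fragile.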

