2010YOUY01 opened a new pull request, #173:
URL: https://github.com/apache/sedona-db/pull/173
# Rationale
It's useful to see how many files or row groups are pruned by the spatial
filter. This PR extends the `DataSourceExec` metrics reported by
`GeoParquetFileOpener` with counters for spatial predicate pruning:
```rust
#[derive(Clone)]
struct GeoParquetFileOpenerMetrics {
    /// How many file ranges are pruned by [`SpatialFilter`]
    ///
    /// Note on "file range": an opener may read only part of a file rather than the
    /// entire file; that portion is referred to as the "file range". See [`PartitionedFile`]
    /// for details.
    files_ranges_spatial_pruned: Count,
    /// How many file ranges are matched by [`SpatialFilter`]
    files_ranges_spatial_matched: Count,
    /// How many row groups are pruned by [`SpatialFilter`]
    row_groups_spatial_pruned: Count,
    /// How many row groups are matched by [`SpatialFilter`]
    row_groups_spatial_matched: Count,
}
```
<details>
<summary>Demo</summary>
See `*_spatial_*` entries in metrics:
```sh
Sedona CLI v0.2.0
> CREATE EXTERNAL TABLE test
STORED AS PARQUET
LOCATION
'/Users/yongting/Code/sedona-db/submodules/sedona-testing/data/parquet/geoparquet-1.1.0.parquet';
0 row(s)/0 column(s) fetched.
Elapsed 0.031 seconds.
// Spatial predicate that pruned the entire file
> EXPLAIN ANALYZE
SELECT *
FROM test
WHERE ST_Intersects(
geometry,
ST_SetSRID(
ST_GeomFromText('POLYGON((-10 84, -10 88, 10 88, 10 84, -10 84))'),
4326
)
);
┌───────────────────┬─────────────────────────────────────────────────────────────────────────────────────────┐
│ plan_type         ┆ plan                                                                                    │
│ utf8              ┆ utf8                                                                                    │
╞═══════════════════╪═════════════════════════════════════════════════════════════════════════════════════════╡
│ Plan with Metrics ┆ CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=0, elapsed_compute=1. │
│                   ┆ 377µs]                                                                                  │
│                   ┆   FilterExec: st_intersects(geometry@5, 01030000000100000005...), metrics=[output_rows= │
│                   ┆ 0, elapsed_compute=14ns]                                                                │
│                   ┆     RepartitionExec: partitioning=RoundRobinBatch(14), input_partitions=1, metrics=[fet │
│                   ┆ ch_time=2.498667ms, repartition_time=1ns, send_time=14ns]                               │
│                   ┆       DataSourceExec: file_groups={1 group: [[Users/yongting/Code/sedona-db/submodules/ │
│                   ┆ sedona-testing/data/parquet/geoparquet-1.1.0.parquet]]}, projection=[pop_est, continent │
│                   ┆ , name, iso_a3, gdp_md_est, geometry, bbox], file_type=parquet, metrics=[output_rows=0, │
│                   ┆  elapsed_compute=1ns, batches_splitted=0, bytes_scanned=0, file_open_errors=0, file_sca │
│                   ┆ n_errors=0, files_ranges_pruned_statistics=0, files_ranges_spatial_matched=0, files_ran │
│                   ┆ ges_spatial_pruned=1, num_predicate_creation_errors=0, page_index_rows_matched=0, page_ │
│                   ┆ index_rows_pruned=0, predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_r │
│                   ┆ ows_pruned=0, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=0, row_g │
│                   ┆ roups_pruned_bloom_filter=0, row_groups_pruned_statistics=0, row_groups_spatial_matched │
│                   ┆ =0, row_groups_spatial_pruned=0, bloom_filter_eval_time=2ns, metadata_load_time=820.626 │
│                   ┆ µs, page_index_eval_time=126ns, row_pushdown_eval_time=2ns, statistics_eval_time=2ns, t │
│                   ┆ ime_elapsed_opening=2.100209ms, time_elapsed_processing=1.899918ms, time_elapsed_scanni │
│                   ┆ ng_total=3.709µs, time_elapsed_scanning_until_data=3.708µs]                             │
│                   ┆                                                                                         │
└───────────────────┴─────────────────────────────────────────────────────────────────────────────────────────┘
1 row(s)/2 column(s) fetched.
Elapsed 0.046 seconds.
// Spatial predicate that cannot prune the file or any row group
> EXPLAIN ANALYZE
SELECT *
FROM test
WHERE ST_Intersects(
geometry,
ST_SetSRID(
ST_GeomFromText(
'POLYGON((-180 -18.28799, -180 83.23324, 180 83.23324, 180 -18.28799, -180 -18.28799))'
),
4326
)
);
┌───────────────────┬─────────────────────────────────────────────────────────────────────────────────────────┐
│ plan_type         ┆ plan                                                                                    │
│ utf8              ┆ utf8                                                                                    │
╞═══════════════════╪═════════════════════════════════════════════════════════════════════════════════════════╡
│ Plan with Metrics ┆ CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=5, elapsed_compute=11 │
│                   ┆ 4.079µs]                                                                                │
│                   ┆   FilterExec: st_intersects(geometry@5, 01030000000100000005...), metrics=[output_rows= │
│                   ┆ 5, elapsed_compute=919.596µs]                                                           │
│                   ┆     RepartitionExec: partitioning=RoundRobinBatch(14), input_partitions=1, metrics=[fet │
│                   ┆ ch_time=7.449916ms, repartition_time=1ns, send_time=28.347µs]                           │
│                   ┆       DataSourceExec: file_groups={1 group: [[Users/yongting/Code/sedona-db/submodules/ │
│                   ┆ sedona-testing/data/parquet/geoparquet-1.1.0.parquet]]}, projection=[pop_est, continent │
│                   ┆ , name, iso_a3, gdp_md_est, geometry, bbox], file_type=parquet, metrics=[output_rows=5, │
│                   ┆  elapsed_compute=1ns, batches_splitted=0, bytes_scanned=21777, file_open_errors=0, file │
│                   ┆ _scan_errors=0, files_ranges_pruned_statistics=0, files_ranges_spatial_matched=1, files │
│                   ┆ _ranges_spatial_pruned=0, num_predicate_creation_errors=0, page_index_rows_matched=0, p │
│                   ┆ age_index_rows_pruned=0, predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdo │
│                   ┆ wn_rows_pruned=0, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=0, r │
│                   ┆ ow_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=0, row_groups_spatial_mat │
│                   ┆ ched=1, row_groups_spatial_pruned=0, bloom_filter_eval_time=2ns, metadata_load_time=1.0 │
│                   ┆ 62084ms, page_index_eval_time=376ns, row_pushdown_eval_time=2ns, statistics_eval_time=2 │
│                   ┆ ns, time_elapsed_opening=2.791916ms, time_elapsed_processing=5.083834ms, time_elapsed_s │
│                   ┆ canning_total=4.124333ms, time_elapsed_scanning_until_data=3.912625ms]                  │
│                   ┆                                                                                         │
└───────────────────┴─────────────────────────────────────────────────────────────────────────────────────────┘
1 row(s)/2 column(s) fetched.
Elapsed 0.035 seconds.
```
</details>
# Implementation
Added a struct that holds the spatial pruning metrics inside
`GeoParquetFileOpener`, and updated those metrics during spatial filtering.
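
As a rough illustration of the counting logic (not the code in this PR), the sketch below uses only the standard library. `Count` here is a stand-in for DataFusion's metric type, and `BBox`, `SpatialPruningMetrics`, and `prune` are hypothetical names introduced for the example. It assumes pruning is a bounding-box intersection test applied first to the file range and then to each row group, which agrees with the demo output where a pruned file range leaves both row group counters at 0.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Stand-in for DataFusion's `Count` (assumption: a thread-safe counter
/// whose clones share the same underlying value).
#[derive(Clone, Default)]
struct Count(Arc<AtomicUsize>);

impl Count {
    fn add(&self, n: usize) {
        self.0.fetch_add(n, Ordering::Relaxed);
    }
    fn value(&self) -> usize {
        self.0.load(Ordering::Relaxed)
    }
}

/// Hypothetical mirror of the four counters added by this PR.
#[derive(Clone, Default)]
struct SpatialPruningMetrics {
    files_ranges_spatial_pruned: Count,
    files_ranges_spatial_matched: Count,
    row_groups_spatial_pruned: Count,
    row_groups_spatial_matched: Count,
}

/// Hypothetical axis-aligned bounding box.
#[derive(Clone, Copy)]
struct BBox {
    xmin: f64,
    ymin: f64,
    xmax: f64,
    ymax: f64,
}

impl BBox {
    /// True when the two boxes overlap or touch.
    fn intersects(&self, other: &BBox) -> bool {
        self.xmin <= other.xmax
            && other.xmin <= self.xmax
            && self.ymin <= other.ymax
            && other.ymin <= self.ymax
    }
}

/// Returns the indices of row groups that still need to be read and
/// records every pruning decision in `metrics`.
fn prune(
    query: &BBox,
    file_bbox: &BBox,
    row_group_bboxes: &[BBox],
    metrics: &SpatialPruningMetrics,
) -> Vec<usize> {
    // File-level check: if the whole file range misses the query,
    // skip it without looking at any row group.
    if !file_bbox.intersects(query) {
        metrics.files_ranges_spatial_pruned.add(1);
        return Vec::new();
    }
    metrics.files_ranges_spatial_matched.add(1);

    // Row-group-level check for file ranges that survived.
    let mut keep = Vec::new();
    for (i, rg) in row_group_bboxes.iter().enumerate() {
        if rg.intersects(query) {
            metrics.row_groups_spatial_matched.add(1);
            keep.push(i);
        } else {
            metrics.row_groups_spatial_pruned.add(1);
        }
    }
    keep
}
```

Because clones of `Count` share one atomic value in this sketch, a metrics struct can be cloned into each opener while the totals are still reported in a single place.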