2010YOUY01 opened a new pull request, #173:
URL: https://github.com/apache/sedona-db/pull/173

   # Rationale
   
   It's useful to see how many files or row groups are pruned by the spatial filter. This PR extends the `DataSourceExec` metrics reported by `GeoParquetFileOpener` with counters for spatial predicate pruning:
   
   ```rust
   #[derive(Clone)]
   struct GeoParquetFileOpenerMetrics {
       /// How many file ranges are pruned by [`SpatialFilter`]
       ///
       /// Note on "file range": an opener may read only part of a file rather than the
       /// entire file; that portion is referred to as the "file range". See [`PartitionedFile`]
       /// for details.
       files_ranges_spatial_pruned: Count,
       /// How many file ranges are matched by [`SpatialFilter`]
       files_ranges_spatial_matched: Count,
       /// How many row groups are pruned by [`SpatialFilter`]
       row_groups_spatial_pruned: Count,
       /// How many row groups are matched by [`SpatialFilter`]
       row_groups_spatial_matched: Count,
   }
   ```
   
   <details>
   <summary>Demo</summary>
   
   See `*_spatial_*` entries in metrics:
   ```sh
   Sedona CLI v0.2.0
   > CREATE EXTERNAL TABLE test
   STORED AS PARQUET
   LOCATION '/Users/yongting/Code/sedona-db/submodules/sedona-testing/data/parquet/geoparquet-1.1.0.parquet';
   0 row(s)/0 column(s) fetched.
   Elapsed 0.031 seconds.
   // Spatial predicate that pruned the entire file
   >         EXPLAIN ANALYZE
           SELECT *
           FROM test
           WHERE ST_Intersects(
               geometry,
               ST_SetSRID(
                   ST_GeomFromText('POLYGON((-10 84, -10 88, 10 88, 10 84, -10 84))'),
                   4326
               )
           );
   
   ┌───────────────────┬─────────────────────────────────────────────────────────────────────────────────────────┐
   │     plan_type     ┆                                           plan                                          │
   │        utf8       ┆                                           utf8                                          │
   ╞═══════════════════╪═════════════════════════════════════════════════════════════════════════════════════════╡
   │ Plan with Metrics ┆ CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=0, elapsed_compute=1. │
   │                   ┆ 377µs]                                                                                  │
   │                   ┆   FilterExec: st_intersects(geometry@5, 01030000000100000005...), metrics=[output_rows= │
   │                   ┆ 0, elapsed_compute=14ns]                                                                │
   │                   ┆     RepartitionExec: partitioning=RoundRobinBatch(14), input_partitions=1, metrics=[fet │
   │                   ┆ ch_time=2.498667ms, repartition_time=1ns, send_time=14ns]                               │
   │                   ┆       DataSourceExec: file_groups={1 group: [[Users/yongting/Code/sedona-db/submodules/ │
   │                   ┆ sedona-testing/data/parquet/geoparquet-1.1.0.parquet]]}, projection=[pop_est, continent │
   │                   ┆ , name, iso_a3, gdp_md_est, geometry, bbox], file_type=parquet, metrics=[output_rows=0, │
   │                   ┆  elapsed_compute=1ns, batches_splitted=0, bytes_scanned=0, file_open_errors=0, file_sca │
   │                   ┆ n_errors=0, files_ranges_pruned_statistics=0, files_ranges_spatial_matched=0, files_ran │
   │                   ┆ ges_spatial_pruned=1, num_predicate_creation_errors=0, page_index_rows_matched=0, page_ │
   │                   ┆ index_rows_pruned=0, predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_r │
   │                   ┆ ows_pruned=0, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=0, row_g │
   │                   ┆ roups_pruned_bloom_filter=0, row_groups_pruned_statistics=0, row_groups_spatial_matched │
   │                   ┆ =0, row_groups_spatial_pruned=0, bloom_filter_eval_time=2ns, metadata_load_time=820.626 │
   │                   ┆ µs, page_index_eval_time=126ns, row_pushdown_eval_time=2ns, statistics_eval_time=2ns, t │
   │                   ┆ ime_elapsed_opening=2.100209ms, time_elapsed_processing=1.899918ms, time_elapsed_scanni │
   │                   ┆ ng_total=3.709µs, time_elapsed_scanning_until_data=3.708µs]                             │
   │                   ┆                                                                                         │
   └───────────────────┴─────────────────────────────────────────────────────────────────────────────────────────┘
   1 row(s)/2 column(s) fetched.
   Elapsed 0.046 seconds.
   // Spatial predicate that cannot prune any file or row group
   >         EXPLAIN ANALYZE
           SELECT *
           FROM test
           WHERE ST_Intersects(
               geometry,
               ST_SetSRID(
                   ST_GeomFromText(
                       'POLYGON((-180 -18.28799, -180 83.23324, 180 83.23324, 180 -18.28799, -180 -18.28799))'
                   ),
                   4326
               )
           );
   
   ┌───────────────────┬─────────────────────────────────────────────────────────────────────────────────────────┐
   │     plan_type     ┆                                           plan                                          │
   │        utf8       ┆                                           utf8                                          │
   ╞═══════════════════╪═════════════════════════════════════════════════════════════════════════════════════════╡
   │ Plan with Metrics ┆ CoalesceBatchesExec: target_batch_size=8192, metrics=[output_rows=5, elapsed_compute=11 │
   │                   ┆ 4.079µs]                                                                                │
   │                   ┆   FilterExec: st_intersects(geometry@5, 01030000000100000005...), metrics=[output_rows= │
   │                   ┆ 5, elapsed_compute=919.596µs]                                                           │
   │                   ┆     RepartitionExec: partitioning=RoundRobinBatch(14), input_partitions=1, metrics=[fet │
   │                   ┆ ch_time=7.449916ms, repartition_time=1ns, send_time=28.347µs]                           │
   │                   ┆       DataSourceExec: file_groups={1 group: [[Users/yongting/Code/sedona-db/submodules/ │
   │                   ┆ sedona-testing/data/parquet/geoparquet-1.1.0.parquet]]}, projection=[pop_est, continent │
   │                   ┆ , name, iso_a3, gdp_md_est, geometry, bbox], file_type=parquet, metrics=[output_rows=5, │
   │                   ┆  elapsed_compute=1ns, batches_splitted=0, bytes_scanned=21777, file_open_errors=0, file │
   │                   ┆ _scan_errors=0, files_ranges_pruned_statistics=0, files_ranges_spatial_matched=1, files │
   │                   ┆ _ranges_spatial_pruned=0, num_predicate_creation_errors=0, page_index_rows_matched=0, p │
   │                   ┆ age_index_rows_pruned=0, predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdo │
   │                   ┆ wn_rows_pruned=0, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=0, r │
   │                   ┆ ow_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=0, row_groups_spatial_mat │
   │                   ┆ ched=1, row_groups_spatial_pruned=0, bloom_filter_eval_time=2ns, metadata_load_time=1.0 │
   │                   ┆ 62084ms, page_index_eval_time=376ns, row_pushdown_eval_time=2ns, statistics_eval_time=2 │
   │                   ┆ ns, time_elapsed_opening=2.791916ms, time_elapsed_processing=5.083834ms, time_elapsed_s │
   │                   ┆ canning_total=4.124333ms, time_elapsed_scanning_until_data=3.912625ms]                  │
   │                   ┆                                                                                         │
   └───────────────────┴─────────────────────────────────────────────────────────────────────────────────────────┘
   1 row(s)/2 column(s) fetched.
   Elapsed 0.035 seconds.
   ```
   
   </details>
   
   # Implementation
   
   Added a struct that holds the spatial pruning metrics inside `GeoParquetFileOpener`, and updated those metrics during spatial filtering.
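
To illustrate the idea, here is a minimal, std-only sketch of how such counters could be updated while pruning. It is not the code in this PR. `Count` below is a hypothetical stand-in for DataFusion's metric counter, and `BBox` and `prune_file_range` are assumed names that reduce the spatial filter to a bounding-box intersection test. Only the four metric field names come from the PR.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a shared metric counter (not DataFusion's type).
#[derive(Clone, Default)]
struct Count(Arc<AtomicUsize>);

impl Count {
    fn add(&self, n: usize) {
        self.0.fetch_add(n, Ordering::Relaxed);
    }
    fn value(&self) -> usize {
        self.0.load(Ordering::Relaxed)
    }
}

/// Same field names as the struct in this PR; `Default` is added for the sketch.
#[derive(Clone, Default)]
struct GeoParquetFileOpenerMetrics {
    files_ranges_spatial_pruned: Count,
    files_ranges_spatial_matched: Count,
    row_groups_spatial_pruned: Count,
    row_groups_spatial_matched: Count,
}

/// Assumed simplification: the spatial filter is a bounding-box intersection.
#[derive(Clone, Copy)]
struct BBox {
    xmin: f64,
    ymin: f64,
    xmax: f64,
    ymax: f64,
}

impl BBox {
    fn intersects(&self, other: &BBox) -> bool {
        self.xmin <= other.xmax
            && other.xmin <= self.xmax
            && self.ymin <= other.ymax
            && other.ymin <= self.ymax
    }
}

/// Returns the indices of row groups that still need to be read,
/// updating the counters along the way.
fn prune_file_range(
    query: &BBox,
    file_bbox: &BBox,
    row_group_bboxes: &[BBox],
    metrics: &GeoParquetFileOpenerMetrics,
) -> Vec<usize> {
    // File-level check first: a pruned file range never touches row groups.
    if !query.intersects(file_bbox) {
        metrics.files_ranges_spatial_pruned.add(1);
        return Vec::new();
    }
    metrics.files_ranges_spatial_matched.add(1);

    let mut keep = Vec::new();
    for (i, rg) in row_group_bboxes.iter().enumerate() {
        if query.intersects(rg) {
            metrics.row_groups_spatial_matched.add(1);
            keep.push(i);
        } else {
            metrics.row_groups_spatial_pruned.add(1);
        }
    }
    keep
}
```

Under these assumptions, a query that misses the file's bounding box increments only `files_ranges_spatial_pruned` and leaves both row group counters at zero, which is the pattern in the first demo query. A query that overlaps the file increments `files_ranges_spatial_matched` and then counts each row group as matched or pruned.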
   

