BlakeOrth opened a new pull request, #18146:
URL: https://github.com/apache/datafusion/pull/18146

   
   
   ## Which issue does this PR close?
   
    - https://github.com/apache/datafusion/issues/17211
   
   It's not yet clear to me if this will fully close the above issue, or if 
it's just the first step. I think there may be more work to do, so I'm not 
going to have this auto-close the issue.
   
   ## Rationale for this change
   
   tl;dr of the issue: normalizing the access pattern(s) for objects for 
partitioned tables should not only reduce the number of requests to a backing 
object store, but will also allow any existing and/or future caching mechanisms 
to apply equally to both directory-partitioned and flat tables.
   
   List request on `main`:
   ```sql
   DataFusion CLI v50.2.0
   > \object_store_profiling summary
   ObjectStore Profile mode set to Summary
   > CREATE EXTERNAL TABLE overture_partitioned
   STORED AS PARQUET LOCATION 
's3://overturemaps-us-west-2/release/2025-09-24.0/';
   0 row(s) fetched.
   Elapsed 37.236 seconds.
   
   Object Store Profiling
   Instrumented Object Store: instrument_mode: Summary, inner: 
AmazonS3(overturemaps-us-west-2)
   Summaries:
   +-----------+----------+-----+-----+-----+-----+-------+
   | Operation | Metric   | min | max | avg | sum | count |
   +-----------+----------+-----+-----+-----+-----+-------+
   | List      | duration |     |     |     |     | 1     |
   | List      | size     |     |     |     |     | 1     |
   +-----------+----------+-----+-----+-----+-----+-------+
   Instrumented Object Store: instrument_mode: Summary, inner: 
AmazonS3(overturemaps-us-west-2)
   Summaries:
   
+-----------+----------+-----------+-----------+-------------+-------------+-------+
   | Operation | Metric   | min       | max       | avg         | sum         | 
count |
   
+-----------+----------+-----------+-----------+-------------+-------------+-------+
   | Get       | duration | 0.044411s | 0.338399s | 0.104535s   | 162.133179s | 
1551  |
   | Get       | size     | 8 B       | 1285059 B | 338457.56 B | 524947683 B | 
1551  |
   | List      | duration |           |           |             |             | 
3     |
   | List      | size     |           |           |             |             | 
3     |
   
+-----------+----------+-----------+-----------+-------------+-------------+-------+
   > select count(*) from overture_partitioned;
   +------------+
   | count(*)   |
   +------------+
   | 4219677254 |
   +------------+
   1 row(s) fetched.
   Elapsed 40.061 seconds.
   
   Object Store Profiling
   Instrumented Object Store: instrument_mode: Summary, inner: 
AmazonS3(overturemaps-us-west-2)
   Summaries:
   
+-----------+----------+-----------+-----------+-------------+-------------+-------+
   | Operation | Metric   | min       | max       | avg         | sum         | 
count |
   
+-----------+----------+-----------+-----------+-------------+-------------+-------+
   | Get       | duration | 0.042554s | 0.453125s | 0.103147s   | 159.980835s | 
1551  |
   | Get       | size     | 8 B       | 1285059 B | 338457.56 B | 524947683 B | 
1551  |
   | List      | duration | 0.043498s | 0.196298s | 0.092462s   | 2.034174s   | 
22    |
   | List      | size     |           |           |             |             | 
22    |
   
+-----------+----------+-----------+-----------+-------------+-------------+-------+
   > select count(*) from overture_partitioned;
   +------------+
   | count(*)   |
   +------------+
   | 4219677254 |
   +------------+
   1 row(s) fetched.
   Elapsed 0.924 seconds.
   
   Object Store Profiling
   Instrumented Object Store: instrument_mode: Summary, inner: 
AmazonS3(overturemaps-us-west-2)
   Summaries:
   
+-----------+----------+-----------+-----------+-----------+-----------+-------+
   | Operation | Metric   | min       | max       | avg       | sum       | 
count |
   
+-----------+----------+-----------+-----------+-----------+-----------+-------+
   | List      | duration | 0.040526s | 0.161407s | 0.092792s | 2.041431s | 22  
  |
   | List      | size     |           |           |           |           | 22  
  |
   
+-----------+----------+-----------+-----------+-----------+-----------+-------+
   >
   ```
   
   List requests for this PR:
   ```sql
   DataFusion CLI v50.2.0
   > \object_store_profiling summary
   ObjectStore Profile mode set to Summary
   > CREATE EXTERNAL TABLE overture_partitioned
   STORED AS PARQUET LOCATION 
's3://overturemaps-us-west-2/release/2025-09-24.0/';
   0 row(s) fetched.
   Elapsed 33.962 seconds.
   
   Object Store Profiling
   Instrumented Object Store: instrument_mode: Summary, inner: 
AmazonS3(overturemaps-us-west-2)
   Summaries:
   +-----------+----------+-----+-----+-----+-----+-------+
   | Operation | Metric   | min | max | avg | sum | count |
   +-----------+----------+-----+-----+-----+-----+-------+
   | List      | duration |     |     |     |     | 1     |
   | List      | size     |     |     |     |     | 1     |
   +-----------+----------+-----+-----+-----+-----+-------+
   Instrumented Object Store: instrument_mode: Summary, inner: 
AmazonS3(overturemaps-us-west-2)
   Summaries:
   
+-----------+----------+-----------+-----------+-------------+-------------+-------+
   | Operation | Metric   | min       | max       | avg         | sum         | 
count |
   
+-----------+----------+-----------+-----------+-------------+-------------+-------+
   | Get       | duration | 0.043832s | 0.342730s | 0.110505s   | 171.393509s | 
1551  |
   | Get       | size     | 8 B       | 1285059 B | 338457.56 B | 524947683 B | 
1551  |
   | List      | duration |           |           |             |             | 
3     |
   | List      | size     |           |           |             |             | 
3     |
   
+-----------+----------+-----------+-----------+-------------+-------------+-------+
   > select count(*) from overture_partitioned;
   +------------+
   | count(*)   |
   +------------+
   | 4219677254 |
   +------------+
   1 row(s) fetched.
   Elapsed 38.119 seconds.
   
   Object Store Profiling
   Instrumented Object Store: instrument_mode: Summary, inner: 
AmazonS3(overturemaps-us-west-2)
   Summaries:
   
+-----------+----------+-----------+-----------+-------------+-------------+-------+
   | Operation | Metric   | min       | max       | avg         | sum         | 
count |
   
+-----------+----------+-----------+-----------+-------------+-------------+-------+
   | Get       | duration | 0.043186s | 0.296394s | 0.099681s   | 154.605286s | 
1551  |
   | Get       | size     | 8 B       | 1285059 B | 338457.56 B | 524947683 B | 
1551  |
   | List      | duration |           |           |             |             | 
1     |
   | List      | size     |           |           |             |             | 
1     |
   
+-----------+----------+-----------+-----------+-------------+-------------+-------+
   > select count(*) from overture_partitioned;
   +------------+
   | count(*)   |
   +------------+
   | 4219677254 |
   +------------+
   1 row(s) fetched.
   Elapsed 0.815 seconds.
   
   Object Store Profiling
   Instrumented Object Store: instrument_mode: Summary, inner: 
AmazonS3(overturemaps-us-west-2)
   Summaries:
   +-----------+----------+-----+-----+-----+-----+-------+
   | Operation | Metric   | min | max | avg | sum | count |
   +-----------+----------+-----+-----+-----+-----+-------+
   | List      | duration |     |     |     |     | 1     |
   | List      | size     |     |     |     |     | 1     |
   +-----------+----------+-----+-----+-----+-----+-------+
   >
   ```
   
   List operations
   | Action | `main` | this PR |
   | ---- | ---- | ---- |
   | Create Table | 3 | 3 |
   | Query | 22 | 1 |
   
   ## What changes are included in this PR?
   
    - Refactored helpers related to listing, discovering, and pruning objects 
based on partitions to normalize the strategy between partitioned and flat 
tables
   
   ## Are these changes tested?
   
   Yes. The internal methods that have been modified are covered by existing 
tests.
   
   ## Are there any user-facing changes?
   
   No
   
   ## Additional Notes
   
   I want to surface that I believe there is a chance for a performance 
_regression_ for certain queries against certain tables. One performance 
related mechanism the existing code implements, but this code currently omits, 
is (potentially) reducing the number of partitions listed based on query 
filters. In order for the existing code to exercise this optimization the query 
filters must contain all the path elements of a subdirectory as column filters. 
E.g.
   
   Given a table with a directory-partitioning structure like: 
   ```
   path/to/table/a=1/b=2/c=3/data.parquet
   ```
   This query:
   ```sql
   select count(*) from table where a=1 and b=2;
   ```
   Will result in listing the following path:
   ```
   LIST: path/to/table/a=1/b=2/
   ```
   
   Whereas this query:
   ```sql
   select count(*) from table where b=2;
   ```
   Will result in listing the following path:
   ```
   LIST: path/to/table/
   ```
   
   I believe the real-world impact of this omission is likely minimal, at least 
when using high-latency storage such as S3 or other object stores, especially 
considering the existing implementation is likely to execute multiple 
sequential `LIST` operations due to its breadth-first search implementation. 
The most likely configuration for a table that would be negatively impacted 
would be a table that holds many thousands of underlying objects (most cloud 
stores return recursive list requests with page sizes of many hundreds to 
thousands of objects) with a relatively shallow partition structure. I may be 
able to find or build a dataset that fulfills these criteria to test this 
assertion if there's concern about it.
   
   I believe we could also augment the existing low-level `object_store` 
interactions to allow listing a prefix on a table, which would allow the same 
pruning of list operations with the code in this PR. The downside to this 
approach is it either complicates future caching efforts, or leads to cache 
fragmentation in a simpler cache implementation. I didn't include these changes 
in this PR to avoid the change set being too large.
   
   ##
   cc @alamb 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to