BlakeOrth commented on PR #18146:
URL: https://github.com/apache/datafusion/pull/18146#issuecomment-3423770018

   > Thank you @BlakeOrth
   > 
   > > tl;dr of the issue: normalizing the access pattern(s) for objects for
   > > partitioned tables should not only reduce the number of requests to a backing
   > > object store, but will also allow any existing and/or future caching mechanisms
   > > to apply equally to both directory-partitioned and flat tables.
   > 
   > I don't fully understand this. Is the idea that the current code will do
   > something like
   > 
   > ```
   > LIST path/to/table/a=1/b=2/c=3/
   > ```
   > 
   > But if we aren't more clever the basic cache will just have a list like
   > 
   > ```
   > LIST path/to/table/
   > ```
   > 
   > (and thus not be able to satisfy the request)?
   > 
   > It seems to me that we may have to implement prefix listing on the files
   > cache as well, to avoid causing regressions in existing functionality.
   
   @alamb In the current code, a query with suitable filters can issue a targeted request such as
   ```
   LIST path/to/table/a=1/b=2/c=3/
   ```
   A table listed this way cannot take advantage of any list files caching (at least as 
implemented) because the cache mechanisms don't exist for tables with partition 
columns. However, the current code _can_ reduce the number of `LIST` operations 
for this table given appropriate query filters.
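
As a rough illustration of how query filters can narrow a `LIST` request, here is a minimal sketch. It is not DataFusion's actual code: `list_prefix` is a hypothetical helper, and it assumes Hive-style `col=value` directories and only equality filters on the leading partition columns.

```rust
use std::collections::HashMap;

/// Build the deepest LIST prefix that equality filters allow.
/// Hypothetical helper for illustration only, not a DataFusion API.
fn list_prefix(
    table_root: &str,
    partition_cols: &[&str],
    eq_filters: &HashMap<&str, &str>,
) -> String {
    let mut prefix = String::from(table_root);
    if !prefix.ends_with('/') {
        prefix.push('/');
    }
    // Walk the partition columns in directory order and stop at the first
    // column without an equality filter: deeper levels cannot be pinned.
    for col in partition_cols {
        match eq_filters.get(*col) {
            Some(value) => {
                prefix.push_str(col);
                prefix.push('=');
                prefix.push_str(value);
                prefix.push('/');
            }
            None => break,
        }
    }
    prefix
}
```

With filters on `a` and `b` only, the sketch yields `path/to/table/a=1/b=2/`, so one narrower `LIST` replaces a listing of the whole table.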
   
   The code in this PR would enable a simple implementation of the list files 
cache to store a key for _all_ objects under 
   ```
   LIST path/to/table/
   ```
   and continue to appropriately filter cached results based on query filters. 
However, it would (again, as written) remove the ability to list specific 
prefixes based on query filters.
   
   > It seems to me that we may have to implement prefix listing on the files
   > cache as well, to avoid causing regressions in existing functionality.
   
   If we implemented the ability to list a specific prefix in a table, I think 
any cache would also need to be "prefix aware". Otherwise we've more or less 
just made a lateral move where caching may apply to flat tables but not 
directory-partitioned tables.
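
One possible shape for a "prefix aware" cache is sketched below. `PrefixAwareListCache` is a hypothetical type, not an existing DataFusion structure, and the sketch assumes every prefix ends with `/` so that `a=1/` never matches `a=10/`.

```rust
use std::collections::HashMap;

/// Hypothetical cache keyed by the prefix that was listed.
#[derive(Default)]
struct PrefixAwareListCache {
    listings: HashMap<String, Vec<String>>,
}

impl PrefixAwareListCache {
    /// Record the objects returned by listing `listed_prefix`.
    fn put(&mut self, listed_prefix: &str, objects: Vec<String>) {
        self.listings.insert(listed_prefix.to_string(), objects);
    }

    /// Answer a request for `requested_prefix` from any cached listing
    /// taken at that prefix or at one of its ancestors. Returns `None` on
    /// a miss, in which case the caller would go to the object store.
    fn get(&self, requested_prefix: &str) -> Option<Vec<&str>> {
        self.listings
            .iter()
            .find(|(cached_prefix, _)| requested_prefix.starts_with(cached_prefix.as_str()))
            .map(|(_, objects)| {
                objects
                    .iter()
                    .map(String::as_str)
                    .filter(|path| path.starts_with(requested_prefix))
                    .collect()
            })
    }
}
```

Under these assumptions, a cached `LIST path/to/table/` can satisfy a later request for `path/to/table/a=1/b=2/c=3/`, while a request outside any cached prefix is a miss.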
   
   Does that help clarify this a bit? I hope I understood your question 
correctly. If we need more clarification on something I can probably put 
together and annotate some queries against a hypothetical table to help make 
this all a bit more clear.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

