BlakeOrth commented on PR #18146:
URL: https://github.com/apache/datafusion/pull/18146#issuecomment-3487838551

   > Thus, what I suggest as action item is:
   >
   >    1. We get the CI green (by moving the test into core_integration)
   >    2. Merge this PR (after we have branched for 
https://github.com/apache/datafusion/issues/17558 )
   >
   > As follow on PRs then
   >
   >    1. Implement caching of LIST results (tracked by 
https://github.com/apache/datafusion/issues/17211)
   >    2. Try and be smarter about the prefixes used in LISTs when we have 
equality predicates on partition columns
   
   @alamb This path forward sounds good to me. For the follow-on PRs, I believe 
the changes needed to implement the smarter prefixed listing are actually 
relatively simple, and they will have implications for how caching is ultimately 
handled. I would personally recommend we implement that small optimization 
prior to implementing caching. Does that sound like a reasonable implementation 
order?
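
To make the prefix idea concrete, here is a minimal sketch, assuming hive-style `col=value` directory layouts. The function `list_prefix` and its parameters are hypothetical and are not part of DataFusion's API; real code would also need to handle value encoding and non-string literals.

```rust
use std::collections::HashMap;

/// Hypothetical helper: build the most specific LIST prefix for a
/// hive-style partitioned table.
///
/// `partition_cols` is given in directory order. `equalities` maps a
/// partition column to the literal from a `col = 'value'` predicate.
/// The prefix can only be extended while consecutive leading partition
/// columns have an equality predicate; the first gap ends it.
fn list_prefix(
    table_root: &str,
    partition_cols: &[&str],
    equalities: &HashMap<&str, &str>,
) -> String {
    let mut prefix = table_root.trim_end_matches('/').to_string();
    for col in partition_cols {
        match equalities.get(*col) {
            Some(value) => {
                prefix.push('/');
                prefix.push_str(col);
                prefix.push('=');
                // Assumption: the value needs no escaping in the path.
                prefix.push_str(value);
            }
            // No equality on this column: deeper columns cannot narrow
            // the prefix any further.
            None => break,
        }
    }
    prefix.push('/');
    prefix
}
```

With partition columns `year`, `month`, `day` and predicates on `year` and `day` only, this sketch yields a prefix ending in `year=2024/`, since the missing `month` predicate stops the prefix from going deeper. A cache keyed on the LIST prefix would see these narrower keys, which is one way the optimization could interact with caching.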
   
   > This is likely to work well for tables with fewer than 1000 files (the 
maximum number of results that comes back in a single LIST request). However, 
when there are many more files this PR will likely take longer, as it will list 
ALL files present with sequential LIST operations (whereas main will issue 
concurrent LIST operations).
   
   I agree with this assessment of the performance implications. I think there 
is an additional, subtle performance improvement here: this implementation 
allows better downstream concurrency in all cases. The previous implementation 
effectively removed any benefit of the files coming back as a stream because 
it had to complete at least the initial list operation(s) fully before yielding 
any elements on the stream, whereas this implementation will (in most cases) 
begin yielding elements on the stream after the first request.
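
The difference can be illustrated with a small synchronous sketch. It uses plain iterators in place of async streams, and an in-memory slice of pages in place of real LIST responses; `lazy_list`, `eager_list`, and the request counter are all hypothetical names for illustration, not DataFusion code.

```rust
use std::cell::Cell;

/// Lazy listing: a simulated LIST request (one per page) is issued only
/// when the consumer pulls past the end of the previous page.
fn lazy_list<'a>(
    pages: &'a [Vec<&'static str>],
    requests: &'a Cell<usize>,
) -> impl Iterator<Item = &'static str> + 'a {
    pages.iter().flat_map(move |page| {
        // Count one simulated LIST request per page fetched.
        requests.set(requests.get() + 1);
        page.iter().copied()
    })
}

/// Eager listing: every page is fetched before the first element is
/// handed to the consumer.
fn eager_list(
    pages: &[Vec<&'static str>],
    requests: &Cell<usize>,
) -> std::vec::IntoIter<&'static str> {
    let all: Vec<&'static str> = lazy_list(pages, requests).collect();
    all.into_iter()
}
```

Pulling a single element from `lazy_list` over three pages costs one simulated request, while `eager_list` has already issued all three before returning anything. Downstream work on the first files can therefore start earlier in the lazy case.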


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

