alamb commented on PR #18146:
URL: https://github.com/apache/datafusion/pull/18146#issuecomment-3491578094

   
   > @alamb This path forward sounds good to me. For the follow-on PRs I 
believe the changes needed to implement the smarter prefixed listing are 
actually relatively simple, and will have implications for how caching is 
ultimately handled. I would personally recommend we implement that small 
optimization prior to implementing caching. Does that sound like a reasonable 
implementation order?
   
   Yes for sure.
   
   Note that you don't have to be the only one implementing this. I think we 
could implement caching in parallel with the smarter use of partitioning values 
for `LIST`ing.
   
   > > This is likely to work well for tables with fewer than 1000 files (the 
maximum number of results that comes back in a single LIST request). However, 
when there are many more files, this PR will likely take longer, as it will list 
ALL files present with sequential LIST operations, whereas main will issue 
concurrent LIST operations.
   > 
   > I agree with this assessment of the performance implications. I think 
there is an additional subtle performance improvement here, where this 
implementation allows better downstream concurrency in all cases. The previous 
implementation effectively removed any benefits of the files coming back as a 
stream because it had to complete at least the initial list operation(s) fully 
prior to yielding any elements on the stream, whereas this implementation will 
(in most cases) begin yielding elements on the stream at the first request.
   
   That is an interesting point, though this PR won't interleave IO and CPU 
the way the previous one did. Realistically, though, the amount of processing 
per response is pretty small.
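   
   To make the quoted streaming point concrete, here is a minimal synchronous 
sketch. It is not code from this PR: `fetch_page`, `list_eager`, and `list_lazy` 
are hypothetical names, the `Cell` counter only stands in for real LIST round 
trips, and an actual implementation would use an async stream rather than an 
`Iterator`. It assumes a page size of 1000, as mentioned above.

```rust
use std::cell::Cell;

// Hypothetical stand-in for one paginated LIST response. Returns the file
// names on page `page` and bumps `requests` to count simulated round trips.
fn fetch_page(page: usize, page_size: usize, total: usize, requests: &Cell<usize>) -> Vec<String> {
    requests.set(requests.get() + 1);
    let start = page * page_size;
    let end = (start + page_size).min(total);
    (start..end).map(|i| format!("file-{i}.parquet")).collect()
}

// Number of pages needed to list `total` files (ceiling division).
fn page_count(total: usize, page_size: usize) -> usize {
    (total + page_size - 1) / page_size
}

// Eager: every page is fetched before the caller sees a single file.
fn list_eager(total: usize, page_size: usize, requests: &Cell<usize>) -> Vec<String> {
    (0..page_count(total, page_size))
        .flat_map(|p| fetch_page(p, page_size, total, requests))
        .collect()
}

// Lazy: a page is fetched only when the consumer reaches it, so the first
// file is available after the first request.
fn list_lazy<'a>(
    total: usize,
    page_size: usize,
    requests: &'a Cell<usize>,
) -> impl Iterator<Item = String> + 'a {
    (0..page_count(total, page_size))
        .flat_map(move |p| fetch_page(p, page_size, total, requests))
}

fn main() {
    let (total, page_size) = (2500, 1000);

    let eager_requests = Cell::new(0);
    let first = list_eager(total, page_size, &eager_requests).into_iter().next();
    println!("eager: first = {:?} after {} request(s)", first, eager_requests.get());

    let lazy_requests = Cell::new(0);
    let first = list_lazy(total, page_size, &lazy_requests).next();
    println!("lazy:  first = {:?} after {} request(s)", first, lazy_requests.get());
}
```

   With 2500 files, the eager version issues 3 simulated requests before the 
first file is available, while the lazy version issues 1. Both still page 
sequentially, which is the trade-off against concurrent LIST operations noted 
above.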
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

