[PR] Fix sequential metadata fetching in ListingTable causing high latency [datafusion]

via GitHub Thu, 27 Feb 2025 05:58:16 -0800


geoffreyclaude opened a new pull request, #14918:
URL: https://github.com/apache/datafusion/pull/14918


   ## Which issue does this PR close?
   
   - Closes #14916.
   
   ## Rationale for this change
   
   When scanning an exact list of remote Parquet files, `ListingTable` fetched 
file metadata (via `head` calls) sequentially. The cause was 
`stream::iter(file_list).flatten()`, which drains each one-item stream in 
order before polling the next. On remote blob stores, where each `head` call 
can take tens to hundreds of milliseconds, this sequential behavior 
significantly increased the time to create the physical plan.
   
   ## What changes are included in this PR?
   
   This PR replaces the sequential flattening with concurrent merging using 
`futures::stream::select_all(file_list)`. With this change, the `head` requests 
run concurrently (up to the configured `meta_fetch_concurrency` limit), 
reducing latency when creating the physical plan.
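
To illustrate why the concurrency limit matters, here is a minimal std-only sketch that models sequential versus bounded-concurrent `head` calls with threads. It is not the DataFusion code and does not use the `futures` stream combinators named above; `fake_head`, `fetch_all`, the 20 ms delay, and the returned "size" are all hypothetical stand-ins chosen for the example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a remote `head` call: sleeps to simulate
/// network latency, then returns the path with a fake size (its length).
/// Also tracks how many calls are in flight at once.
fn fake_head(path: &str, in_flight: &AtomicUsize, peak: &AtomicUsize) -> (String, usize) {
    let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
    peak.fetch_max(now, Ordering::SeqCst);
    thread::sleep(Duration::from_millis(20)); // assumed latency, for illustration only
    in_flight.fetch_sub(1, Ordering::SeqCst);
    (path.to_string(), path.len())
}

/// Fetches metadata for every path with at most `limit` calls in flight.
/// Returns the sorted results and the peak number of simultaneous calls.
/// `limit == 1` models the old sequential behavior; a larger `limit`
/// models a bound in the spirit of `meta_fetch_concurrency`.
fn fetch_all(paths: &[&str], limit: usize) -> (Vec<(String, usize)>, usize) {
    let next = AtomicUsize::new(0); // index of the next path to claim
    let in_flight = AtomicUsize::new(0);
    let peak = AtomicUsize::new(0);
    let results = Mutex::new(Vec::new());
    let workers = limit.max(1).min(paths.len());

    thread::scope(|s| {
        for _ in 0..workers {
            s.spawn(|| loop {
                // Each worker claims the next unprocessed path until none remain.
                let i = next.fetch_add(1, Ordering::SeqCst);
                if i >= paths.len() {
                    break;
                }
                let meta = fake_head(paths[i], &in_flight, &peak);
                results.lock().unwrap().push(meta);
            });
        }
    });

    let mut out = results.into_inner().unwrap();
    out.sort(); // completion order is nondeterministic, so sort for a stable result
    (out, peak.load(Ordering::SeqCst))
}
```

Calling `fetch_all(&paths, 1)` issues one call at a time, so total time grows linearly with the number of files, while `fetch_all(&paths, 4)` returns the same results with up to four calls overlapping. The same trade-off applies to the stream-based change in this PR, although the real implementation is async rather than thread-based.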
   
   ## Are these changes tested?
   
   Tests have been updated to ensure that metadata fetching occurs concurrently.
   
   ## Are there any user-facing changes?
   
   No user-facing changes besides reducing the latency in this particular 
situation.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

