Re: [I] [EPIC] Improve the performance of ListingTable [arrow-datafusion]

via GitHub Sat, 06 Apr 2024 02:11:05 -0700


alamb commented on issue #9964:
URL: 
https://github.com/apache/arrow-datafusion/issues/9964#issuecomment-2041025752


   > The recent change in https://github.com/apache/arrow-datafusion/pull/9912 
uses 10 random files to infer the partition columns, this means that we may 
fail to catch corrupted/manually-changed partitions on table creation 
(shouldn't be a common case). This is because ObjectStore only provides list 
function to retrieve objects.
   
   I think `ObjectStore` only provides a LIST API because that is the 
functionality offered by S3, GCP, etc. 
   
   > The recent change in https://github.com/apache/arrow-datafusion/pull/9912 
uses 10 random files to infer the partition columns, this means that we may 
fail to catch corrupted/manually-changed partitions on table creation 
(shouldn't be a common case).
   
   In this particular case, I think different users have different needs (some 
might want to pay the cost of a full validation, while others might be happy 
with just a sanity check). One approach would be to add a config option to 
DataFusion that controls the maximum number of paths to check when creating a 
listing table 🤔 
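
   As a rough sketch of what such an option could look like (all names here, 
such as `PartitionInferenceOptions` and `max_paths_to_check`, are hypothetical 
and not existing DataFusion APIs, and for simplicity it checks the first N 
listed paths rather than a random sample):

```rust
/// Hypothetical option: the maximum number of listed paths to inspect
/// when inferring partition columns. `None` means check every path.
#[derive(Debug, Clone, Copy)]
pub struct PartitionInferenceOptions {
    pub max_paths_to_check: Option<usize>,
}

/// Extract Hive-style partition column names (`key=value` segments)
/// from a single object path.
fn partition_columns(path: &str) -> Vec<&str> {
    path.split('/')
        .filter_map(|segment| segment.split_once('=').map(|(key, _)| key))
        .collect()
}

/// Infer partition columns from at most `max_paths_to_check` paths,
/// returning an error if any checked path disagrees with the first one.
pub fn infer_partition_columns<'a>(
    paths: &[&'a str],
    options: PartitionInferenceOptions,
) -> Result<Vec<&'a str>, String> {
    let limit = options.max_paths_to_check.unwrap_or(paths.len());
    let mut checked = paths.iter().take(limit).copied();

    // With no paths to check there is nothing to infer.
    let expected = match checked.next() {
        Some(first) => partition_columns(first),
        None => return Ok(Vec::new()),
    };

    for path in checked {
        let columns = partition_columns(path);
        if columns != expected {
            return Err(format!(
                "inconsistent partition columns in {path}: \
                 expected {expected:?}, found {columns:?}"
            ));
        }
    }
    Ok(expected)
}
```

   With a small limit, a malformed path that falls outside the checked prefix 
goes unnoticed (the sanity-check behavior), while `None` pays the full cost and 
catches it.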
   
   > Curious to know how we feel about upstream changes for such non-critical 
changes in DF?
   
   In general, I think upstream changes are great to propose if they help more 
than just DataFusion.


