alamb commented on issue #9964: URL: https://github.com/apache/arrow-datafusion/issues/9964#issuecomment-2041025752
> The recent change in https://github.com/apache/arrow-datafusion/pull/9912 uses 10 random files to infer the partition columns, this means that we may fail to catch corrupted/manually-changed partitions on table creation (shouldn't be a common case). This is because ObjectStore only provides list function to retrieve objects.

I think `ObjectStore` only provides a LIST API because that is the functionality offered by S3, GCP, etc.

> The recent change in https://github.com/apache/arrow-datafusion/pull/9912 uses 10 random files to infer the partition columns, this means that we may fail to catch corrupted/manually-changed partitions on table creation (shouldn't be a common case).

In this particular case, I think different users have different needs: some might want to pay the cost of a full validation, while others might be happy with just a sanity check.

One approach would be to add a config option to DataFusion that controls the maximum number of paths to check when creating a listing table 🤔

> Curious to know how we feel about upstream changes for such non-critical changes in DF?

In general I think upstream changes are great to propose if they help more than just DataFusion.
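
To make the config-option idea concrete, here is a minimal, standalone sketch of how a "maximum number of paths to check" setting might behave. Everything in it is an assumption for illustration: `PartitionCheckOptions`, `max_paths_to_check`, and `infer_partition_columns` are hypothetical names, not DataFusion APIs. For simplicity it inspects the first N paths rather than a random sample as the PR does. `None` stands for full validation and `Some(n)` for a cheaper sanity check.

```rust
/// Hypothetical option (not a real DataFusion setting): how many listed
/// paths to inspect when inferring partition columns. `None` = check all.
pub struct PartitionCheckOptions {
    pub max_paths_to_check: Option<usize>,
}

/// Collect Hive-style partition column names (`key=value` directory
/// segments) from an object path, ignoring the trailing file name.
fn partition_keys(path: &str) -> Vec<String> {
    let dir = path.rsplit_once('/').map(|(d, _)| d).unwrap_or("");
    dir.split('/')
        .filter_map(|seg| seg.split_once('=').map(|(k, _)| k.to_string()))
        .collect()
}

/// Infer partition columns from at most `max_paths_to_check` paths and
/// fail if any inspected path disagrees with the first one.
pub fn infer_partition_columns(
    paths: &[&str],
    opts: &PartitionCheckOptions,
) -> Result<Vec<String>, String> {
    let limit = opts.max_paths_to_check.unwrap_or(paths.len());
    let mut inspected = paths.iter().take(limit);

    // The first inspected path defines the expected partition columns.
    let expected = match inspected.next() {
        Some(p) => partition_keys(p),
        None => return Ok(Vec::new()),
    };

    for path in inspected {
        let found = partition_keys(path);
        if found != expected {
            return Err(format!(
                "inconsistent partition columns in {path}: expected {expected:?}, found {found:?}"
            ));
        }
    }
    Ok(expected)
}
```

With a limit of 2, a malformed third path goes unnoticed and inference succeeds. With no limit, the same listing is rejected. That is the cost-versus-coverage trade-off described above.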
