tiago-ssantos opened a new issue, #5779: URL: https://github.com/apache/arrow-datafusion/issues/5779
### Describe the bug

When inferring a [schema](https://github.com/apache/arrow-datafusion/blob/c8a3d589889dd1e67047de89db8b4ff56f90f04c/datafusion/core/src/datasource/listing/table.rs#L431), [list_all_files](https://github.com/apache/arrow-datafusion/blob/c8a3d589889dd1e67047de89db8b4ff56f90f04c/datafusion/core/src/datasource/listing/url.rs#L143) uses an object store to list the files. No sort order is passed. When the object store is a LocalFileSystem, there is no guarantee of any file ordering (the list returned on macOS is ordered differently from the one returned on Windows). This means that the inferred schema can be different for the same set of files.

We contacted the object store maintainers (https://github.com/apache/arrow-rs/issues/3975), who pointed out that the solution should be implemented in the caller of the method, applying a sort of any kind, to maintain consistency between file systems.

### To Reproduce

Having two parquet files in the filesystem with the following schemas:

- file1.parquet
```
{
  "type" : "record",
  "name" : "root",
  "fields" : [
    { "name" : "year", "type" : [ "null", "int" ], "default" : null },
    { "name" : "description", "type" : [ "null", "string" ], "default" : null },
    { "name" : "code", "type" : [ "null", "long" ], "default" : null }
  ]
}
```

- file3.parquet
```
{
  "type" : "record",
  "name" : "root",
  "fields" : [
    { "name" : "description", "type" : [ "null", "string" ], "default" : null },
    { "name" : "code", "type" : [ "null", "long" ], "default" : null },
    { "name" : "year", "type" : [ "null", "int" ], "default" : null }
  ]
}
```

and executing:

```
use std::sync::Arc;

use datafusion::datasource::file_format::parquet::ParquetFormat;
use datafusion::datasource::listing::{ListingOptions, ListingTableUrl};
use datafusion::prelude::SessionContext;

#[tokio::test]
async fn infer_schema() {
    // Directory containing file1.parquet and file3.parquet
    let path = ListingTableUrl::parse("./files").unwrap();

    let ctx = SessionContext::new();
    let state = ctx.state();

    let options = ListingOptions::new(Arc::new(ParquetFormat::default()));
    let schema = options.infer_schema(&state, &path).await.unwrap();

    // Print the field names in the order of the inferred schema
    schema.fields.iter().for_each(|field| println!("{0}", field.name()));
}
```

the result on macOS Ventura:

```
description
code
year
```

The first file picked up was file3.parquet.

On Windows:

```
year
code
description
```

The first file picked up was file1.parquet.

### Expected behavior

The same schema regardless of the OS on which the code runs. A sort should be enforced, or callers should at least be able to pass a sort function.

### Additional context

_No response_
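To illustrate the expected behavior, here is a minimal, std-only sketch of sorting the listing before the schemas are merged. It is not DataFusion code: `ListedFile` and `infer_field_order` are hypothetical names, and the "first-seen field order wins" merge is a simplified assumption about how the inferred schema ends up depending on listing order.

```rust
/// Hypothetical stand-in for one entry returned by the object store listing:
/// the file path plus the field names of that file's schema, in file order.
#[derive(Clone, Debug)]
struct ListedFile {
    path: String,
    fields: Vec<String>,
}

/// Sketch of an order-independent inference step (an assumption, not the
/// actual DataFusion implementation).
///
/// The listing is sorted by path first, so the "first" file is the same on
/// every file system. Field names are then merged in first-seen order.
fn infer_field_order(mut files: Vec<ListedFile>) -> Vec<String> {
    // Byte-wise lexicographic sort on the path makes the result deterministic.
    files.sort_by(|a, b| a.path.cmp(&b.path));

    let mut merged: Vec<String> = Vec::new();
    for file in &files {
        for name in &file.fields {
            // Keep only the first occurrence of each field name.
            if !merged.contains(name) {
                merged.push(name.clone());
            }
        }
    }
    merged
}
```

With this sorting step, passing the two files above in either listing order yields the field order of file1.parquet, because `files/file1.parquet` always sorts before `files/file3.parquet`. Letting the caller supply the comparison function instead of hard-coding the path sort would cover the "possibility of passing a sort function" variant.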
