tiago-ssantos opened a new issue, #5779:
URL: https://github.com/apache/arrow-datafusion/issues/5779

   ### Describe the bug
   
   When inferring a [schema](https://github.com/apache/arrow-datafusion/blob/c8a3d589889dd1e67047de89db8b4ff56f90f04c/datafusion/core/src/datasource/listing/table.rs#L431),
[list_all_files](https://github.com/apache/arrow-datafusion/blob/c8a3d589889dd1e67047de89db8b4ff56f90f04c/datafusion/core/src/datasource/listing/url.rs#L143)
uses an object store to list the files, and no sort order is applied to the result.
   When the object store is a LocalFileSystem, the order of the returned files is
not guaranteed (macOS returns the list in a different order than Windows). This
means that the inferred schema can differ for the same set of files.
   
   We contacted the object store maintainers (https://github.com/apache/arrow-rs/issues/3975),
who pointed out that the fix should be implemented in the caller of the
method, by applying a sort of some kind, to keep results consistent across file
systems.
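
As a rough illustration of sorting on the caller side, here is a minimal sketch using only the standard library. The helper name `list_files_sorted` is hypothetical and is not part of DataFusion or `object_store`; it only shows the idea of ordering a directory listing before it is used.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper (not a DataFusion or object_store API):
/// list the entries of a directory in a deterministic order.
fn list_files_sorted(dir: &Path) -> io::Result<Vec<PathBuf>> {
    // `read_dir` yields entries in a platform- and filesystem-dependent order.
    let mut files = fs::read_dir(dir)?
        .map(|entry| entry.map(|e| e.path()))
        .collect::<io::Result<Vec<PathBuf>>>()?;
    // Sorting the paths makes the result independent of the OS.
    files.sort();
    Ok(files)
}
```

With a sort like this in place, `file1.parquet` would always be listed before `file3.parquet`, whatever order the file system returns them in.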
   
   ### To Reproduce
   
   Given two Parquet files in the filesystem with the following schemas:
   - file1.parquet
   ```
   {
     "type" : "record",
     "name" : "root",
     "fields" : [ {
       "name" : "year",
       "type" : [ "null", "int" ],
       "default" : null
     }, {
       "name" : "description",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "code",
       "type" : [ "null", "long" ],
       "default" : null
     } ]
   }
   ```
   - file3.parquet
   ```
   {
     "type" : "record",
     "name" : "root",
     "fields" : [ {
       "name" : "description",
       "type" : [ "null", "string" ],
       "default" : null
     }, {
       "name" : "code",
       "type" : [ "null", "long" ],
       "default" : null
     }, {
       "name" : "year",
       "type" : [ "null", "int" ],
       "default" : null
     } ]
   }
   ```
   
   and executing:
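   ```
   use std::sync::Arc;

   use datafusion::datasource::file_format::parquet::ParquetFormat;
   use datafusion::datasource::listing::{ListingOptions, ListingTableUrl};
   use datafusion::prelude::SessionContext;

   #[tokio::test]
   async fn infer_schema() {
       // Directory containing file1.parquet and file3.parquet
       let path = ListingTableUrl::parse("./files").unwrap();
       let ctx = SessionContext::new();
       let state = ctx.state();
       let options = ListingOptions::new(Arc::new(ParquetFormat::default()));

       // The schema is inferred from the files in listing order
       let schema = options.infer_schema(&state, &path).await.unwrap();

       // Print the field names in the order of the inferred schema
       schema.fields.iter().for_each(|field| println!("{}", field.name()));
   }
   ```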
   The result on macOS Ventura:
   ```
   description
   code
   year
   ```
   The first file picked up was file3.parquet.
   On Windows, the result is:
   ```
   year
   code
   description
   ```
   The first file picked up was file1.parquet.
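
To illustrate why listing order matters, here is a simplified model of schema merging in which fields are appended in first-seen order. This is an assumption made for illustration: `merge_field_names` is a hypothetical function, not DataFusion's or Arrow's actual merge code, and it is not meant to reproduce the exact outputs reported above.

```rust
/// Simplified, hypothetical model of schema merging:
/// each file contributes its field names, and a name is appended
/// to the merged list the first time it is seen.
fn merge_field_names(files: &[Vec<&str>]) -> Vec<String> {
    let mut merged: Vec<String> = Vec::new();
    for fields in files {
        for name in fields {
            // Keep only the first occurrence of each field name.
            if !merged.iter().any(|m| m.as_str() == *name) {
                merged.push(name.to_string());
            }
        }
    }
    merged
}
```

Under this model, the merged field order follows whichever file is listed first, so two listings of the same files in different orders produce differently ordered schemas.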
   
   
   ### Expected behavior
   
   The same schema regardless of the OS on which the code runs. A sort should be
enforced, or callers should at least be able to pass a sort function.
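
One possible shape for the second option, sketched under assumptions: the listing step accepts a caller-supplied comparator, with a lexicographic one available as the default. The names `order_files` and `lexicographic` are hypothetical and do not exist in DataFusion.

```rust
use std::cmp::Ordering;

/// Hypothetical default ordering: plain lexicographic comparison of paths.
fn lexicographic(a: &str, b: &str) -> Ordering {
    a.cmp(b)
}

/// Hypothetical helper: order listed file paths with a caller-supplied comparator.
fn order_files<F>(mut files: Vec<String>, compare: F) -> Vec<String>
where
    F: Fn(&str, &str) -> Ordering,
{
    // `sort_by` is stable, so entries that compare equal keep their listed order.
    files.sort_by(|a, b| compare(a.as_str(), b.as_str()));
    files
}
```

A caller that does not care about the exact order could pass `lexicographic`, while a caller with a naming convention (for example, newest file first) could pass its own closure.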
   
   ### Additional context
   
   _No response_

