suremarc commented on PR #15683:
URL: https://github.com/apache/datafusion/pull/15683#issuecomment-2800524158

   > That is, users should ensure that the output ordering is correct.
   
   One current user is `ListingTable`, which I don't believe makes any such 
guarantee, so we would have to fix `ListingTable` to ensure its 
`output_ordering` is correct. Even if we did, I would worry about breaking 
third-party users of `FileScanConfig` that rely on DataFusion to perform this 
check. 
   
   > Based on current code and doc, my understanding is that the 
`output_ordering`, aka. `file_sort_order` refers to the file order, that is, if 
we specify the output_ordering, what we can ensure is that the data in a single 
file is ordered. 
https://datafusion.apache.org/user-guide/sql/ddl.html#cautions-when-using-the-with-order-clause
   > 
   > So why `expect no "output_ordering" clause in the physical_plan -> 
ParquetExec due to there being more files than partitions`? 🤔
   
   In the past we only added an `output_ordering` to the plan if each file 
group had at most one file, since concatenating files is not otherwise 
guaranteed to preserve order. Later we relaxed this by using file statistics 
to check whether the files in a group are non-overlapping and ordered with 
respect to the sort keys. 
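   
   To make the two rules concrete, here is a minimal sketch. It is not 
DataFusion's actual code: `FileRange`, `ordered_by_file_count`, and 
`ordered_by_statistics` are hypothetical names, the sort key is assumed to be a 
single ascending `i64` column, and treating a shared boundary value as 
acceptable is also an assumption. 

```rust
/// Hypothetical min/max statistics of one file for a single sort key column.
#[derive(Debug, Clone, Copy)]
struct FileRange {
    min: i64,
    max: i64,
}

/// Older rule: a file group keeps the declared ordering only when it
/// contains at most one file, so no concatenation can break the order.
fn ordered_by_file_count(group: &[FileRange]) -> bool {
    group.len() <= 1
}

/// Relaxed rule: the files, in the order they will be read, must not
/// overlap and must be ascending on the sort key. Each file's max is
/// compared with the next file's min (ties allowed here by assumption).
fn ordered_by_statistics(group: &[FileRange]) -> bool {
    group.windows(2).all(|pair| pair[0].max <= pair[1].min)
}
```

   Under this sketch, a group with ranges `[0, 5]` and `[6, 9]` passes the 
relaxed check, while `[0, 5]` followed by `[3, 9]` does not, so a sort would 
still be needed for the latter. 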
   
   I do think the current behavior is a potential pain point, because sometimes 
users know their files are ordered but can't get DataFusion to avoid the sort. 
However, this has been DataFusion's behavior for a long time (at least the 
two years I have been using it), so if we were to change it, I would prefer 
that we do it in a way that makes the behavior change _very obvious_. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

