suremarc commented on PR #15683: URL: https://github.com/apache/datafusion/pull/15683#issuecomment-2800524158
> That is, users should ensure that the output ordering is correct. One of the users as of now is `ListingTable`, which I don't believe makes any such guarantees, so we would have to fix `ListingTable` to ensure the `output_ordering` is correct. But even if we did, I would worry about breaking third-party users of `FileScanConfig` that rely on DataFusion to perform this check.

> Based on current code and doc, my understanding is that the `output_ordering`, aka. `file_sort_order` refers to the file order, that is, if we specify the output_ordering, what we can ensure is that the data in a single file is ordered. https://datafusion.apache.org/user-guide/sql/ddl.html#cautions-when-using-the-with-order-clause
>
> So why `expect no "output_ordering" clause in the physical_plan -> ParquetExec due to there being more files than partitions`? 🤔

In the past we only added an `output_ordering` to the plan if each file group had at most 1 file. Otherwise concatenating files isn't guaranteed to preserve order. Later we relaxed this by looking at statistics to see if the files are nonoverlapping and ordered with respect to the sort keys.

I do think the current behavior is a potential pain point, because sometimes users know their files are ordered, but can't get DataFusion to avoid the sort. However, this has been the behavior in DataFusion for a long time (at least the 2 years I have been using it), so if we were to change it, I would prefer if we do it in a way that makes this behavior change _very obvious_.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
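
As an illustration of the rule described in the comment above (keep `output_ordering` only when every file group has at most one file, or when statistics show that the files in each group are ordered and nonoverlapping on the sort key), here is a minimal Rust sketch. It is not DataFusion's implementation: `FileStats`, `group_preserves_order`, and `can_report_output_ordering` are hypothetical names, a single ascending `i64` sort key is assumed, and treating equal boundary values as nonoverlapping is also an assumption.

```rust
/// Hypothetical per-file summary: the min and max of the sort key,
/// as they might be read from file statistics. Not a DataFusion type.
#[derive(Debug, Clone, Copy)]
struct FileStats {
    min: i64,
    max: i64,
}

/// A file group keeps its order when concatenated if, reading the files in
/// their listed order, each file ends no later than the next one begins.
/// Groups with zero or one file pass trivially (there are no adjacent pairs).
/// Assumption: `<=` is used at the boundary; a stricter check would use `<`.
fn group_preserves_order(files: &[FileStats]) -> bool {
    files.windows(2).all(|pair| pair[0].max <= pair[1].min)
}

/// The plan may report an output ordering only if every group preserves order.
fn can_report_output_ordering(groups: &[Vec<FileStats>]) -> bool {
    groups.iter().all(|group| group_preserves_order(group))
}
```

Under these assumptions, a group whose files cover 0 to 10 and then 11 to 20 passes, while a group whose files cover 0 to 10 and then 5 to 20 fails, so a sort would still be required for the second case.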
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org