[GitHub] [arrow] bkietz commented on a change in pull request #10693: ARROW-13224: [Python][Doc] Documentation missing for pyarrow.dataset.write_dataset

GitBox Wed, 14 Jul 2021 10:45:26 -0700


bkietz commented on a change in pull request #10693:
URL: https://github.com/apache/arrow/pull/10693#discussion_r669825752




##########
File path: docs/source/python/dataset.rst
##########
@@ -456,20 +456,163 @@ is materialized as columns when reading the data and can 
be used for filtering:
     dataset.to_table().to_pandas()
     dataset.to_table(filter=ds.field('year') == 2019).to_pandas()
 
+Another benefit of manually scheduling the files is that the order of the files
+controls the order of the data.  When performing an ordered read (or a read to
+a table) then the rows returned will match the order of the files given.  This
+only applies when the dataset is constructed with a list of files.  There
+are no order guarantees given when the files are instead discovered by scanning

Review comment:
       We don't guarantee order for selectors because ARROW-8163 (asynchronous 
fragment discovery) might not preserve it. Lexicographic sorting *could* be 
maintained for synchronous discovery from a selector, but in general we'd want 
to push each fragment into the scan as soon as it's yielded by 
`FileSystem::GetFileInfoGenerator`, which leaves no point at which the full 
listing could be sorted.
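
To illustrate the trade-off, here is a toy model in plain Python. It is not Arrow's implementation: `list_directory`, `discover_streaming`, `discover_sorted` and the file names are all hypothetical, and `asyncio.sleep` merely stands in for filesystem latency. It only shows why emitting files as soon as a listing completes tends to lose lexicographic order, while waiting for everything allows a sort.

```python
import asyncio

# Toy model only; none of these names exist in Arrow.

async def list_directory(name, delay, files):
    """Pretend to list one directory; `delay` stands in for filesystem latency."""
    await asyncio.sleep(delay)
    return [f"{name}/{f}" for f in files]

async def discover_streaming(listings):
    """Collect files as soon as each directory listing completes."""
    found = []
    tasks = [asyncio.create_task(list_directory(*args)) for args in listings]
    for finished in asyncio.as_completed(tasks):
        # A real scanner would push these files into the scan right here.
        found.extend(await finished)
    return found

async def discover_sorted(listings):
    """Wait for every listing, then sort lexicographically before scanning."""
    batches = await asyncio.gather(*(list_directory(*args) for args in listings))
    return sorted(path for batch in batches for path in batch)

LISTINGS = [
    ("year=2018", 0.05, ["a.parquet", "b.parquet"]),  # slow directory
    ("year=2019", 0.01, ["a.parquet"]),               # fast directory
]

streamed = asyncio.run(discover_streaming(LISTINGS))
ordered = asyncio.run(discover_sorted(LISTINGS))

print("streamed:", streamed)
print("sorted:  ", ordered)
```

In this sketch the faster `year=2019` listing is emitted before `year=2018` in the streaming variant, whereas the sorted variant cannot emit anything until the slowest listing has returned.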




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
