[GitHub] [arrow] westonpace commented on a change in pull request #10693: ARROW-13224: [Python][Doc] Documentation missing for pyarrow.dataset.write_dataset

GitBox Tue, 13 Jul 2021 20:27:05 -0700


westonpace commented on a change in pull request #10693:
URL: https://github.com/apache/arrow/pull/10693#discussion_r669252727




##########
File path: docs/source/python/dataset.rst
##########
@@ -456,20 +456,163 @@ is materialized as columns when reading the data and can be used for filtering:
     dataset.to_table().to_pandas()
     dataset.to_table(filter=ds.field('year') == 2019).to_pandas()
 
+Another benefit of manually scheduling the files is that the order of the files
+controls the order of the data.  When performing an ordered read (or a read to
+a table) then the rows returned will match the order of the files given.  This
+only applies when the dataset is constructed with a list of files.  There
+are no order guarantees given when the files are instead discovered by scanning
+a directory.
 
-Manual scheduling
------------------
+Iterative (out of core or streaming) reads
+------------------------------------------
 
-..
-    Possible content:
-    - fragments (get_fragments)
-    - scan / scan tasks / iterators of record batches
+The previous examples have demonstrated how to read the data into a table.  This is
+useful if the dataset is small or there is only a small amount of data that needs to
+be read.  The dataset API contains additional methods to read and process large amounts
+of data in a streaming fashion.
 
-The :func:`~Dataset.to_table` method loads all selected data into memory
-at once resulting in a pyarrow Table. Alternatively, a dataset can also be
-scanned one RecordBatch at a time in an iterative manner using the
-:func:`~Dataset.scan` method::
+The easiest way to do this is to use the method :meth:`Dataset.to_batches`.  This
+method returns an iterator of record batches.  For example, we can use this method to
+calculate the average of a column without loading the entire column into memory:
 
-    for scan_task in dataset.scan(columns=[...], filter=...):
-        for record_batch in scan_task.execute():
-            # process the record batch
+.. ipython:: python

Review comment:
   All the scripts I added execute quickly because they deal with tables of
   fewer than 10 rows, so I doubt they add significantly to the build times.

   As a test, I tried converting every ipython directive to code-block and saw
   no noticeable difference in build times.  I'd prefer ipython for the sake of
   testing, but I'm happy to go with whatever is decided in ARROW-13159.  Since
   it should be easy to change after the fact (just a find-replace from ipython
   to code-block), I'd rather address it after ARROW-13159 is resolved, if that
   is OK.




