nealrichardson commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r682608693
##########
File path: r/vignettes/dataset.Rmd
##########
@@ -259,47 +270,58 @@ See $.data for the source Arrow object
")
```
-This returns instantly and shows the manipulations you've made, without
+This code returns an output instantly and shows the manipulations you've made, without
loading data from the files. Because the evaluation of these queries is deferred,
you can build up a query that selects down to a small subset without generating
intermediate datasets that would potentially be large.
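For illustration, a minimal sketch of such a deferred query, assuming the partitioned NYC taxi files and columns used earlier in the vignette (the `"nyc-taxi"` path is taken from those examples):

```r
library(arrow)
library(dplyr)

# Assumes the partitioned NYC taxi files from earlier in the vignette
ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))

# Each verb only records a step in the query; no file data is read yet
q <- ds %>%
  filter(total_amount > 100, year == 2015) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  mutate(tip_pct = 100 * tip_amount / total_amount)

# Printing the query shows the planned manipulations, still without loading data
q
```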
Second, all work is pushed down to the individual data files,
and depending on the file format, chunks of data within the files. As a result,
-we can select a subset of data from a much larger dataset by collecting the
-smaller slices from each file--we don't have to load the whole dataset in memory
-in order to slice from it.
+you can select a subset of data from a much larger dataset by collecting the
+smaller slices from each file - you don't have to load the whole dataset in
+memory to slice from it.
-Third, because of partitioning, we can ignore some files entirely.
+Third, because of partitioning, you can ignore some files entirely.
In this example, by filtering `year == 2015`, all files corresponding to other years
-are immediately excluded: we don't have to load them in order to find that no
+are immediately excluded: you don't have to load them in order to find that no
rows match the filter. Relatedly, since Parquet files contain row groups with
-statistics on the data within, there may be entire chunks of data we can
+statistics on the data within, there may be entire chunks of data you can
avoid scanning because they have no rows where `total_amount > 100`.
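As a concrete sketch of what happens at scan time (same assumed `"nyc-taxi"` dataset as above), collecting a filtered query only opens the files under `year=2015`, and Parquet row groups whose statistics rule out `total_amount > 100` can be skipped:

```r
library(arrow)
library(dplyr)

ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))

# collect() triggers the actual read: only files for year == 2015 are scanned,
# and row groups whose statistics show no total_amount > 100 can be skipped
result <- ds %>%
  filter(year == 2015, total_amount > 100) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  collect()
```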
## More dataset options
There are a few ways you can control the Dataset creation to adapt to special use cases.
-For one, if you are working with a single file or a set of files that are not
-all in the same directory, you can provide a file path or a vector of multiple
-file paths to `open_dataset()`. This is useful if, for example, you have a
-single CSV file that is too big to read into memory. You could pass the file
-path to `open_dataset()`, use `group_by()` to partition the Dataset into
-manageable chunks, then use `write_dataset()` to write each chunk to a separate
-Parquet file---all without needing to read the full CSV file into R.
-
-You can specify a `schema` argument to `open_dataset()` to declare the columns
-and their data types. This is useful if you have data files that have different
-storage schema (for example, a column could be `int32` in one and `int8` in another)
-and you want to ensure that the resulting Dataset has a specific type.
-To be clear, it's not necessary to specify a schema, even in this example of
-mixed integer types, because the Dataset constructor will reconcile differences like these.
-The schema specification just lets you declare what you want the result to be.
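For illustration, a minimal sketch of passing a schema, with a hypothetical `"data-dir"` path and hypothetical column names:

```r
library(arrow)

# Hypothetical path and column names; the schema declares the types you want
# the resulting Dataset to have, rather than relying on what is inferred
ds <- open_dataset(
  "data-dir",
  schema = schema(id = int32(), value = float64(), category = utf8())
)
```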
+
+### Work with files in a directory
+
+If you are working with a single file or a set of files that are not all in the
+same directory, you can provide a file path or a vector of multiple file paths
+to `open_dataset()`. This is useful if, for example, you have a single CSV file
+that is too big to read into memory. You could pass the file path to
+`open_dataset()`, use `group_by()` to partition the Dataset into manageable chunks,
+then use `write_dataset()` to write each chunk to a separate Parquet file - all
Review comment:
```suggestion
then use `write_dataset()` to write each chunk to a separate Parquet file—all
```
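For illustration, a minimal sketch of the CSV-to-Parquet workflow described in that paragraph, assuming a hypothetical `large-file.csv` that contains a `year` column:

```r
library(arrow)
library(dplyr)

# Hypothetical: a CSV too large to read into memory, opened as a Dataset
big_csv <- open_dataset("large-file.csv", format = "csv")

# The grouping variable becomes the partitioning; write_dataset() streams each
# group to its own Parquet file without loading the full CSV into R
big_csv %>%
  group_by(year) %>%
  write_dataset("partitioned-data", format = "parquet")
```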
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]