nealrichardson commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r682608693
##########
File path: r/vignettes/dataset.Rmd
##########
@@ -259,47 +270,58 @@ See $.data for the source Arrow object
")
```
-This returns instantly and shows the manipulations you've made, without
+This code returns an output instantly and shows the manipulations you've made, without
loading data from the files. Because the evaluation of these queries is deferred,
you can build up a query that selects down to a small subset without generating
intermediate datasets that would potentially be large.
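For illustration, a minimal sketch of such a deferred query, assuming the partitioned NYC taxi files and columns used earlier in the vignette (the `"nyc-taxi"` path is taken from those examples):

```r
library(arrow)
library(dplyr)

# Assumes the partitioned NYC taxi files from earlier in the vignette
ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))

# Each verb only records a step in the query; no file data is read yet
q <- ds %>%
  filter(total_amount > 100, year == 2015) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  mutate(tip_pct = 100 * tip_amount / total_amount)

# Printing the query shows the planned manipulations, still without loading data
q
```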
Second, all work is pushed down to the individual data files,
and depending on the file format, chunks of data within the files. As a result,
-we can select a subset of data from a much larger dataset by collecting the
-smaller slices from each file--we don't have to load the whole dataset in memory
-in order to slice from it.
+you can select a subset of data from a much larger dataset by collecting the
+smaller slices from each file - you don't have to load the whole dataset in
+memory to slice from it.
-Third, because of partitioning, we can ignore some files entirely.
+Third, because of partitioning, you can ignore some files entirely.
In this example, by filtering `year == 2015`, all files corresponding to other years
-are immediately excluded: we don't have to load them in order to find that no
+are immediately excluded: you don't have to load them in order to find that no
rows match the filter. Relatedly, since Parquet files contain row groups with
-statistics on the data within, there may be entire chunks of data we can
+statistics on the data within, there may be entire chunks of data you can
avoid scanning because they have no rows where `total_amount > 100`.
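As a concrete sketch of what happens at scan time (same assumed `"nyc-taxi"` dataset as above), collecting a filtered query only opens the files under `year=2015`, and Parquet row groups whose statistics rule out `total_amount > 100` can be skipped:

```r
library(arrow)
library(dplyr)

ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))

# collect() triggers the actual read: only files for year == 2015 are scanned,
# and row groups whose statistics show no total_amount > 100 can be skipped
result <- ds %>%
  filter(year == 2015, total_amount > 100) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  collect()
```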
## More dataset options
There are a few ways you can control the Dataset creation to adapt to special use cases.
-For one, if you are working with a single file or a set of files that are not
-all in the same directory, you can provide a file path or a vector of multiple
-file paths to `open_dataset()`. This is useful if, for example, you have a
-single CSV file that is too big to read into memory. You could pass the file
-path to `open_dataset()`, use `group_by()` to partition the Dataset into
-manageable chunks, then use `write_dataset()` to write each chunk to a separate
-Parquet file---all without needing to read the full CSV file into R.
-
-You can specify a `schema` argument to `open_dataset()` to declare the columns
-and their data types. This is useful if you have data files that have different
-storage schema (for example, a column could be `int32` in one and `int8` in another)
-and you want to ensure that the resulting Dataset has a specific type.
-To be clear, it's not necessary to specify a schema, even in this example of
-mixed integer types, because the Dataset constructor will reconcile differences like these.
-The schema specification just lets you declare what you want the result to be.
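For illustration, a minimal sketch of passing a schema, with a hypothetical `"data-dir"` path and hypothetical column names:

```r
library(arrow)

# Hypothetical path and column names; the schema declares the types you want
# the resulting Dataset to have, rather than relying on what is inferred
ds <- open_dataset(
  "data-dir",
  schema = schema(id = int32(), value = float64(), category = utf8())
)
```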
+
+### Work with files in a directory
+
+If you are working with a single file or a set of files that are not all in the
+same directory, you can provide a file path or a vector of multiple file paths
+to `open_dataset()`. This is useful if, for example, you have a single CSV file
+that is too big to read into memory. You could pass the file path to
+`open_dataset()`, use `group_by()` to partition the Dataset into manageable chunks,
+then use `write_dataset()` to write each chunk to a separate Parquet file - all
Review comment:
```suggestion
then use `write_dataset()` to write each chunk to a separate Parquet file—all
```
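For illustration, a minimal sketch of the CSV-to-Parquet workflow described in that paragraph, assuming a hypothetical `large-file.csv` that contains a `year` column:

```r
library(arrow)
library(dplyr)

# Hypothetical: a CSV too large to read into memory, opened as a Dataset
big_csv <- open_dataset("large-file.csv", format = "csv")

# The grouping variable becomes the partitioning; write_dataset() streams each
# group to its own Parquet file without loading the full CSV into R
big_csv %>%
  group_by(year) %>%
  write_dataset("partitioned-data", format = "parquet")
```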
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]