bkietz commented on a change in pull request #8041:
URL: https://github.com/apache/arrow/pull/8041#discussion_r476555171
##########
File path: r/vignettes/dataset.Rmd
##########
@@ -281,3 +284,79 @@ this would mean you could point to an S3 bucked of Parquet data and a directory
of CSVs on the local file system and query them together as a single dataset.
To create a multi-source dataset, provide a list of datasets to `open_dataset()`
instead of a file path, or simply concatenate them like `big_dataset <- c(ds1, ds2)`.
+
+## Writing datasets
+
+As you can see, querying a large dataset can be quite fast, especially when it is stored in an efficient binary columnar format like Parquet or Feather and when it is partitioned into separate files based on the value of a column commonly used in filtering. However, we don't always get our data delivered to us that way. Sometimes we start with one giant CSV. Our first step in analyzing data is cleaning is up and reshaping it into a more usable form.
+
+The `write_dataset()` function allows you to take a Dataset or other tabular data object---an Arrow `Table` or `RecordBatch`, or an R `data.frame`---and write it to a different file format, partitioned into multiple files.
+
+Assume we have a version of the NYC Taxi data as CSV:
+
+```r
+ds <- open_dataset("nyc-taxi/csv/", format = "csv")
+```
+
+We can write it to a new location and translate the files to the Feather format
+by calling `write_dataset()` on it:
+
+```r
+write_dataset(ds, "nyc-taxi/feather", format = "feather")
+```
+
+Next, let's imagine that the "payment_type" column is something we often filter on,
+so we want to partition the day by that variable. By doing so, when we filter
+the resulting dataset on `payment_type == 3`, we would only have to look at the
+files that we know contain only rows where payment_type is 3.
Review comment:
```suggestion
so we want to partition the data by that variable. By doing so we ensure that a filter like
`payment_type == 3` will touch only a subset of files where payment_type is always 3.
```
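
To make the partitioning step discussed above concrete, here is a minimal sketch of how the partitioned write and a subsequent filtered read might look. It assumes the arrow R package's `write_dataset()` accepts a `partitioning` argument naming the column to split on and that `dplyr` verbs work on the opened dataset; the paths and the value `3` are carried over from the quoted text and are purely illustrative.

```r
library(arrow)
library(dplyr)

# Sketch: write Feather files partitioned by payment_type, so each distinct
# value of payment_type ends up in its own subdirectory of files.
write_dataset(ds, "nyc-taxi/feather", format = "feather",
              partitioning = "payment_type")

# A filter on the partition column should then only need to read the files
# in the matching subdirectory when the query is collected.
open_dataset("nyc-taxi/feather", format = "feather") %>%
  filter(payment_type == 3) %>%
  collect()
```

If this matches the vignette's intent, the suggested wording ("touch only a subset of files") describes exactly this behavior: the scan can skip directories whose payment_type value does not satisfy the filter.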
##########
File path: r/vignettes/dataset.Rmd
##########
@@ -281,3 +284,79 @@ this would mean you could point to an S3 bucked of Parquet data and a directory
of CSVs on the local file system and query them together as a single dataset.
To create a multi-source dataset, provide a list of datasets to `open_dataset()`
instead of a file path, or simply concatenate them like `big_dataset <- c(ds1, ds2)`.
+
+## Writing datasets
+
+As you can see, querying a large dataset can be quite fast, especially when it is stored in an efficient binary columnar format like Parquet or Feather and when it is partitioned into separate files based on the value of a column commonly used in filtering. However, we don't always get our data delivered to us that way. Sometimes we start with one giant CSV. Our first step in analyzing data is cleaning is up and reshaping it into a more usable form.
Review comment:
```suggestion
As you can see, querying a large dataset can be made quite fast by storing it in an
efficient binary columnar format like Parquet or Feather and by partitioning it based on
columns commonly used for filtering. However, we don't always get our data delivered
to us that way. Sometimes we start with one giant CSV. Our first step in analyzing data
is cleaning it up and reshaping it into a more usable form.
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]