bkietz commented on a change in pull request #8041:
URL: https://github.com/apache/arrow/pull/8041#discussion_r476555171
##########
File path: r/vignettes/dataset.Rmd
##########
@@ -281,3 +284,79 @@ this would mean you could point to an S3 bucked of Parquet data and a directory
of CSVs on the local file system and query them together as a single dataset.
To create a multi-source dataset, provide a list of datasets to `open_dataset()`
instead of a file path, or simply concatenate them like `big_dataset <- c(ds1, ds2)`.
+
+## Writing datasets
+
+As you can see, querying a large dataset can be quite fast, especially when it is stored in an efficient binary columnar format like Parquet or Feather and when it is partitioned into separate files based on the value of a column commonly used in filtering. However, we don't always get our data delivered to us that way. Sometimes we start with one giant CSV. Our first step in analyzing data is cleaning is up and reshaping it into a more usable form.
+
+The `write_dataset()` function allows you to take a Dataset or other tabular data object---an Arrow `Table` or `RecordBatch`, or an R `data.frame`---and write it to a different file format, partitioned into multiple files.
+
+Assume we have a version of the NYC Taxi data as CSV:
+
+```r
+ds <- open_dataset("nyc-taxi/csv/", format = "csv")
+```
+
+We can write it to a new location and translate the files to the Feather format
+by calling `write_dataset()` on it:
+
+```r
+write_dataset(ds, "nyc-taxi/feather", format = "feather")
+```
+
+Next, let's imagine that the "payment_type" column is something we often filter on,
+so we want to partition the day by that variable. By doing so, when we filter
+the resulting dataset on `payment_type == 3`, we would only have to look at the
+files that we know contain only rows where payment_type is 3.
Review comment:
```suggestion
so we want to partition the data by that variable. By doing so we ensure that a filter like
`payment_type == 3` will touch only a subset of files where payment_type is always 3.
```
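
To make the partitioning step discussed above concrete, here is a minimal sketch of how the partitioned write and a subsequent filtered read might look. It assumes the arrow R package's `write_dataset()` accepts a `partitioning` argument naming the column to split on and that `dplyr` verbs work on the opened dataset; the paths and the value `3` are carried over from the quoted text and are purely illustrative.

```r
library(arrow)
library(dplyr)

# Sketch: write Feather files partitioned by payment_type, so each distinct
# value of payment_type ends up in its own subdirectory of files.
write_dataset(ds, "nyc-taxi/feather", format = "feather",
              partitioning = "payment_type")

# A filter on the partition column should then only need to read the files
# in the matching subdirectory when the query is collected.
open_dataset("nyc-taxi/feather", format = "feather") %>%
  filter(payment_type == 3) %>%
  collect()
```

If this matches the vignette's intent, the suggested wording ("touch only a subset of files") describes exactly this behavior: the scan can skip directories whose payment_type value does not satisfy the filter.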
##########
File path: r/vignettes/dataset.Rmd
##########
@@ -281,3 +284,79 @@ this would mean you could point to an S3 bucked of Parquet data and a directory
of CSVs on the local file system and query them together as a single dataset.
To create a multi-source dataset, provide a list of datasets to `open_dataset()`
instead of a file path, or simply concatenate them like `big_dataset <- c(ds1, ds2)`.
+
+## Writing datasets
+
+As you can see, querying a large dataset can be quite fast, especially when it is stored in an efficient binary columnar format like Parquet or Feather and when it is partitioned into separate files based on the value of a column commonly used in filtering. However, we don't always get our data delivered to us that way. Sometimes we start with one giant CSV. Our first step in analyzing data is cleaning is up and reshaping it into a more usable form.
Review comment:
```suggestion
As you can see, querying a large dataset can be made quite fast by storing it in an
efficient binary columnar format like Parquet or Feather and by partitioning it based on
columns commonly used for filtering. However, we don't always get our data delivered
to us that way. Sometimes we start with one giant CSV. Our first step in analyzing data
is cleaning it up and reshaping it into a more usable form.
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]