jonkeane commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r678290277



##########
File path: r/vignettes/dataset.Rmd
##########
@@ -8,10 +8,10 @@ vignette: >
 ---
 
 Apache Arrow lets you work efficiently with large, multi-file datasets.
-The `arrow` R package provides a `dplyr` interface to Arrow Datasets,

Review comment:
       Something we've discussed elsewhere (I remember this came up when working on a blog post, for example) is that we should come up with a style for naming packages like this. I've come around to liking `{pkg}`, since at least in the R community it's now clear and unambiguous what that means. It looks like the tidyverse doesn't use anything (or adds " package" at the end most of the time, if we consider that a marker), so maybe we should follow that lead (like you did here). Or we could stick with backticks.
   
   I don't have super strong opinions, but would be happy to pick one and go with it.

##########
File path: r/vignettes/dataset.Rmd
##########
@@ -77,39 +79,44 @@ feel free to grab only a year or two of data.
 
 If you don't have the taxi data downloaded, the vignette will still run and will
 yield previously cached output for reference. To be explicit about which version
-is running, let's check whether we're running with live data:
+is running, let's check whether you're running with live data:
 
 ```{r}
 dir.exists("nyc-taxi")
 ```
 
-## Getting started
+## Opening the dataset
 
-Because `dplyr` is not necessary for many Arrow workflows,
+Because dplyr is not necessary for many Arrow workflows,
 it is an optional (`Suggests`) dependency. So, to work with Datasets,
-we need to load both `arrow` and `dplyr`.
+you need to load both arrow and dplyr.
 
 ```{r}
 library(arrow, warn.conflicts = FALSE)
 library(dplyr, warn.conflicts = FALSE)
 ```
 
-The first step is to create our Dataset object, pointing at the directory of data.
+The first step is to create a Dataset object, pointing at the directory of data.
 
 ```{r, eval = file.exists("nyc-taxi")}
 ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
 ```
 
-The default file format for `open_dataset()` is Parquet; if we had a directory
-of Arrow format files, we could include `format = "arrow"` in the call.
-Other supported formats include: `"feather"` (an alias for `"arrow"`, as Feather
-v2 is the Arrow file format), `"csv"`, `"tsv"` (for tab-delimited), and `"text"`
-for generic text-delimited files. For text files, you can pass any parsing
-options (`delim`, `quote`, etc.) to `open_dataset()` that you would otherwise
-pass to `read_csv_arrow()`.
+The file format for `open_dataset()` is controlled by the `format` parameter, 
+which has a default value of `"parquet"`.  If you had a directory
+of Arrow format files, you could instead specify `format = "arrow"` in the call.
+
+Other supported formats include: 
+
+* `"feather"` or `"ipc"` (aliases for `"arrow"`, as Feather v2 is the Arrow 
file format)
+* `"csv"` (comma-delimited files) and `"tsv"` (tab-delimited files)
+* `"text"` (generic text-delimited files - use the `delimiter` argument to 
specify which to use)
 
-The `partitioning` argument lets us specify how the file paths provide information
-about how the dataset is chunked into different files. Our files in this example
+For text files, you can pass any parsing options (`delim`, `quote`, etc.) to 
+`open_dataset()` that you would otherwise pass to `read_csv_arrow()`.

Review comment:
       I could be misremembering (or remembering the issue but not that it was resolved), but I thought there were _some_ options that weren't compatible with the dataset version of CSV reading. We don't have to list them here, but if that is true and those caveats are documented elsewhere (like the docs for `read_csv_arrow()` or `open_dataset()`), a link to wherever they are documented would be nice.
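   
   For concreteness, this is roughly the usage the vignette text describes (the "raw-csv" directory and the pipe delimiter here are made up for illustration; the point is just that readr-style options are forwarded through `open_dataset()`):
   
   ```r
   library(arrow)
   # Hypothetical directory of pipe-delimited text files: parsing options
   # such as `delim` are passed through open_dataset() to the text reader,
   # the same way they would be passed to read_csv_arrow().
   ds <- open_dataset("raw-csv", format = "text", delim = "|")
   ```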

##########
File path: r/vignettes/dataset.Rmd
##########
@@ -20,34 +20,36 @@ and what is on the immediate development roadmap.
 The [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
 is widely used in big data exercises and competitions.
 For demonstration purposes, we have hosted a Parquet-formatted version
-of about 10 years of the trip data in a public Amazon S3 bucket.
+of about ten years of the trip data in a public Amazon S3 bucket.
 
 The total file size is around 37 gigabytes, even in the efficient Parquet file
-format. That's bigger than memory on most people's computers, so we can't just
+format. That's bigger than memory on most people's computers, so you can't just
 read it all in and stack it into a single data frame.
 
-In Windows and macOS binary packages, S3 support is included.
-On Linux when installing from source, S3 support is not enabled by default,
+In Windows (for R > 3.6) and macOS binary packages, S3 support is included.
+On Linux, when installing from source, S3 support is not enabled by default,
 and it has additional system requirements.
 See `vignette("install", package = "arrow")` for details.
-To see if your `arrow` installation has S3 support, run
+To see if your arrow installation has S3 support, run:
 
 ```{r}
 arrow::arrow_with_s3()
 ```
 
-Even with S3 support enabled network, speed will be a bottleneck unless your
+Even with S3 support enabled, network speed will be a bottleneck unless your
 machine is located in the same AWS region as the data. So, for this vignette,
 we assume that the NYC taxi dataset has been downloaded locally in a "nyc-taxi"

Review comment:
       ```suggestion
  we assume that the NYC taxi dataset has been downloaded locally in an "nyc-taxi"
   ```

##########
File path: r/vignettes/dataset.Rmd
##########
@@ -118,12 +125,12 @@ have file paths like
 ...
 ```
 
-By providing a character vector to `partitioning`, we're saying that the first
-path segment gives the value for `year` and the second segment is `month`.
+By providing a character vector to `partitioning`, you're saying that the first
+path segment gives the value for `year`, and the second segment is `month`.
 Every row in `2009/01/data.parquet` has a value of 2009 for `year`
-and 1 for `month`, even though those columns may not actually be present in the file.
+and 1 for `month`, even though those columns may not be present in the file.
 
-Indeed, when we look at the dataset, we see that in addition to the columns present
+Indeed, when you look at the dataset, you can see that in addition to the columns present
 in every file, there are also columns `year` and `month`.

Review comment:
       This might be beating a dead horse, but maybe it would be good to repeat "even though they are not present in the files themselves" here too?

##########
File path: r/vignettes/dataset.Rmd
##########
@@ -159,37 +166,37 @@ See $metadata for additional Schema metadata
 
 The other form of partitioning currently supported is [Hive](https://hive.apache.org/)-style,
 in which the partition variable names are included in the path segments.
-If we had saved our files in paths like
+If you had saved your files in paths like:
 
 ```
 year=2009/month=01/data.parquet
 year=2009/month=02/data.parquet
 ...
 ```
 
-we would not have had to provide the names in `partitioning`:
-we could have just called `ds <- open_dataset("nyc-taxi")` and the partitions
+you would not have had to provide the names in `partitioning`;
+you could have just called `ds <- open_dataset("nyc-taxi")` and the partitions
 would have been detected automatically.
 
 ## Querying the dataset
 
-Up to this point, we haven't loaded any data: we have walked directories to find
-files, we've parsed file paths to identify partitions, and we've read the
-headers of the Parquet files to inspect their schemas so that we can make sure
-they all line up.
+Up to this point, you haven't loaded any data.  You've walked directories to find
+files, you've parsed file paths to identify partitions, and you've read the
+headers of the Parquet files to inspect their schemas so that you can make sure
+they all are as expected.
 
-In the current release, `arrow` supports the dplyr verbs `mutate()`, 
+In the current release, arrow supports the dplyr verbs `mutate()`, 
 `transmute()`, `select()`, `rename()`, `relocate()`, `filter()`, and 
 `arrange()`. Aggregation is not yet supported, so before you call `summarise()`
 or other verbs with aggregate functions, use `collect()` to pull the selected
 subset of the data into an in-memory R data frame.
 
-If you attempt to call unsupported `dplyr` verbs or unimplemented functions in
-your query on an Arrow Dataset, the `arrow` package raises an error. However,
-for `dplyr` queries on `Table` objects (which are typically smaller in size) the
-package automatically calls `collect()` before processing that `dplyr` verb.
+Suppose you attempt to call unsupported dplyr verbs or unimplemented functions
+in your query on an Arrow Dataset. In that case, the arrow package raises an error. However,
+for dplyr queries on Arrow Table objects (typically smaller in size than Datasets), the
+package automatically calls `collect()` before processing that dplyr verb.
 
-Here's an example. Suppose I was curious about tipping behavior among the
+Here's an example. Suppose that you are curious about tipping behavior among the

Review comment:
       ```suggestion
  Here's an example: Suppose that you are curious about tipping behavior among the
  ```
  
  Minor and stylistic; feel free to disregard.
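   
   For context, this is the shape of query that paragraph sets up (a sketch using column names from the NYC taxi schema the vignette uses, restricted to the supported verbs listed above and ending in `collect()`):
   
   ```r
   library(arrow, warn.conflicts = FALSE)
   library(dplyr, warn.conflicts = FALSE)
   
   ds %>%
     filter(total_amount > 100, year == 2015) %>%           # pushed down to the dataset
     select(tip_amount, total_amount, passenger_count) %>%
     mutate(tip_pct = 100 * tip_amount / total_amount) %>%
     collect()                                               # pull the subset into R
   ```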

##########
File path: r/vignettes/dataset.Rmd
##########
@@ -313,27 +330,29 @@ instead of a file path, or simply concatenate them like `big_dataset <- c(ds1, d
 
 As you can see, querying a large dataset can be made quite fast by storage in an
 efficient binary columnar format like Parquet or Feather and partitioning based on
-columns commonly used for filtering. However, we don't always get our data delivered
-to us that way. Sometimes we start with one giant CSV. Our first step in analyzing data
+columns commonly used for filtering. However, data isn't always stored that way.
+Sometimes you might start with one giant CSV. The first step in analyzing data 
 is cleaning it up and reshaping it into a more usable form.
 
-The `write_dataset()` function allows you to take a Dataset or other tabular data object---an Arrow `Table` or `RecordBatch`, or an R `data.frame`---and write it to a different file format, partitioned into multiple files.
+The `write_dataset()` function allows you to take a Dataset or another tabular 
+data object - an Arrow Table or RecordBatch, or an R data frame - and write

Review comment:
       ```suggestion
   data object - an Arrow Table or RecordBatch, or an R data.frame - and write
   ```
   
   This is another one we should probably standardize on. This isn't perfect, but as a metric: I see 47 results for `data.frame` in arrow/r/man but only 5 results for `data frame`.
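   
   Unrelated to the naming question: for reference, the call under discussion looks something like the sketch below (the "nyc-taxi-feather" output directory is hypothetical):
   
   ```r
   library(arrow)
   # Re-partition the dataset and write it back out in Feather (Arrow) format,
   # with one directory per year/month combination.
   write_dataset(ds, "nyc-taxi-feather",
                 format = "feather",
                 partitioning = c("year", "month"))
   ```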

##########
File path: r/vignettes/dataset.Rmd
##########
@@ -118,12 +125,12 @@ have file paths like
 ...
 ```
 
-By providing a character vector to `partitioning`, we're saying that the first
-path segment gives the value for `year` and the second segment is `month`.
+By providing a character vector to `partitioning`, you're saying that the first

Review comment:
       I wonder if it might be clearer / easier to wade through if we use `c("year", "month")` instead of "a character vector"? That way the values are right here, and it will be obvious to many R users what that is, even if they don't have "character vector" in their vocabulary.



