nealrichardson commented on a change in pull request #10765:
URL: https://github.com/apache/arrow/pull/10765#discussion_r682603777
##########
File path: r/STYLE.md
##########
@@ -0,0 +1,38 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements. See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership. The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License. You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied. See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Style
+
+This is a style guide to writing documentation for arrow.
+
+## Coding style
+
+Please use the [tidyverse coding style](https://style.tidyverse.org/).
+
+## Referring to external packages
+
+When referring to external packages, include a link to the package at the first mention, and subsequently refer to it in plain text, e.g.
+
+* "The arrow R package provides a [dplyr](https://dplyr.tidyverse.org/) interface to Arrow Datasets. This vignette introduces Datasets and shows how to use dplyr to analyze them."
+
+## Data frames
+
+When referring to the concept, use the phrase "data frame", whereas when referring to an object of that class or when the class is important, write `data.frame`, e.g.
+
+* "You can call `write_dataset()` on tabular data objects such as Arrow Tables or RecordBatchs, or R data frames. If working with data frames you might want to use a `tibble` instead of a `data.frame` to take advantage of the default behaviour of partitioning data based on grouped variables."

Review comment (on "When referring to external packages, ..."):
```suggestion
When referring to external packages in documentation, include a link to the package at the first mention, and subsequently refer to it in plain text, e.g.
```

Review comment (on the `write_dataset()` example bullet):
```suggestion
* "You can call `write_dataset()` on tabular data objects such as Arrow Tables or RecordBatches, or R data frames. If working with data frames you might want to use a `tibble` instead of a `data.frame` to take advantage of the default behaviour of partitioning data based on grouped variables."
```
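For context, a minimal sketch of the default behaviour that bullet describes, assuming arrow and dplyr are attached; the output path `mtcars_by_cyl` is an arbitrary choice for illustration:

```r
library(arrow)
library(dplyr)

# Grouping variables become the default partition keys when no
# `partitioning` argument is passed to write_dataset().
mtcars %>%
  group_by(cyl) %>%
  write_dataset("mtcars_by_cyl")  # creates cyl=4/, cyl=6/, cyl=8/ subdirectories
```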
##########
File path: r/vignettes/dataset.Rmd
##########
@@ -159,37 +171,37 @@ See $metadata for additional Schema metadata
 The other form of partitioning currently supported is
 [Hive](https://hive.apache.org/)-style, in which the partition variable names
 are included in the path segments.
-If we had saved our files in paths like
+If you had saved your files in paths like:

 ```
 year=2009/month=01/data.parquet
 year=2009/month=02/data.parquet
 ...
 ```

-we would not have had to provide the names in `partitioning`:
-we could have just called `ds <- open_dataset("nyc-taxi")` and the partitions
+you would not have had to provide the names in `partitioning`;
+you could have just called `ds <- open_dataset("nyc-taxi")` and the partitions
 would have been detected automatically.

 ## Querying the dataset

-Up to this point, we haven't loaded any data: we have walked directories to find
-files, we've parsed file paths to identify partitions, and we've read the
-headers of the Parquet files to inspect their schemas so that we can make sure
-they all line up.
+Up to this point, you haven't loaded any data. You've walked directories to find
+files, you've parsed file paths to identify partitions, and you've read the
+headers of the Parquet files to inspect their schemas so that you can make sure
+they all are as expected.

-In the current release, `arrow` supports the dplyr verbs `mutate()`,
+In the current release, arrow supports the dplyr verbs `mutate()`,
 `transmute()`, `select()`, `rename()`, `relocate()`, `filter()`, and
 `arrange()`. Aggregation is not yet supported, so before you call `summarise()`
 or other verbs with aggregate functions, use `collect()` to pull the selected
 subset of the data into an in-memory R data frame.

-If you attempt to call unsupported `dplyr` verbs or unimplemented functions in
-your query on an Arrow Dataset, the `arrow` package raises an error. However,
-for `dplyr` queries on `Table` objects (which are typically smaller in size) the
-package automatically calls `collect()` before processing that `dplyr` verb.
+Suppose you attempt to call unsupported dplyr verbs or unimplemented functions
+in your query on an Arrow Dataset. In that case, the arrow package raises an error. However,
+for dplyr queries on Arrow Table objects (typically smaller in size than Datasets), the

Review comment (on "Up to this point, ..."):
```suggestion
Up to this point, you haven't loaded any data. You've walked directories to find
```

Review comment (on the Arrow Table parenthetical):
```suggestion
for dplyr queries on Arrow Table objects (which are already in memory), the
```
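A minimal sketch of the querying workflow this hunk describes, assuming the Hive-partitioned `nyc-taxi` directory from the vignette and its usual columns:

```r
library(arrow)
library(dplyr)

# Hive-style paths (year=2009/month=01/...) let open_dataset() detect
# the partitions without a `partitioning` argument.
ds <- open_dataset("nyc-taxi")

ds %>%
  filter(year == 2015, total_amount > 100) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  mutate(tip_pct = 100 * tip_amount / total_amount) %>%
  collect() %>%                  # aggregation isn't pushed down yet,
  group_by(passenger_count) %>%  # so collect() first, then summarise()
  summarise(mean_tip_pct = mean(tip_pct))
```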
##########
File path: r/vignettes/dataset.Rmd
##########
@@ -228,12 +240,11 @@ cat("
 ")
 ```

-We just selected a subset out of a dataset with around 2 billion rows, computed
-a new column, and aggregated on it in under 2 seconds on my laptop. How does
+You've just selected a subset out of a dataset with around 2 billion rows, computed
+a new column, and aggregated it in under 2 seconds on most modern laptops. How does

Review comment:
```suggestion
a new column, and aggregated it in under 2 seconds on a modern laptop. How does
```

##########
File path: r/vignettes/dataset.Rmd
##########
@@ -259,47 +270,58 @@ See $.data for the source Arrow object
 ")
 ```

-This returns instantly and shows the manipulations you've made, without
+This code returns an output instantly and shows the manipulations you've made, without
 loading data from the files. Because the evaluation of these queries is
 deferred, you can build up a query that selects down to a small subset without
 generating intermediate datasets that would potentially be large.

 Second, all work is pushed down to the individual data files,
 and depending on the file format, chunks of data within the files. As a result,
-we can select a subset of data from a much larger dataset by collecting the
-smaller slices from each file--we don't have to load the whole dataset in memory
-in order to slice from it.
+you can select a subset of data from a much larger dataset by collecting the
+smaller slices from each file - you don't have to load the whole dataset in
+memory to slice from it.

-Third, because of partitioning, we can ignore some files entirely.
+Third, because of partitioning, you can ignore some files entirely.
 In this example, by filtering `year == 2015`, all files corresponding to other years
-are immediately excluded: we don't have to load them in order to find that no
+are immediately excluded: you don't have to load them in order to find that no
 rows match the filter. Relatedly, since Parquet files contain row groups with
-statistics on the data within, there may be entire chunks of data we can
+statistics on the data within, there may be entire chunks of data you can
 avoid scanning because they have no rows where `total_amount > 100`.

 ## More dataset options

 There are a few ways you can control the Dataset creation to adapt to special use cases.
-For one, if you are working with a single file or a set of files that are not
-all in the same directory, you can provide a file path or a vector of multiple
-file paths to `open_dataset()`. This is useful if, for example, you have a
-single CSV file that is too big to read into memory. You could pass the file
-path to `open_dataset()`, use `group_by()` to partition the Dataset into
-manageable chunks, then use `write_dataset()` to write each chunk to a separate
-Parquet file---all without needing to read the full CSV file into R.
-
-You can specify a `schema` argument to `open_dataset()` to declare the columns
-and their data types. This is useful if you have data files that have different
-storage schema (for example, a column could be `int32` in one and `int8` in another)
-and you want to ensure that the resulting Dataset has a specific type.
-To be clear, it's not necessary to specify a schema, even in this example of
-mixed integer types, because the Dataset constructor will reconcile differences like these.
-The schema specification just lets you declare what you want the result to be.
+
+### Work with files in a directory
+
+If you are working with a single file or a set of files that are not all in the
+same directory, you can provide a file path or a vector of multiple file paths
+to `open_dataset()`. This is useful if, for example, you have a single CSV file
+that is too big to read into memory. You could pass the file path to
+`open_dataset()`, use `group_by()` to partition the Dataset into manageable chunks,
+then use `write_dataset()` to write each chunk to a separate Parquet file - all

Review comment (on "smaller slices from each file - ..."): em-dash
```suggestion
smaller slices from each file—you don't have to load the whole dataset in
```

Review comment (on "... separate Parquet file - all"):
```suggestion
then use `write_dataset()` to write each chunk to a separate Parquet file—all
```
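A minimal sketch of that single-big-CSV workflow; the file name `huge.csv` and the `payment_type` column are stand-ins for illustration:

```r
library(arrow)
library(dplyr)

# Open one large CSV lazily as a Dataset instead of reading it into memory.
big_csv <- open_dataset("huge.csv", format = "csv")

# Stream it out as Parquet, one partition directory per payment_type value,
# without ever loading the full CSV into R.
big_csv %>%
  group_by(payment_type) %>%
  write_dataset("taxi_parquet", format = "parquet")
```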
##########
File path: r/vignettes/dataset.Rmd
##########
@@ -313,27 +330,29 @@ instead of a file path, or simply concatenate them like `big_dataset <- c(ds1, d
 As you can see, querying a large dataset can be made quite fast by storage in an
 efficient binary columnar format like Parquet or Feather and partitioning based on
-columns commonly used for filtering. However, we don't always get our data delivered
-to us that way. Sometimes we start with one giant CSV. Our first step in analyzing data
+columns commonly used for filtering. However, data isn't always stored that way.
+Sometimes you might start with one giant CSV. The first step in analyzing data
 is cleaning is up and reshaping it into a more usable form.

-The `write_dataset()` function allows you to take a Dataset or other tabular data object---an Arrow `Table` or `RecordBatch`, or an R `data.frame`---and write it to a different file format, partitioned into multiple files.
+The `write_dataset()` function allows you to take a Dataset or another tabular
+data object - an Arrow Table or RecordBatch, or an R data frame - and write

Review comment: more em-dashes
```suggestion
data object—an Arrow Table or RecordBatch, or an R data frame—and write
```
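A minimal sketch of such a conversion, taking an in-memory Arrow Table to a differently formatted, partitioned dataset; the output path and the choice of `gear` as the partition column are hypothetical:

```r
library(arrow)

# Build an Arrow Table from a data frame, then write it out as a
# Feather dataset split into one directory per value of `gear`.
tab <- Table$create(mtcars)
write_dataset(tab, "mtcars_feather", format = "feather", partitioning = "gear")
```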
