thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1022660802


##########
r/vignettes/dataset.Rmd:
##########
@@ -1,157 +1,95 @@
 ---
-title: "Working with Arrow Datasets and dplyr"
+title: "Working with multi-file data sets"
+description: >
+  Learn how to use Datasets to read, write, and analyze 
+  multi-file larger-than-memory data
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Working with Arrow Datasets and dplyr}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-Apache Arrow lets you work efficiently with large, multi-file datasets.
-The arrow R package provides a [dplyr](https://dplyr.tidyverse.org/) interface 
to Arrow Datasets,
-and other tools for interactive exploration of Arrow data.
-
-This vignette introduces Datasets and shows how to use dplyr to analyze them.
-
-## Example: NYC taxi data
-
-The [New York City taxi trip record 
data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
-is widely used in big data exercises and competitions.
-For demonstration purposes, we have hosted a Parquet-formatted version
-of about ten years of the trip data in a public Amazon S3 bucket.
-
-The total file size is around 37 gigabytes, even in the efficient Parquet file
-format. That's bigger than memory on most people's computers, so you can't just
-read it all in and stack it into a single data frame.
-
-In Windows and macOS binary packages, S3 support is included.
-On Linux, when installing from source, S3 support is not enabled by default,
-and it has additional system requirements.
-See `vignette("install", package = "arrow")` for details.
-To see if your arrow installation has S3 support, run:
+Apache Arrow lets you work efficiently with multi-file data sets, even when 
the data is too large to be loaded into memory. With the help of Arrow 
Dataset objects, you can analyze this kind of data using familiar 
[`dplyr`](https://dplyr.tidyverse.org/) syntax. This article introduces 
Datasets and shows you how to analyze them with `dplyr` and `arrow`: we'll 
start by ensuring both packages are loaded:
 
 ```{r}
-arrow::arrow_with_s3()
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-Even with S3 support enabled, network speed will be a bottleneck unless your
-machine is located in the same AWS region as the data. So, for this vignette,
-we assume that the NYC taxi dataset has been downloaded locally in an 
"nyc-taxi"
-directory.
+## Example: NYC taxi data
 
-### Retrieving data from a public Amazon S3 bucket
+The primary motivation for multi-file Datasets is to allow users to analyze 
extremely large datasets. As an example, consider the [New York City taxi trip 
record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) 
that is widely used in big data exercises and competitions. To demonstrate the 
capabilities of Apache Arrow, we host a Parquet-formatted version of this data in a 
public Amazon S3 bucket: in its full form, our version of the data set is one 
very large table with about 1.7 billion rows and 24 columns, where each row 
corresponds to a single taxi ride sometime between 2009 and 2022. A [data 
dictionary](https://arrow-user2022.netlify.app/packages-and-data.html#data) for 
this version of the NYC taxi data is also available. 
 
-If your arrow build has S3 support, you can sync the data locally with:
+This data set comprises 158 distinct Parquet files, each corresponding 
to a month of data. A single file is typically around 400-500MB in size, and 
the full data set is about 70GB in size. It is not a small data set -- it is 
slow to download and does not fit in memory on a typical machine 🙂  -- so we 
also host a "tiny" version of the NYC taxi data that is formatted in exactly 
the same way but includes only one out of every thousand entries in the 
original data set (i.e., individual files are <1MB in size, and the "tiny" data 
set is only 70MB).

Review Comment:
   This change makes sense - same dataset but smaller version. 
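
   For anyone trying this out, a quick sanity check might look like the sketch 
   below (a sketch only: it assumes the tiny data set has already been copied 
   into a local folder, and the `"nyc-taxi-tiny"` path is illustrative rather 
   than taken from the diff):
   
   ```r
   library(arrow, warn.conflicts = FALSE)
   
   # hypothetical local path -- assumes the tiny data set was downloaded here
   ds_tiny <- open_dataset("nyc-taxi-tiny")
   
   # same schema and partitioning as the full data, just ~1/1000 of the rows
   nrow(ds_tiny)
   ```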



##########
r/vignettes/dataset.Rmd:
##########
@@ -1,157 +1,95 @@
 ---
-title: "Working with Arrow Datasets and dplyr"
+title: "Working with multi-file data sets"
+description: >
+  Learn how to use Datasets to read, write, and analyze 
+  multi-file larger-than-memory data
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Working with Arrow Datasets and dplyr}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-Apache Arrow lets you work efficiently with large, multi-file datasets.
-The arrow R package provides a [dplyr](https://dplyr.tidyverse.org/) interface 
to Arrow Datasets,
-and other tools for interactive exploration of Arrow data.
-
-This vignette introduces Datasets and shows how to use dplyr to analyze them.
-
-## Example: NYC taxi data
-
-The [New York City taxi trip record 
data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
-is widely used in big data exercises and competitions.
-For demonstration purposes, we have hosted a Parquet-formatted version
-of about ten years of the trip data in a public Amazon S3 bucket.
-
-The total file size is around 37 gigabytes, even in the efficient Parquet file
-format. That's bigger than memory on most people's computers, so you can't just
-read it all in and stack it into a single data frame.
-
-In Windows and macOS binary packages, S3 support is included.
-On Linux, when installing from source, S3 support is not enabled by default,
-and it has additional system requirements.
-See `vignette("install", package = "arrow")` for details.
-To see if your arrow installation has S3 support, run:
+Apache Arrow lets you work efficiently with multi-file data sets, even when 
the data is too large to be loaded into memory. With the help of Arrow 
Dataset objects, you can analyze this kind of data using familiar 
[`dplyr`](https://dplyr.tidyverse.org/) syntax. This article introduces 
Datasets and shows you how to analyze them with `dplyr` and `arrow`: we'll 
start by ensuring both packages are loaded:
 
 ```{r}
-arrow::arrow_with_s3()
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-Even with S3 support enabled, network speed will be a bottleneck unless your
-machine is located in the same AWS region as the data. So, for this vignette,
-we assume that the NYC taxi dataset has been downloaded locally in an 
"nyc-taxi"
-directory.
+## Example: NYC taxi data
 
-### Retrieving data from a public Amazon S3 bucket
+The primary motivation for multi-file Datasets is to allow users to analyze 
extremely large datasets. As an example, consider the [New York City taxi trip 
record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) 
that is widely used in big data exercises and competitions. To demonstrate the 
capabilities of Apache Arrow, we host a Parquet-formatted version of this data in a 
public Amazon S3 bucket: in its full form, our version of the data set is one 
very large table with about 1.7 billion rows and 24 columns, where each row 
corresponds to a single taxi ride sometime between 2009 and 2022. A [data 
dictionary](https://arrow-user2022.netlify.app/packages-and-data.html#data) for 
this version of the NYC taxi data is also available. 
 
-If your arrow build has S3 support, you can sync the data locally with:
+This data set comprises 158 distinct Parquet files, each corresponding 
to a month of data. A single file is typically around 400-500MB in size, and 
the full data set is about 70GB in size. It is not a small data set -- it is 
slow to download and does not fit in memory on a typical machine 🙂  -- so we 
also host a "tiny" version of the NYC taxi data that is formatted in exactly 
the same way but includes only one out of every thousand entries in the 
original data set (i.e., individual files are <1MB in size, and the "tiny" data 
set is only 70MB).
 
-```{r, eval = FALSE}
-arrow::copy_files("s3://voltrondata-labs-datasets/nyc-taxi", "nyc-taxi")
-# Alternatively, with GCS:
-arrow::copy_files("gs://voltrondata-labs-datasets/nyc-taxi", "nyc-taxi")
-```
+If you have Amazon S3 and/or Google Cloud Storage support enabled in `arrow` 
(true for most users; see links at the end of this article if you need to 
troubleshoot this), you can connect to the "tiny taxi data" with either of the 
following commands:

Review Comment:
   Why either of these commands, what's the difference between the two?
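
   For context, both calls appear to point at the same copy of the data hosted 
   on two different services -- `s3_bucket()` for Amazon S3 and `gs_bucket()` 
   for Google Cloud Storage -- so presumably a reader picks whichever backend 
   works for them. A rough sketch of how either handle might then be used 
   (bucket names as in the diff, everything else assumed):
   
   ```r
   library(arrow, warn.conflicts = FALSE)
   
   # same data, two hosting services: use one or the other, not both
   bucket <- s3_bucket("voltrondata-labs-datasets/nyc-taxi-tiny")
   # bucket <- gs_bucket("voltrondata-labs-datasets/nyc-taxi-tiny", anonymous = TRUE)
   
   # either handle can then be passed on, e.g. to copy the files locally
   copy_files(from = bucket, to = "nyc-taxi")
   ```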



##########
r/vignettes/dataset.Rmd:
##########
@@ -186,34 +124,9 @@ month: int32
 ")
 ```
 
-## Querying the dataset
-
-Up to this point, you haven't loaded any data. You've walked directories to 
find
-files, you've parsed file paths to identify partitions, and you've read the
-headers of the Parquet files to inspect their schemas so that you can make sure
-they all are as expected.
+## Querying Datasets
 
-In the current release, arrow supports the dplyr verbs:
-
- * `mutate()` and `transmute()`,
- * `select()`, `rename()`, and `relocate()`,
- * `filter()`,
- * `arrange()`,
- * `union()` and `union_all()`,
- * `left_join()`, `right_join()`, `full_join()`, `inner_join()`, and 
`anti_join()`,
- * `group_by()` and `summarise()`.
-
-At any point in a chain, you can use `collect()` to pull the selected subset of
-the data into an in-memory R data frame. 
-
-Suppose you attempt to call unsupported dplyr verbs or unimplemented functions
-in your query on an Arrow Dataset. In that case, the arrow package raises an 
error. However,
-for dplyr queries on Arrow Table objects (which are already in memory), the
-package automatically calls `collect()` before processing that dplyr verb.
-
-Here's an example: suppose that you are curious about tipping behavior among 
the
-longest taxi rides. Let's find the median tip percentage for rides with
-fares greater than $100 in 2015, broken down by the number of passengers:
+Now that we have a Dataset object that refers to out data, we can construct 
`dplyr`-style queries. This is possible because `arrow` supplies a back end 
that allows users to manipulate tabular Arrow data using `dplyr` verbs. Here's 
an example: suppose you are curious about tipping behavior in the longest taxi 
rides. Let's find the median tip percentage for rides with fares greater than 
$100 in 2015, broken down by the number of passengers:

Review Comment:
   ```suggestion
   Now that we have a Dataset object that refers to our data, we can construct 
`dplyr`-style queries. This is possible because `arrow` supplies a back end 
that allows users to manipulate tabular Arrow data using `dplyr` verbs. Here's 
an example: suppose you are curious about tipping behavior in the longest taxi 
rides. Let's find the median tip percentage for rides with fares greater than 
$100 in 2015, broken down by the number of passengers:
   ```
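
   For context, a rough sketch of what such a query might look like is below -- 
   the column names (`total_amount`, `tip_amount`, `passenger_count`, `year`) 
   are assumptions based on the linked data dictionary, and the vignette's 
   actual code may differ:
   
   ```r
   # ds is the Dataset created earlier with open_dataset("nyc-taxi")
   ds %>%
     filter(total_amount > 100, year == 2015) %>%
     select(tip_amount, total_amount, passenger_count) %>%
     mutate(tip_pct = tip_amount / total_amount * 100) %>%
     group_by(passenger_count) %>%
     summarise(median_tip_pct = median(tip_pct), n = n()) %>%
     collect()
   ```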



##########
r/vignettes/dataset.Rmd:
##########
@@ -548,4 +451,11 @@ Most file formats have magic numbers which are written at 
the end.  This means a
 partial file write can safely be detected and discarded.  The CSV file format 
does
 not have any such concept and a partially written CSV file may be detected as 
valid.
 
+## Further reading
+
+- To learn about cloud storage, see the [cloud storage article](./fs.html).
+- To learn about `dplyr` with `arrow`, see the [data wrangling 
article](./data_wrangling.html).
+- To learn about reading and writing data, see the [read/write 
article](./read_write.html).
+- To manually enable cloud support on Linux, see the article on [installation 
on Linux](./install.html).
+- To learn about schemas and metadata, see the [metadata 
article](./metadata.html).

Review Comment:
   Do you reckon it might be worth linking to the cookbook chapter on datasets 
here too?



##########
r/vignettes/dataset.Rmd:
##########
@@ -1,157 +1,95 @@
 ---
-title: "Working with Arrow Datasets and dplyr"
+title: "Working with multi-file data sets"
+description: >
+  Learn how to use Datasets to read, write, and analyze 
+  multi-file larger-than-memory data
 output: rmarkdown::html_vignette
-vignette: >
-  %\VignetteIndexEntry{Working with Arrow Datasets and dplyr}
-  %\VignetteEngine{knitr::rmarkdown}
-  %\VignetteEncoding{UTF-8}
 ---
 
-Apache Arrow lets you work efficiently with large, multi-file datasets.
-The arrow R package provides a [dplyr](https://dplyr.tidyverse.org/) interface 
to Arrow Datasets,
-and other tools for interactive exploration of Arrow data.
-
-This vignette introduces Datasets and shows how to use dplyr to analyze them.
-
-## Example: NYC taxi data
-
-The [New York City taxi trip record 
data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
-is widely used in big data exercises and competitions.
-For demonstration purposes, we have hosted a Parquet-formatted version
-of about ten years of the trip data in a public Amazon S3 bucket.
-
-The total file size is around 37 gigabytes, even in the efficient Parquet file
-format. That's bigger than memory on most people's computers, so you can't just
-read it all in and stack it into a single data frame.
-
-In Windows and macOS binary packages, S3 support is included.
-On Linux, when installing from source, S3 support is not enabled by default,
-and it has additional system requirements.
-See `vignette("install", package = "arrow")` for details.
-To see if your arrow installation has S3 support, run:
+Apache Arrow lets you work efficiently with multi-file data sets, even when 
the data is too large to be loaded into memory. With the help of Arrow 
Dataset objects, you can analyze this kind of data using familiar 
[`dplyr`](https://dplyr.tidyverse.org/) syntax. This article introduces 
Datasets and shows you how to analyze them with `dplyr` and `arrow`: we'll 
start by ensuring both packages are loaded:
 
 ```{r}
-arrow::arrow_with_s3()
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
 ```
 
-Even with S3 support enabled, network speed will be a bottleneck unless your
-machine is located in the same AWS region as the data. So, for this vignette,
-we assume that the NYC taxi dataset has been downloaded locally in an 
"nyc-taxi"
-directory.
+## Example: NYC taxi data
 
-### Retrieving data from a public Amazon S3 bucket
+The primary motivation for multi-file Datasets is to allow users to analyze 
extremely large datasets. As an example, consider the [New York City taxi trip 
record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) 
that is widely used in big data exercises and competitions. To demonstrate the 
capabilities of Apache Arrow, we host a Parquet-formatted version of this data in a 
public Amazon S3 bucket: in its full form, our version of the data set is one 
very large table with about 1.7 billion rows and 24 columns, where each row 
corresponds to a single taxi ride sometime between 2009 and 2022. A [data 
dictionary](https://arrow-user2022.netlify.app/packages-and-data.html#data) for 
this version of the NYC taxi data is also available. 
 
-If your arrow build has S3 support, you can sync the data locally with:
+This data set comprises 158 distinct Parquet files, each corresponding 
to a month of data. A single file is typically around 400-500MB in size, and 
the full data set is about 70GB in size. It is not a small data set -- it is 
slow to download and does not fit in memory on a typical machine 🙂  -- so we 
also host a "tiny" version of the NYC taxi data that is formatted in exactly 
the same way but includes only one out of every thousand entries in the 
original data set (i.e., individual files are <1MB in size, and the "tiny" data 
set is only 70MB).
 
-```{r, eval = FALSE}
-arrow::copy_files("s3://voltrondata-labs-datasets/nyc-taxi", "nyc-taxi")
-# Alternatively, with GCS:
-arrow::copy_files("gs://voltrondata-labs-datasets/nyc-taxi", "nyc-taxi")
-```
+If you have Amazon S3 and/or Google Cloud Storage support enabled in `arrow` 
(true for most users; see links at the end of this article if you need to 
troubleshoot this), you can connect to the "tiny taxi data" with either of the 
following commands:
 
-If your arrow build doesn't have S3 support, you can download the files
-with the additional code shown below.  Since these are large files, 
-you may need to increase R's download timeout from the default of 60 seconds, 
e.g.
-`options(timeout = 300)`.
-
-```{r, eval = FALSE}
-bucket <- "https://voltrondata-labs-datasets.s3.us-east-2.amazonaws.com";
-for (year in 2009:2022) {
-  if (year == 2022) {
-    # We only have through Feb 2022 there
-    months <- 1:2
-  } else {
-    months <- 1:12
-  }
-  for (month in months) {
-    dataset_path <- file.path("nyc-taxi", paste0("year=", year), paste0("month=", month))
-    dir.create(dataset_path, recursive = TRUE)
-    try(download.file(
-      paste(bucket, dataset_path, "part-0.parquet", sep = "/"),
-      file.path(dataset_path, "part-0.parquet"),
-      mode = "wb"
-    ), silent = TRUE)
-  }
-}
+```r
+bucket <- s3_bucket("voltrondata-labs-datasets/nyc-taxi-tiny")
+bucket <- gs_bucket("voltrondata-labs-datasets/nyc-taxi-tiny", anonymous = TRUE)
 ```
 
-Note that these download steps in the vignette are not executed: if you want 
to run
-with live data, you'll have to do it yourself separately.
-Given the size, if you're running this locally and don't have a fast 
connection,
-feel free to grab only a year or two of data.
+If you want to use the full data set, replace `nyc-taxi-tiny` with `nyc-taxi` 
in the code above. Apart from size -- and with it the cost in time, bandwidth 
usage, and CPU cycles -- there is no difference between the two versions of the 
data: you can test your code using the tiny taxi data and then check how it 
scales using the full data set.
 
-If you don't have the taxi data downloaded, the vignette will still run and 
will
-yield previously cached output for reference. To be explicit about which 
version
-is running, let's check whether you're running with live data:
+To copy the data set stored in `bucket` to a local folder called 
`"nyc-taxi"`, use the `copy_files()` function:
 
-```{r}
-dir.exists("nyc-taxi")
+```r
+copy_files(from = bucket, to = "nyc-taxi")
 ```
 
-## Opening the dataset
+For the purposes of this article, we assume that the NYC taxi dataset (either 
the full data or the tiny version) has been downloaded locally and exists in an 
`"nyc-taxi"` directory. 
 
-Because dplyr is not necessary for many Arrow workflows,
-it is an optional (`Suggests`) dependency. So, to work with Datasets,
-you need to load both arrow and dplyr.
+## Opening Datasets
 
-```{r}
-library(arrow, warn.conflicts = FALSE)
-library(dplyr, warn.conflicts = FALSE)
-```
-
-The first step is to create a Dataset object, pointing at the directory of 
data.
+The first step in the process is to create a Dataset object that points at the 
data directory:
 
 ```{r, eval = file.exists("nyc-taxi")}
 ds <- open_dataset("nyc-taxi")
 ```
 
-The file format for `open_dataset()` is controlled by the `format` parameter, 
-which has a default value of `"parquet"`.  If you had a directory
-of Arrow format files, you could instead specify `format = "arrow"` in the 
call.
+It is important to note that when we do this, the data values are not loaded 
into memory. Instead, Arrow scans the data directory to find relevant files, 
parses the file paths looking for a "Hive-style partitioning" (see below), and 
reads headers of the data files to construct a Schema that contains metadata 
describing the structure of the data. For more information about Schemas see 
the [metadata article](./metadata.html).
 
-Other supported formats include: 
+Two questions naturally follow from this: what kind of files does 
`open_dataset()` look for, and what structure does it expect to find in the 
file paths? Let's start by looking at the file types.
 
-* `"feather"` or `"ipc"` (aliases for `"arrow"`, as Feather v2 is the Arrow 
file format)
+By default, `open_dataset()` looks for Parquet files, but you can override 
this using the `format` argument. For example, if the data were encoded as 
CSV files, we could set `format = "csv"` to connect to the data. The Arrow 
Dataset interface supports several file formats, including: 

Review Comment:
   +1 for explicitly mentioning parquet is default and the others are not
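
   A minimal sketch of what that looks like in practice (the `"csv-data"` path 
   is made up for illustration):
   
   ```r
   library(arrow, warn.conflicts = FALSE)
   
   # Parquet is the default, so these two calls are equivalent
   ds <- open_dataset("nyc-taxi")
   ds <- open_dataset("nyc-taxi", format = "parquet")
   
   # a directory of CSV files needs the format stated explicitly
   ds_csv <- open_dataset("csv-data", format = "csv")
   ```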



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
