[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

GitBox Wed, 26 Oct 2022 04:33:48 -0700


thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1005556357



##########
r/vignettes/cloud_storage.Rmd:
##########
@@ -0,0 +1,329 @@
+---
+title: "Using cloud storage (S3, GCS)"
+description: >
+  Learn how to work with data sets stored in an 
+  Amazon S3 bucket or on Google Cloud Storage 
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Using cloud storage (S3, GCS)}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+Working with data stored in cloud storage systems like [Amazon Simple Storage 
Service](https://docs.aws.amazon.com/s3/) (S3) and [Google Cloud 
Storage](https://cloud.google.com/storage/docs) (GCS) is a very common task. 
Because of this, the Arrow C++ library provides a toolkit aimed to make it as 
simple to work with cloud storage as it is to work with the local filesystem.

Review Comment:
   TIL why S3 is called S3 :D



##########
r/vignettes/cloud_storage.Rmd:
##########
@@ -0,0 +1,329 @@
+---
+title: "Using cloud storage (S3, GCS)"
+description: >
+  Learn how to work with data sets stored in an 
+  Amazon S3 bucket or on Google Cloud Storage 
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Using cloud storage (S3, GCS)}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+Working with data stored in cloud storage systems like [Amazon Simple Storage 
Service](https://docs.aws.amazon.com/s3/) (S3) and [Google Cloud 
Storage](https://cloud.google.com/storage/docs) (GCS) is a very common task. 
Because of this, the Arrow C++ library provides a toolkit aimed to make it as 
simple to work with cloud storage as it is to work with the local filesystem.
+
+To make this work, the Arrow C++ library contains a general-purpose interface 
for file systems, and the `arrow` package exposes this interface to R users. 
For instance, if you want to you can create a `LocalFileSystem` object that 
allows you to interact with the local file system in the usual ways: copying, 
moving, and deleting files, obtaining information about files and folders, and 
so on (see `help("FileSystem", package = "arrow")` for details). In general you 
probably don't need this functionality because you already have tools for 
working with your local file system, but this interface becomes much more 
useful in the context of remote file systems. Currently there is a specific 
implementation for Amazon S3 provided by the `S3FileSystem` class, and another 
one for Google Cloud Storage provided by `GcsFileSystem`.
+
+This vignette provides an overview of working with both S3 and GCS data using 
the Arrow toolkit. 
+
+## S3 and GCS support on Linux
+
+Before you start, make sure that your `arrow` install has support for S3 
and/or GCS enabled. For most users this will be true by default, because the 
Windows and MacOS binary packages hosted on CRAN include S3 and GCS support. 
You can check whether support is enabled via helper functions:
+
+```r
+arrow_with_s3()
+arrow_with_gcs()
+```
+
+If these return `TRUE` then the relevant support is enabled.
+
+In some cases you may find that your system does not have support enabled. The 
most common case for this occurs on Linux when installing `arrow` from source. 
In this situation S3 and GCS support is not always enabled by default, and 
there are additional system requirements involved. See 
`vignette("install_linux", package = "arrow")` for details on how to resolve 
this.
+
+## Connecting to cloud storage
+
+One way of working with filesystems is to create `?FileSystem` objects. 
+`?S3FileSystem` objects can be created with the `s3_bucket()` function, which
+automatically detects the bucket's AWS region. Similarly, `?GcsFileSystem` 
objects
+can be created with the `gs_bucket()` function. The resulting
+`FileSystem` will consider paths relative to the bucket's path (so for example
+you don't need to prefix the bucket path when listing a directory).
+
+With a `FileSystem` object, you can point to specific files in it with the 
`$path()` method
+and pass the result to file readers and writers (`read_parquet()`, 
`write_feather()`, et al.).
+
+Often the reason users work with cloud storage in real world analysis is to 
access large data sets. An example of this is discussed in `vignette("dataset", 
package = "arrow")`, but new users may prefer to work with a much smaller data 
set while learning how the `arrow` cloud storage interface works. To that end, 
the examples in this vignette rely on a multi-file Parquet dataset that stores 
a copy of the `diamonds` data made available through the 
[`ggplot2`](https://ggplot2.tidyverse.org/) package, documented in 
`help("diamonds", package = "ggplot2")`. The cloud storage version of this data 
set consists of 5 Parquet files totaling less than 1MB in size.
+
+The diamonds data set is hosted on both S3 and GCS, in a bucket named 
`voltrondata-labs-datasets`. To create an S3FileSystem object that refers to 
that bucket, use the following command:
+
+```r
+bucket <- s3_bucket("voltrondata-labs-datasets")
+```
+
+To do this for the GCS version of the data, the command is as follows:
+
+```r
+bucket <- gs_bucket("voltrondata-labs-datasets", anonymous = TRUE)
+```
+
+Note that `anonymous = TRUE` is required for GCS if credentials have not been 
configured. 
+
+<!-- TODO: update GCS note above if ARROW-17097 is addressed -->
+
+Within this bucket there is a folder called `diamonds`. We can call 
`bucket$ls("diamonds")` to list the files stored in this folder, or 
`bucket$ls("diamonds", recursive = TRUE)` to recursively search subfolders. 
Note that on GCS, you should always set `recursive = TRUE` because directories 
often don't appear in the results.
+
+Here's what we get when we list the files stored in the GCS bucket:
+
+``` r
+bucket$ls("diamonds", recursive = TRUE)
+```
+
+``` r
+## [1] "diamonds/cut=Fair/part-0.parquet"     
+## [2] "diamonds/cut=Good/part-0.parquet"     
+## [3] "diamonds/cut=Ideal/part-0.parquet"    
+## [4] "diamonds/cut=Premium/part-0.parquet"  
+## [5] "diamonds/cut=Very Good/part-0.parquet"
+```
+
+There are 5 Parquet files here, one corresponding to each of the "cut" 
categories in the `diamonds` data set. We can specify the path to a specific 
file by calling `bucket$path()`:
+
+``` r
+parquet_good <- bucket$path("diamonds/cut=Good/part-0.parquet")
+```
+
+We can use `read_parquet()` to read from this path directly into R:
+
+``` r
+diamonds_good <- read_parquet(parquet_good)
+diamonds_good
+```
+
+``` r
+## # A tibble: 4,906 × 9
+##    carat color clarity depth table price     x     y     z
+##    <dbl> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
+##  1  0.23 E     VS1      56.9    65   327  4.05  4.07  2.31
+##  2  0.31 J     SI2      63.3    58   335  4.34  4.35  2.75
+##  3  0.3  J     SI1      64      55   339  4.25  4.28  2.73
+##  4  0.3  J     SI1      63.4    54   351  4.23  4.29  2.7 
+##  5  0.3  J     SI1      63.8    56   351  4.23  4.26  2.71
+##  6  0.3  I     SI2      63.3    56   351  4.26  4.3   2.71
+##  7  0.23 F     VS1      58.2    59   402  4.06  4.08  2.37
+##  8  0.23 E     VS1      64.1    59   402  3.83  3.85  2.46
+##  9  0.31 H     SI1      64      54   402  4.29  4.31  2.75
+## 10  0.26 D     VS2      65.2    56   403  3.99  4.02  2.61
+## # … with 4,896 more rows
+## # ℹ Use `print(n = ...)` to see more rows
+```
+
+Note that this will be slower to read than if the file were local.
+
+<!-- though if you're running on a machine in the same AWS region as the file 
in S3,
+the cost of reading the data over the network should be much lower. -->
+
+
+<!--
+See `help(FileSystem)` for a list of options that 
`s3_bucket()`/`S3FileSystem$create()`
+and `gs_bucket()`/`GcsFileSystem$create()` can take.
+
+The object that `s3_bucket()` and `gs_bucket()` return is technically a 
`SubTreeFileSystem`, 
+which holds a path and a file system to which it corresponds. 
`SubTreeFileSystem`s can be 
+useful for holding a reference to a subdirectory somewhere (on S3, GCS, or 
elsewhere).
+
+One way to get a subtree is to call the `$cd()` method on a `FileSystem`
+
+```r
+june2019 <- bucket$cd("nyc-taxi/year=2019/month=6")
+df <- read_parquet(june2019$path("part-0.parquet"))
+```
+
+`SubTreeFileSystem` can also be made from a URI:
+
+```r
+june2019 <- 
SubTreeFileSystem$create("s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6")
+```
+-->
+
+
+
+## Connecting directly with a URI
+
+In most use cases, the easiest and most natural way to connect to cloud 
storage in `arrow` is to use the FileSystem objects returned by `s3_bucket()` 
and `gs_bucket()`, especially when multiple file operations are required. 
However, in some cases you may want to download a file directly by specifying 
the URI. This is permitted by `arrow`, and functions like `read_parquet()`, 
`write_feather()`, `open_dataset()` etc will all accept URIs to cloud resources 
hosted on S3 or GCS. The format of an S3 URI is as follows:
+
+```
+s3://[access_key:secret_key@]bucket/path[?region=]
+```
+
+For GCS, the URI format looks like this:
+
+```
+gs://[access_key:secret_key@]bucket/path
+gs://anonymous@bucket/path
+```
+
+For example, the Parquet file storing the "good cut" diamonds that we 
downloaded earlier in the vignette is available on both S3 and CGS. The 
relevant URIs are as follows:
+
+```r
+uri <- "s3://voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet"
+uri <- 
"gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet"
+```
+
+Note that "anonymous" is required on GCS for public buckets. Regardless of 
which version you use, you can pass this URI to `read_parquet()` as if the file 
were stored locally:
+
+```r
+df <- read_parquet(uri)
+```
+
+URIs accept additional options in the query parameters (the part after the `?`)
+that are passed down to configure the underlying file system. They are 
separated 
+by `&`. For example,
+
+```
+s3://voltrondata-labs-datasets/?endpoint_override=https%3A%2F%2Fstorage.googleapis.com&allow_bucket_creation=true
+```
+
+is equivalent to:
+
+```r
+bucket <- S3FileSystem$create(
+  endpoint_override="https://storage.googleapis.com";,
+  allow_bucket_creation=TRUE
+)
+bucket$path("voltrondata-labs-datasets/")
+```
+
+Both tell the `S3FileSystem` object that it should allow the creation of new 
buckets 
+and to talk to Google Storage instead of S3. The latter works because GCS 
implements an 
+S3-compatible API -- see [File systems that emulate 
S3](#file-systems-that-emulate-s3) 
+below -- but if you want better support for GCS you should refer to a 
`GcsFileSystem` 
+bt using a URI that starts with `gs://`. 

Review Comment:
   ```suggestion
   but using a URI that starts with `gs://`. 
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

Reply via email to