This is an automated email from the ASF dual-hosted git repository.
kszucs pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/master by this push:
new 42647dcd00 ARROW-16887: [R][Docs] Update Filesystem Vignette for GCS (#13601)
42647dcd00 is described below
commit 42647dcd00d2ac4f92593c9ce54b05fe8322c91a
Author: Will Jones <[email protected]>
AuthorDate: Mon Jul 25 14:09:55 2022 -0700
ARROW-16887: [R][Docs] Update Filesystem Vignette for GCS (#13601)
This PR:
* Replaces all references to the Ursa Labs bucket with the new
`voltrondata-labs-datasets` bucket.
* Adds a new `gs_bucket()` function to R, which parallels the design of
the `s3_bucket()` function.
* Updates the `fs.Rmd` vignette to also discuss GCS. Left discussion about
authentication for a follow-up, partly because I am still confused how GCP auth
works for non-application use.
Authored-by: Will Jones <[email protected]>
Signed-off-by: Krisztián Szűcs <[email protected]>
---
cpp/src/arrow/filesystem/s3fs_test.cc | 2 +-
docs/source/python/dataset.rst | 8 +-
python/pyarrow/_s3fs.pyx | 4 +-
python/pyarrow/tests/test_fs.py | 12 ++-
r/.gitignore | 1 +
r/NAMESPACE | 1 +
r/R/filesystem.R | 44 +++++++-
r/_pkgdown.yml | 3 +-
r/man/FileSystem.Rd | 21 ++++
r/man/gs_bucket.Rd | 27 +++++
r/man/s3_bucket.Rd | 2 +-
r/tests/testthat/test-filesystem.R | 21 +++-
r/vignettes/dataset.Rmd | 98 ++++++++++--------
r/vignettes/fs.Rmd | 186 +++++++++++++++++++++++++---------
14 files changed, 319 insertions(+), 111 deletions(-)
diff --git a/cpp/src/arrow/filesystem/s3fs_test.cc b/cpp/src/arrow/filesystem/s3fs_test.cc
index 7216af297a..1d89e2da71 100644
--- a/cpp/src/arrow/filesystem/s3fs_test.cc
+++ b/cpp/src/arrow/filesystem/s3fs_test.cc
@@ -322,7 +322,7 @@ TEST_F(S3OptionsTest, FromAssumeRole) {
class S3RegionResolutionTest : public AwsTestMixin {};
TEST_F(S3RegionResolutionTest, PublicBucket) {
- ASSERT_OK_AND_EQ("us-east-2", ResolveS3BucketRegion("ursa-labs-taxi-data"));
+ ASSERT_OK_AND_EQ("us-east-2", ResolveS3BucketRegion("voltrondata-labs-datasets"));
// Taken from a registry of open S3-hosted datasets
// at https://github.com/awslabs/open-data-registry
diff --git a/docs/source/python/dataset.rst b/docs/source/python/dataset.rst
index 4808457355..2ac592d8d0 100644
--- a/docs/source/python/dataset.rst
+++ b/docs/source/python/dataset.rst
@@ -355,7 +355,7 @@ specifying a S3 path:
.. code-block:: python
- dataset = ds.dataset("s3://ursa-labs-taxi-data/", partitioning=["year", "month"])
+ dataset = ds.dataset("s3://voltrondata-labs-datasets/nyc-taxi/")
Typically, you will want to customize the connection parameters, and then
a file system object can be created and passed to the ``filesystem`` keyword:
@@ -365,8 +365,7 @@ a file system object can be created and passed to the ``filesystem`` keyword:
from pyarrow import fs
s3 = fs.S3FileSystem(region="us-east-2")
- dataset = ds.dataset("ursa-labs-taxi-data/", filesystem=s3,
- partitioning=["year", "month"])
+ dataset = ds.dataset("voltrondata-labs-datasets/nyc-taxi/", filesystem=s3)
The currently available classes are :class:`~pyarrow.fs.S3FileSystem` and
:class:`~pyarrow.fs.HadoopFileSystem`. See the :ref:`filesystem` docs for more
@@ -387,8 +386,7 @@ useful for testing or benchmarking.
# By default, MinIO will listen for unencrypted HTTP traffic.
minio = fs.S3FileSystem(scheme="http", endpoint_override="localhost:9000")
- dataset = ds.dataset("ursa-labs-taxi-data/", filesystem=minio,
- partitioning=["year", "month"])
+ dataset = ds.dataset("voltrondata-labs-datasets/nyc-taxi/", filesystem=minio)
Working with Parquet Datasets
diff --git a/python/pyarrow/_s3fs.pyx b/python/pyarrow/_s3fs.pyx
index d9335995dc..f668038e62 100644
--- a/python/pyarrow/_s3fs.pyx
+++ b/python/pyarrow/_s3fs.pyx
@@ -74,8 +74,8 @@ def resolve_s3_region(bucket):
Examples
--------
- >>> fs.resolve_s3_region('registry.opendata.aws')
- 'us-east-1'
+ >>> fs.resolve_s3_region('voltrondata-labs-datasets')
+ 'us-east-2'
"""
cdef:
c_string c_bucket
diff --git a/python/pyarrow/tests/test_fs.py b/python/pyarrow/tests/test_fs.py
index 41c242ff83..05ebf4ed4c 100644
--- a/python/pyarrow/tests/test_fs.py
+++ b/python/pyarrow/tests/test_fs.py
@@ -1616,15 +1616,17 @@ def test_s3_real_aws():
assert fs.region == default_region
fs = S3FileSystem(anonymous=True, region='us-east-2')
- entries = fs.get_file_info(FileSelector('ursa-labs-taxi-data'))
+ entries = fs.get_file_info(FileSelector(
+ 'voltrondata-labs-datasets/nyc-taxi'))
assert len(entries) > 0
- with fs.open_input_stream('ursa-labs-taxi-data/2019/06/data.parquet') as f:
+ key = 'voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet'
+ with fs.open_input_stream(key) as f:
md = f.metadata()
assert 'Content-Type' in md
- assert md['Last-Modified'] == b'2020-01-17T16:26:28Z'
+ assert md['Last-Modified'] == b'2022-07-12T23:32:00Z'
# For some reason, the header value is quoted
# (both with AWS and Minio)
- assert md['ETag'] == b'"f1efd5d76cb82861e1542117bfa52b90-8"'
+ assert md['ETag'] == b'"4c6a76826a695c6ac61592bc30cda3df-16"'
@pytest.mark.s3
@@ -1653,7 +1655,7 @@ def test_s3_real_aws_region_selection():
@pytest.mark.s3
def test_resolve_s3_region():
from pyarrow.fs import resolve_s3_region
- assert resolve_s3_region('ursa-labs-taxi-data') == 'us-east-2'
+ assert resolve_s3_region('voltrondata-labs-datasets') == 'us-east-2'
assert resolve_s3_region('mf-nwp-models') == 'eu-west-1'
with pytest.raises(ValueError, match="Not a valid bucket name"):
diff --git a/r/.gitignore b/r/.gitignore
index 695e42b759..e607d2662f 100644
--- a/r/.gitignore
+++ b/r/.gitignore
@@ -18,6 +18,7 @@ vignettes/nyc-taxi/
arrow_*.tar.gz
arrow_*.tgz
extra-tests/files
+.deps
# C++ sources for an offline build. They're copied from the ../cpp directory, so ignore them here.
/tools/cpp/
diff --git a/r/NAMESPACE b/r/NAMESPACE
index 733261f33c..17f404caa1 100644
--- a/r/NAMESPACE
+++ b/r/NAMESPACE
@@ -304,6 +304,7 @@ export(float)
export(float16)
export(float32)
export(float64)
+export(gs_bucket)
export(halffloat)
export(hive_partition)
export(infer_type)
diff --git a/r/R/filesystem.R b/r/R/filesystem.R
index 3cebbc30c8..2f0b1cfd58 100644
--- a/r/R/filesystem.R
+++ b/r/R/filesystem.R
@@ -155,6 +155,26 @@ FileSelector$create <- function(base_dir, allow_not_found = FALSE, recursive = F
#' - `allow_bucket_deletion`: logical, if TRUE, the filesystem will delete
#' buckets if`$DeleteDir()` is called on the bucket level (default `FALSE`).
#'
+#' `GcsFileSystem$create()` optionally takes arguments:
+#'
+#' - `anonymous`: logical, default `FALSE`. If true, will not attempt to look up
+#' credentials using standard GCS configuration methods.
+#' - `access_token`: optional string for authentication. Should be provided along
+#' with `expiration`
+#' - `expiration`: optional date representing point at which `access_token` will
+#' expire.
+#' - `json_credentials`: optional string for authentication. Point to a JSON
+#' credentials file downloaded from GCS.
+#' - `endpoint_override`: if non-empty, will connect to provided host name / port,
+#' such as "localhost:9001", instead of default GCS ones. This is primarily useful
+#' for testing purposes.
+#' - `scheme`: connection transport (default "https")
+#' - `default_bucket_location`: the default location (or "region") to create new
+#' buckets in.
+#' - `retry_limit_seconds`: the maximum amount of time to spend retrying if
+#' the filesystem encounters errors. Default is 15 seconds.
+#' - `default_metadata`: default metadata to write in new objects.
+#'
#' @section Methods:
#'
#' - `$GetFileInfo(x)`: `x` may be a [FileSelector][FileSelector] or a character
@@ -426,7 +446,7 @@ default_s3_options <- list(
#' relative path. Note that this function's success does not guarantee that you
#' are authorized to access the bucket's contents.
#' @examplesIf FALSE
-#' bucket <- s3_bucket("ursa-labs-taxi-data")
+#' bucket <- s3_bucket("voltrondata-labs-datasets")
#' @export
s3_bucket <- function(bucket, ...) {
assert_that(is.string(bucket))
@@ -448,6 +468,28 @@ s3_bucket <- function(bucket, ...) {
SubTreeFileSystem$create(fs_and_path$path, fs)
}
+#' Connect to a Google Cloud Storage (GCS) bucket
+#'
+#' `gs_bucket()` is a convenience function to create a `GcsFileSystem` object
+#' that holds onto its relative path
+#'
+#' @param bucket string GCS bucket name or path
+#' @param ... Additional connection options, passed to `GcsFileSystem$create()`
+#' @return A `SubTreeFileSystem` containing a `GcsFileSystem` and the bucket's
+#' relative path. Note that this function's success does not guarantee that you
+#' are authorized to access the bucket's contents.
+#' @examplesIf FALSE
+#' bucket <- gs_bucket("voltrondata-labs-datasets")
+#' @export
+gs_bucket <- function(bucket, ...) {
+ assert_that(is.string(bucket))
+ args <- list2(...)
+
+ fs <- exec(GcsFileSystem$create, !!!args)
+
+ SubTreeFileSystem$create(bucket, fs)
+}
+
#' @usage NULL
#' @format NULL
#' @rdname FileSystem
diff --git a/r/_pkgdown.yml b/r/_pkgdown.yml
index 8865421c0b..dfb0998ddf 100644
--- a/r/_pkgdown.yml
+++ b/r/_pkgdown.yml
@@ -90,7 +90,7 @@ navbar:
href: articles/install.html
- text: Working with Arrow Datasets and dplyr
href: articles/dataset.html
- - text: Working with Cloud Storage (S3)
+ - text: Working with Cloud Storage (S3, GCS)
href: articles/fs.html
- text: Apache Arrow in Python and R with reticulate
href: articles/python.html
@@ -198,6 +198,7 @@ reference:
- title: File systems
contents:
- s3_bucket
+ - gs_bucket
- FileSystem
- FileInfo
- FileSelector
diff --git a/r/man/FileSystem.Rd b/r/man/FileSystem.Rd
index 41d9e92514..f4f6cb57ff 100644
--- a/r/man/FileSystem.Rd
+++ b/r/man/FileSystem.Rd
@@ -56,6 +56,27 @@ buckets if \verb{$CreateDir()} is called on the bucket level (default \code{FALS
\item \code{allow_bucket_deletion}: logical, if TRUE, the filesystem will delete
buckets if\verb{$DeleteDir()} is called on the bucket level (default \code{FALSE}).
}
+
+\code{GcsFileSystem$create()} optionally takes arguments:
+\itemize{
+\item \code{anonymous}: logical, default \code{FALSE}. If true, will not attempt to look up
+credentials using standard GCS configuration methods.
+\item \code{access_token}: optional string for authentication. Should be provided along
+with \code{expiration}
+\item \code{expiration}: optional date representing point at which \code{access_token} will
+expire.
+\item \code{json_credentials}: optional string for authentication. Point to a JSON
+credentials file downloaded from GCS.
+\item \code{endpoint_override}: if non-empty, will connect to provided host name / port,
+such as "localhost:9001", instead of default GCS ones. This is primarily useful
+for testing purposes.
+\item \code{scheme}: connection transport (default "https")
+\item \code{default_bucket_location}: the default location (or "region") to create new
+buckets in.
+\item \code{retry_limit_seconds}: the maximum amount of time to spend retrying if
+the filesystem encounters errors. Default is 15 seconds.
+\item \code{default_metadata}: default metadata to write in new objects.
+}
}
\section{Methods}{
diff --git a/r/man/gs_bucket.Rd b/r/man/gs_bucket.Rd
new file mode 100644
index 0000000000..7dc39a42c3
--- /dev/null
+++ b/r/man/gs_bucket.Rd
@@ -0,0 +1,27 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/filesystem.R
+\name{gs_bucket}
+\alias{gs_bucket}
+\title{Connect to a Google Cloud Storage (GCS) bucket}
+\usage{
+gs_bucket(bucket, ...)
+}
+\arguments{
+\item{bucket}{string GCS bucket name or path}
+
+\item{...}{Additional connection options, passed to \code{GcsFileSystem$create()}}
+}
+\value{
A \code{SubTreeFileSystem} containing a \code{GcsFileSystem} and the bucket's
+relative path. Note that this function's success does not guarantee that you
+are authorized to access the bucket's contents.
+}
+\description{
+\code{gs_bucket()} is a convenience function to create a \code{GcsFileSystem} object
+that holds onto its relative path
+}
+\examples{
+\dontshow{if (FALSE) (if (getRversion() >= "3.4") withAutoprint else force)(\{ # examplesIf}
+bucket <- gs_bucket("voltrondata-labs-datasets")
+\dontshow{\}) # examplesIf}
+}
diff --git a/r/man/s3_bucket.Rd b/r/man/s3_bucket.Rd
index 7baeb49b69..2ab7d4962e 100644
--- a/r/man/s3_bucket.Rd
+++ b/r/man/s3_bucket.Rd
@@ -23,6 +23,6 @@ relative path.
}
\examples{
\dontshow{if (FALSE) (if (getRversion() >= "3.4") withAutoprint else force)(\{ # examplesIf}
-bucket <- s3_bucket("ursa-labs-taxi-data")
+bucket <- s3_bucket("voltrondata-labs-datasets")
\dontshow{\}) # examplesIf}
}
diff --git a/r/tests/testthat/test-filesystem.R b/r/tests/testthat/test-filesystem.R
index 1852634ac9..7957743a2a 100644
--- a/r/tests/testthat/test-filesystem.R
+++ b/r/tests/testthat/test-filesystem.R
@@ -147,7 +147,7 @@ test_that("FileSystem$from_uri", {
skip_on_cran()
skip_if_not_available("s3")
skip_if_offline()
- fs_and_path <- FileSystem$from_uri("s3://ursa-labs-taxi-data")
+ fs_and_path <- FileSystem$from_uri("s3://voltrondata-labs-datasets")
expect_r6_class(fs_and_path$fs, "S3FileSystem")
expect_identical(fs_and_path$fs$region, "us-east-2")
})
@@ -156,11 +156,11 @@ test_that("SubTreeFileSystem$create() with URI", {
skip_on_cran()
skip_if_not_available("s3")
skip_if_offline()
- fs <- SubTreeFileSystem$create("s3://ursa-labs-taxi-data")
+ fs <- SubTreeFileSystem$create("s3://voltrondata-labs-datasets")
expect_r6_class(fs, "SubTreeFileSystem")
expect_identical(
capture.output(print(fs)),
- "SubTreeFileSystem: s3://ursa-labs-taxi-data/"
+ "SubTreeFileSystem: s3://voltrondata-labs-datasets/"
)
})
@@ -187,6 +187,19 @@ test_that("s3_bucket", {
capture.output(print(bucket)),
"SubTreeFileSystem: s3://ursa-labs-r-test/"
)
- skip_on_os("windows") # FIXME
expect_identical(bucket$base_path, "ursa-labs-r-test/")
})
+
+test_that("gs_bucket", {
+ skip_on_cran()
+ skip_if_not_available("gcs")
+ skip_if_offline()
+ bucket <- gs_bucket("voltrondata-labs-datasets")
+ expect_r6_class(bucket, "SubTreeFileSystem")
+ expect_r6_class(bucket$base_fs, "GcsFileSystem")
+ expect_identical(
+ capture.output(print(bucket)),
+ "SubTreeFileSystem: gs://voltrondata-labs-datasets/"
+ )
+ expect_identical(bucket$base_path, "voltrondata-labs-datasets/")
+})
diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd
index 5c430c4be0..1a969f979c 100644
--- a/r/vignettes/dataset.Rmd
+++ b/r/vignettes/dataset.Rmd
@@ -44,7 +44,9 @@ directory.
If your arrow build has S3 support, you can sync the data locally with:
```{r, eval = FALSE}
-arrow::copy_files("s3://ursa-labs-taxi-data", "nyc-taxi")
+arrow::copy_files("s3://voltrondata-labs-datasets/nyc-taxi", "nyc-taxi")
+# Alternatively, with GCS:
+arrow::copy_files("gs://voltrondata-labs-datasets/nyc-taxi", "nyc-taxi")
```
If your arrow build doesn't have S3 support, you can download the files
@@ -53,7 +55,7 @@ you may need to increase R's download timeout from the default of 60 seconds, e.
`options(timeout = 300)`.
```{r, eval = FALSE}
-bucket <- "https://ursa-labs-taxi-data.s3.us-east-2.amazonaws.com"
+bucket <- "https://voltrondata-labs-datasets.s3.us-east-2.amazonaws.com"
for (year in 2009:2019) {
if (year == 2019) {
# We only have through June 2019 there
@@ -64,8 +66,8 @@ for (year in 2009:2019) {
for (month in sprintf("%02d", months)) {
dir.create(file.path("nyc-taxi", year, month), recursive = TRUE)
try(download.file(
- paste(bucket, year, month, "data.parquet", sep = "/"),
- file.path("nyc-taxi", year, month, "data.parquet"),
+ paste(bucket, "nyc-taxi", paste0("year=", year), paste0("month=", month), "data.parquet", sep = "/"),
+ file.path("nyc-taxi", paste0("year=", year), paste0("month=", month), "data.parquet"),
mode = "wb"
), silent = TRUE)
}
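For reference, the set of object URLs the R loop above constructs can be sketched in Python. The bucket URL and Hive-style key layout come from the diff; the helper name is invented for illustration, and this sketch is not part of the commit:

```python
# Sketch: enumerate the object URLs the R download loop above constructs.
# Layout as shown in the diff: nyc-taxi/year=YYYY/month=MM/data.parquet,
# with data running from 2009 through June 2019.
BASE = "https://voltrondata-labs-datasets.s3.us-east-2.amazonaws.com"

def taxi_urls():
    urls = []
    for year in range(2009, 2020):
        # Only January through June exist for 2019
        months = range(1, 7) if year == 2019 else range(1, 13)
        for month in months:
            urls.append(f"{BASE}/nyc-taxi/year={year}/month={month:02d}/data.parquet")
    return urls

urls = taxi_urls()
print(len(urls))  # 126 files
print(urls[0])
```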
@@ -99,7 +101,7 @@ library(dplyr, warn.conflicts = FALSE)
The first step is to create a Dataset object, pointing at the directory of data.
```{r, eval = file.exists("nyc-taxi")}
-ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
+ds <- open_dataset("nyc-taxi")
```
The file format for `open_dataset()` is controlled by the `format` parameter,
@@ -122,9 +124,18 @@ For text files, you can pass the following parsing options to `open_dataset()`:
For more information on the usage of these parameters, see `?read_delim_arrow()`.
-The `partitioning` argument lets you specify how the file paths provide information
-about how the dataset is chunked into different files. The files in this example
-have file paths like
+`open_dataset()` was able to automatically infer column values for `year` and `month`
+--which are not present in the data files--based on the directory structure. The
+Hive-style partitioning structure is self-describing, with file paths like
+
+```
+year=2009/month=1/data.parquet
+year=2009/month=2/data.parquet
+...
+```
+
+But sometimes the directory partitioning isn't self-describing; that is, it doesn't
+contain field names. For example, if instead we had file paths like
```
2009/01/data.parquet
@@ -132,12 +143,13 @@ have file paths like
...
```
-By providing `c("year", "month")` to the `partitioning` argument, you're saying that the first
-path segment gives the value for `year`, and the second segment is `month`.
-Every row in `2009/01/data.parquet` has a value of 2009 for `year`
+then `open_dataset()` would need some hints as to how to use the file paths. In this
+case, you could provide `c("year", "month")` to the `partitioning` argument,
+saying that the first path segment gives the value for `year`, and the second
+segment is `month`. Every row in `2009/01/data.parquet` has a value of 2009 for `year`
and 1 for `month`, even though those columns may not be present in the file.
-Indeed, when you look at the dataset, you can see that in addition to the columns present
+In either case, when you look at the dataset, you can see that in addition to the columns present
in every file, there are also columns `year` and `month` even though they are
not present in the files themselves.
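The distinction the vignette draws can be illustrated with a minimal Python sketch. The parser functions below are invented for illustration only; Arrow's real partitioning discovery is implemented in the C++ library:

```python
# Sketch: Hive-style paths carry their own field names, while plain
# directory partitioning needs the field names supplied by the caller.
def parse_hive(path):
    """Extract key=value segments from a Hive-style path."""
    fields = {}
    for seg in path.split("/"):
        if "=" in seg:
            key, _, value = seg.partition("=")
            fields[key] = value
    return fields

def parse_directory(path, field_names):
    """Plain directory partitioning: caller must name the fields."""
    segs = [s for s in path.split("/") if "." not in s]  # drop the file name
    return dict(zip(field_names, segs))

print(parse_hive("year=2009/month=1/data.parquet"))
# vs. the non-self-describing layout, which needs c("year", "month"):
print(parse_directory("2009/01/data.parquet", ["year", "month"]))
```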
```{r, eval = file.exists("nyc-taxi")}
@@ -145,29 +157,31 @@ ds
```
```{r, echo = FALSE, eval = !file.exists("nyc-taxi")}
cat("
-FileSystemDataset with 125 Parquet files
-vendor_id: string
-pickup_at: timestamp[us]
-dropoff_at: timestamp[us]
-passenger_count: int8
-trip_distance: float
-pickup_longitude: float
-pickup_latitude: float
-rate_code_id: null
-store_and_fwd_flag: string
-dropoff_longitude: float
-dropoff_latitude: float
+FileSystemDataset with 158 Parquet files
+vendor_name: string
+pickup_datetime: timestamp[ms]
+dropoff_datetime: timestamp[ms]
+passenger_count: int64
+trip_distance: double
+pickup_longitude: double
+pickup_latitude: double
+rate_code: string
+store_and_fwd: string
+dropoff_longitude: double
+dropoff_latitude: double
payment_type: string
-fare_amount: float
-extra: float
-mta_tax: float
-tip_amount: float
-tolls_amount: float
-total_amount: float
+fare_amount: double
+extra: double
+mta_tax: double
+tip_amount: double
+tolls_amount: double
+total_amount: double
+improvement_surcharge: double
+congestion_surcharge: double
+pickup_location_id: int64
+dropoff_location_id: int64
year: int32
month: int32
-
-See $metadata for additional Schema metadata
")
```
@@ -271,7 +285,7 @@ ds %>%
```{r, echo = FALSE, eval = !file.exists("nyc-taxi")}
cat("
FileSystemDataset (query)
-passenger_count: int8
+passenger_count: int64
median_tip_pct: double
n: int32
@@ -312,19 +326,20 @@ percentage of rows from each batch:
sampled_data <- ds %>%
filter(year == 2015) %>%
select(tip_amount, total_amount, passenger_count) %>%
- map_batches(~ sample_frac(as.data.frame(.), 1e-4)) %>%
- mutate(tip_pct = tip_amount / total_amount)
+ map_batches(~ as_record_batch(sample_frac(as.data.frame(.), 1e-4))) %>%
+ mutate(tip_pct = tip_amount / total_amount) %>%
+ collect()
str(sampled_data)
```
```{r, echo = FALSE, eval = !file.exists("nyc-taxi")}
cat("
-'data.frame': 15603 obs. of 4 variables:
- $ tip_amount : num 0 0 1.55 1.45 5.2 ...
- $ total_amount : num 5.8 16.3 7.85 8.75 26 ...
- $ passenger_count: int 1 1 1 1 1 6 5 1 2 1 ...
- $ tip_pct : num 0 0 0.197 0.166 0.2 ...
+tibble [10,918 × 4] (S3: tbl_df/tbl/data.frame)
+ $ tip_amount : num [1:10918] 3 0 4 1 1 6 0 1.35 0 5.9 ...
+ $ total_amount : num [1:10918] 18.8 13.3 20.3 15.8 13.3 ...
+ $ passenger_count: int [1:10918] 3 2 1 1 1 1 1 1 1 3 ...
+ $ tip_pct : num [1:10918] 0.1596 0 0.197 0.0633 0.0752 ...
")
```
@@ -345,7 +360,8 @@ ds %>%
as.data.frame() %>%
mutate(pred_tip_pct = predict(model, newdata = .)) %>%
filter(!is.nan(tip_pct)) %>%
- summarize(sse_partial = sum((pred_tip_pct - tip_pct)^2), n_partial = n())
+ summarize(sse_partial = sum((pred_tip_pct - tip_pct)^2), n_partial = n()) %>%
+ as_record_batch()
}) %>%
summarize(mse = sum(sse_partial) / sum(n_partial)) %>%
pull(mse)
diff --git a/r/vignettes/fs.Rmd b/r/vignettes/fs.Rmd
index a0c92bb6be..6fb7e2d1af 100644
--- a/r/vignettes/fs.Rmd
+++ b/r/vignettes/fs.Rmd
@@ -1,8 +1,8 @@
---
-title: "Working with Cloud Storage (S3)"
+title: "Working with Cloud Storage (S3, GCS)"
output: rmarkdown::html_vignette
vignette: >
- %\VignetteIndexEntry{Working with Cloud Storage (S3)}
+ %\VignetteIndexEntry{Working with Cloud Storage (S3, GCS)}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
@@ -10,91 +10,152 @@ vignette: >
The Arrow C++ library includes a generic filesystem interface and specific
implementations for some cloud storage systems. This setup allows various
parts of the project to be able to read and write data with different storage
-backends. In the `arrow` R package, support has been enabled for AWS S3.
-This vignette provides an overview of working with S3 data using Arrow.
+backends. In the `arrow` R package, support has been enabled for AWS S3 and
+Google Cloud Storage (GCS). This vignette provides an overview of working with
+S3 and GCS data using Arrow.
-> In Windows and macOS binary packages, S3 support is included. On Linux when
-installing from source, S3 support is not enabled by default, and it has
+> In Windows and macOS binary packages, S3 and GCS support are included. On Linux when
+installing from source, S3 and GCS support is not always enabled by default, and it has
additional system requirements. See `vignette("install", package = "arrow")` for details.
+## Creating a FileSystem object
+
+One way of working with filesystems is to create `?FileSystem` objects.
+`?S3FileSystem` objects can be created with the `s3_bucket()` function, which
+automatically detects the bucket's AWS region. Similarly, `?GcsFileSystem` objects
+can be created with the `gs_bucket()` function. The resulting
+`FileSystem` will consider paths relative to the bucket's path (so for example
+you don't need to prefix the bucket path when listing a directory).
+
+With a `FileSystem` object, you can point to specific files in it with the `$path()` method
+and pass the result to file readers and writers (`read_parquet()`, `write_feather()`, et al.).
+For example, to read a parquet file from the example NYC taxi data
+(used in `vignette("dataset", package = "arrow")`):
+
+```r
+bucket <- s3_bucket("voltrondata-labs-datasets")
+# Or in GCS (anonymous = TRUE is required if credentials are not configured):
+bucket <- gs_bucket("voltrondata-labs-datasets", anonymous = TRUE)
+df <- read_parquet(bucket$path("nyc-taxi/year=2019/month=6/data.parquet"))
+```
+
+Note that this will be slower to read than if the file were local,
+though if you're running on a machine in the same AWS region as the file in S3,
+the cost of reading the data over the network should be much lower.
+
+You can list the files and/or directories in a bucket or subdirectory using
+the `$ls()` method:
+
+```r
+bucket$ls("nyc-taxi")
+# Or recursive:
+bucket$ls("nyc-taxi", recursive = TRUE)
+```
+
+**NOTE**: in GCS, you *should always* use `recursive = TRUE` as directories often don't appear in
+`$ls()` results.
+
+<!-- TODO: update GCS note above if ARROW-17097 is addressed -->
+
+See `help(FileSystem)` for a list of options that `s3_bucket()`/`S3FileSystem$create()`
+and `gs_bucket()`/`GcsFileSystem$create()` can take.
+
+The object that `s3_bucket()` and `gs_bucket()` return is technically a `SubTreeFileSystem`,
+which holds a path and a file system to which it corresponds. `SubTreeFileSystem`s can be
+useful for holding a reference to a subdirectory somewhere (on S3, GCS, or elsewhere).
+
+One way to get a subtree is to call the `$cd()` method on a `FileSystem`
+
+```r
+june2019 <- bucket$cd("2019/06")
+df <- read_parquet(june2019$path("data.parquet"))
+```
+
+`SubTreeFileSystem` can also be made from a URI:
+
+```r
+june2019 <- SubTreeFileSystem$create("s3://voltrondata-labs-datasets/nyc-taxi/2019/06")
+```
+
## URIs
-File readers and writers (`read_parquet()`, `write_feather()`, et al.)
-accept an S3 URI as the source or destination file,
-as do `open_dataset()` and `write_dataset()`.
+File readers and writers (`read_parquet()`, `write_feather()`, et al.) also
+accept a URI as the source or destination file, as do `open_dataset()` and `write_dataset()`.
An S3 URI looks like:
```
s3://[access_key:secret_key@]bucket/path[?region=]
```
+A GCS URI looks like:
+
+```
+gs://[access_key:secret_key@]bucket/path
+gs://anonymous@bucket/path
+```
+
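As a quick stdlib sketch of how such a URI decomposes (illustrative only; Arrow does its own URI parsing in C++):

```python
from urllib.parse import urlsplit

# Sketch: the pieces of a gs:// URI with the anonymous user, as used below.
uri = "gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet"
parts = urlsplit(uri)
print(parts.scheme)    # "gs"
print(parts.username)  # "anonymous" -> selects unauthenticated access
print(parts.hostname)  # "voltrondata-labs-datasets" (the bucket)
print(parts.path)      # "/nyc-taxi/year=2019/month=6/data.parquet"
```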
For example, one of the NYC taxi data files used in `vignette("dataset", package = "arrow")` is found at
```
-s3://ursa-labs-taxi-data/2019/06/data.parquet
+s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet
+# Or in GCS (anonymous required on public buckets):
+gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet
```
Given this URI, you can pass it to `read_parquet()` just as if it were a local file path:
```r
-df <- read_parquet("s3://ursa-labs-taxi-data/2019/06/data.parquet")
+df <- read_parquet("s3://voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet")
+# Or in GCS:
+df <- read_parquet("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet")
```
-Note that this will be slower to read than if the file were local,
-though if you're running on a machine in the same AWS region as the file in S3,
-the cost of reading the data over the network should be much lower.
-
-## Creating a FileSystem object
-
-Another way to connect to S3 is to create a `FileSystem` object once and pass
-that to the read/write functions.
-`S3FileSystem` objects can be created with the `s3_bucket()` function, which
-automatically detects the bucket's AWS region. Additionally, the resulting
-`FileSystem` will consider paths relative to the bucket's path (so for example
-you don't need to prefix the bucket path when listing a directory).
-This may be convenient when dealing with
-long URIs, and it's necessary for some options and authentication methods
-that aren't supported in the URI format.
+### URI options
-With a `FileSystem` object, you can point to specific files in it with the `$path()` method.
-In the previous example, this would look like:
+URIs accept additional options in the query parameters (the part after the `?`)
+that are passed down to configure the underlying file system. They are separated
+by `&`. For example,
-```r
-bucket <- s3_bucket("ursa-labs-taxi-data")
-df <- read_parquet(bucket$path("2019/06/data.parquet"))
+```
+s3://voltrondata-labs-datasets/?endpoint_override=https%3A%2F%2Fstorage.googleapis.com&allow_bucket_creation=true
```
-You can list the files and/or directories in an S3 bucket or subdirectory using
-the `$ls()` method:
+is equivalent to:
```r
-bucket$ls()
+fs <- S3FileSystem$create(
+ endpoint_override="https://storage.googleapis.com",
+ allow_bucket_creation=TRUE
+)
+fs$path("voltrondata-labs-datasets/")
```
-See `help(FileSystem)` for a list of options that `s3_bucket()` and `S3FileSystem$create()`
-can take. `region`, `scheme`, and `endpoint_override` can be encoded as query
-parameters in the URI (though `region` will be auto-detected in `s3_bucket()` or from the URI if omitted).
-`access_key` and `secret_key` can also be included,
-but other options are not supported in the URI.
+Both tell the `S3FileSystem` that it should allow the creation of new buckets and to
+talk to Google Storage instead of S3. The latter works because GCS implements an
+S3-compatible API--see [File systems that emulate S3](#file-systems-that-emulate-s3)
+below--but for better support for GCS use the `GcsFileSystem` with `gs://`. Also note
+that parameters in the URI need to be
+[percent encoded](https://en.wikipedia.org/wiki/Percent-encoding), which is why
+`://` is written as `%3A%2F%2F`.
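A stdlib sketch of building that percent-encoded query string (illustrative; any real connection still goes through Arrow's own URI handling):

```python
from urllib.parse import urlencode

# Sketch: construct the percent-encoded URI shown above.
# urlencode() percent-encodes "://" as %3A%2F%2F, as the text describes.
params = {
    "endpoint_override": "https://storage.googleapis.com",
    "allow_bucket_creation": "true",
}
uri = "s3://voltrondata-labs-datasets/?" + urlencode(params)
print(uri)
```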
-The object that `s3_bucket()` returns is technically a `SubTreeFileSystem`, which holds a path and a file system to which it corresponds. `SubTreeFileSystem`s can be useful for holding a reference to a subdirectory somewhere (on S3 or elsewhere).
+For S3, the only options that can be included in the URI as query parameters
+are `region`, `scheme`, `endpoint_override`, `access_key`, `secret_key`, `allow_bucket_creation`,
+and `allow_bucket_deletion`. For GCS, the supported parameters are `scheme`, `endpoint_override`,
+and `retry_limit_seconds`.
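A small sketch of the kind of allow-list check described above, using only the Python standard library; the option sets simply restate the lists in the paragraph, and the function name is invented (Arrow enforces this in C++):

```python
from urllib.parse import urlsplit, parse_qs

# Option names the text lists as supported in URI query strings.
S3_URI_OPTIONS = {"region", "scheme", "endpoint_override", "access_key",
                  "secret_key", "allow_bucket_creation", "allow_bucket_deletion"}
GCS_URI_OPTIONS = {"scheme", "endpoint_override", "retry_limit_seconds"}

def uri_options(uri, allowed):
    """Extract filesystem options from a URI's query string, rejecting unknowns."""
    query = urlsplit(uri).query
    opts = {k: v[0] for k, v in parse_qs(query).items()}
    unknown = set(opts) - allowed
    if unknown:
        raise ValueError(f"unsupported URI options: {sorted(unknown)}")
    return opts

opts = uri_options(
    "gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=10",
    GCS_URI_OPTIONS,
)
print(opts)  # {'retry_limit_seconds': '10'}
```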
-One way to get a subtree is to call the `$cd()` method on a `FileSystem`
+In GCS, a useful option is `retry_limit_seconds`, which sets the number of seconds
+a request may spend retrying before returning an error. The current default is
+15 minutes, so in many interactive contexts it's nice to set a lower value:
-```r
-june2019 <- bucket$cd("2019/06")
-df <- read_parquet(june2019$path("data.parquet"))
```
-
-`SubTreeFileSystem` can also be made from a URI:
-
-```r
-june2019 <- SubTreeFileSystem$create("s3://ursa-labs-taxi-data/2019/06")
+gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=10
```
## Authentication
+### S3 Authentication
+
To access private S3 buckets, you typically need two secret parameters:
an `access_key`, which is like a user id, and `secret_key`, which is like a token
or password. There are a few options for passing these credentials:
@@ -110,6 +171,31 @@ or password. There are a few options for passing these credentials:
- Use an [AccessRole](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html)
for temporary access by passing the `role_arn` identifier to `S3FileSystem$create()` or `s3_bucket()`.
+### GCS Authentication
+
+The simplest way to authenticate with GCS is to run the [gcloud](https://cloud.google.com/sdk/docs/)
+command to set up application default credentials:
+
+```
+gcloud auth application-default login
+```
+
+To manually configure credentials, you can pass either `access_token` and `expiration`, for using
+temporary tokens generated elsewhere, or `json_credentials`, to reference a downloaded
+credentials file.
+
+If you haven't configured credentials, then to access *public* buckets, you
+must pass `anonymous = TRUE` or `anonymous` as the user in a URI:
+
+```r
+bucket <- gs_bucket("voltrondata-labs-datasets", anonymous = TRUE)
+fs <- GcsFileSystem$create(anonymous = TRUE)
+df <- read_parquet("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/data.parquet")
+```
+
+<!-- TODO(ARROW-16880): Describe what credentials to use for particular use cases
+and how to integrate with gargle library. -->
+
## Using a proxy server
If you need to use a proxy server to connect to an S3 bucket, you can provide
@@ -117,7 +203,7 @@ a URI in the form `http://user:password@host:port` to `proxy_options`. For
example, a local proxy server running on port 1316 can be used like this:
```r
-bucket <- s3_bucket("ursa-labs-taxi-data", proxy_options = "http://localhost:1316")
+bucket <- s3_bucket("voltrondata-labs-datasets", proxy_options = "http://localhost:1316")
```
## File systems that emulate S3