jonkeane commented on a change in pull request #10546:
URL: https://github.com/apache/arrow/pull/10546#discussion_r677509801



##########
File path: r/vignettes/fs.Rmd
##########
@@ -128,3 +128,74 @@ 
s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000
 
 Among other applications, this can be useful for testing out code locally
 before running on a remote S3 bucket.
+
+## Non-AWS S3 cloud alternatives (DigitalOcean, IBM, Alibaba, and others)
+
+*This section adapts some elements from [Analyzing Room Temperature
+Data](https://www.jaredlander.com/2021/03/analyzing-room-temperature-data/#getting-the-data)
+by Jared Lander.*
+
+If you are using any S3-compatible storage provider, such as AWS, Alibaba,
+Ceph, DigitalOcean, Dreamhost, IBM COS, MinIO, or others, you can connect to it
+with `arrow` by using the `S3FileSystem` function, just as in the local MinIO
+example above. Note that DigitalOcean is used here only as an example; the same
+approach works with any other S3-compatible service.
+
+At the beginning of this vignette we used:
+
+```r
+june2019 <- SubTreeFileSystem$create("s3://ursa-labs-taxi-data/2019/06")
+```
+
+This connects to AWS; the same code can be adapted for other providers. For
+instructional purposes, we provide
+[nyc-taxi.sfo3.digitaloceanspaces.com](https://nyc-taxi.sfo3.digitaloceanspaces.com),
+a public space containing the NYC taxi data used in
+[Working with Arrow Datasets and dplyr](dataset.html).
+
+To connect to this space, you only need to adapt the code from the previous
+section:
+
+```r
+space <- arrow::S3FileSystem$create(
+  anonymous = TRUE,
+  scheme = "https",
+  endpoint_override = "sfo3.digitaloceanspaces.com"
+)
+```
+
+The space we are using allows anonymous access, but if you were to connect to
+a private space (i.e. one with sensitive data), you would need to provide
+credentials, say:
+
+```r
+space <- arrow::S3FileSystem$create(
+  access_key = Sys.getenv('DO_ARROW_TAXI_TOKEN'),
+  secret_key = Sys.getenv('DO_ARROW_TAXI_SECRET'),
+  scheme = "https",
+  endpoint_override = "sfo3.digitaloceanspaces.com"
+)
+```

Review comment:
       Instead of using DigitalOcean as our "here is an example of an 
alternative", let's lean into the MinIO example that is already there. That has 
a few nice advantages: 
   * anyone can install MinIO without needing to put data in a service
   * we can test against MinIO to confirm that this works
   * we don't need to maintain a new data source
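   For reference, a minimal sketch of what leaning into the MinIO example could look like, reusing the default `minioadmin` credentials and `localhost:9000` endpoint from the URI already shown in the vignette (the `/tmp/minio-data` path is an arbitrary illustrative choice):

   ```r
   # In a terminal, start a local MinIO server backed by a scratch directory:
   #   minio server /tmp/minio-data
   # Then connect from R with the default credentials, mirroring the URI
   # s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000
   library(arrow)

   minio <- S3FileSystem$create(
     access_key = "minioadmin",
     secret_key = "minioadmin",
     scheme = "http",
     endpoint_override = "localhost:9000"
   )
   ```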

##########
File path: r/vignettes/fs.Rmd
##########
@@ -128,3 +128,74 @@ 
s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000
 
 Among other applications, this can be useful for testing out code locally
 before running on a remote S3 bucket.
+
+## Non-AWS S3 cloud alternatives (DigitalOcean, IBM, Alibaba, and others)
+
+*This section adapts some elements from [Analyzing Room Temperature
+Data](https://www.jaredlander.com/2021/03/analyzing-room-temperature-data/#getting-the-data)
+by Jared Lander.*
+
+If you are using any S3-compatible storage provider, such as AWS, Alibaba,
+Ceph, DigitalOcean, Dreamhost, IBM COS, MinIO, or others, you can connect to it
+with `arrow` by using the `S3FileSystem` function, just as in the local MinIO
+example above. Note that DigitalOcean is used here only as an example; the same
+approach works with any other S3-compatible service.

Review comment:
       The list of providers is great; we should add it to the collapsed 
section in the proposed reorganization.

##########
File path: r/vignettes/fs.Rmd
##########
@@ -128,3 +128,74 @@ 
s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000
 
 Among other applications, this can be useful for testing out code locally
 before running on a remote S3 bucket.
+
+## Non-AWS S3 cloud alternatives (DigitalOcean, IBM, Alibaba, and others)
+
+*This section adapts some elements from [Analyzing Room Temperature
+Data](https://www.jaredlander.com/2021/03/analyzing-room-temperature-data/#getting-the-data)
+by Jared Lander.*
+
+If you are using any S3-compatible storage provider, such as AWS, Alibaba,
+Ceph, DigitalOcean, Dreamhost, IBM COS, MinIO, or others, you can connect to it
+with `arrow` by using the `S3FileSystem` function, just as in the local MinIO
+example above. Note that DigitalOcean is used here only as an example; the same
+approach works with any other S3-compatible service.
+
+At the beginning of this vignette we used:
+
+```r
+june2019 <- SubTreeFileSystem$create("s3://ursa-labs-taxi-data/2019/06")
+```
+
+This connects to AWS; the same code can be adapted for other providers. For
+instructional purposes, we provide
+[nyc-taxi.sfo3.digitaloceanspaces.com](https://nyc-taxi.sfo3.digitaloceanspaces.com),
+a public space containing the NYC taxi data used in
+[Working with Arrow Datasets and dplyr](dataset.html).
+
+To connect to this space, you only need to adapt the code from the previous
+section:
+
+```r
+space <- arrow::S3FileSystem$create(
+  anonymous = TRUE,
+  scheme = "https",
+  endpoint_override = "sfo3.digitaloceanspaces.com"
+)
+```
+
+The space we are using allows anonymous access, but if you were to connect to
+a private space (i.e. one with sensitive data), you would need to provide
+credentials, say:
+
+```r
+space <- arrow::S3FileSystem$create(
+  access_key = Sys.getenv('DO_ARROW_TAXI_TOKEN'),
+  secret_key = Sys.getenv('DO_ARROW_TAXI_SECRET'),
+  scheme = "https",
+  endpoint_override = "sfo3.digitaloceanspaces.com"
+)
+```
+
+To list the files in the space, you can type:
+
+```r
+space$ls('nyc-taxi', recursive = TRUE)
+```
+
+Just as with AWS, one way to get a subtree is to call the `$path()` method on
+a `FileSystem`:
+
+```r
+june2019 <- space$path("nyc-taxi/2019/06")
+df <- read_parquet(june2019$path("data.parquet"))
+```
+
+From here, the same example from the
+[Working with Arrow Datasets and dplyr](dataset.html) vignette can be
+completed with a single change:
+
+```r 
+copy_files(space$path("nyc-taxi/"), "nyc-taxi")
+```
+
+Instead of:
+
+```r
+copy_files("s3://ursa-labs-taxi-data", "nyc-taxi")
+```

Review comment:
       I don't think we necessarily need to re-hash all of the filesystem 
operations. How about, instead of all of these, we add to the MinIO example 
above:
   
   1. the command needed to run MinIO (`minio server {path}`) so that someone 
could run it more easily without having to learn how MinIO works
   1. copying a small subset of the taxi data (maybe a year or a few months 
from a single year)
   1. how to open a dataset (if the root of the MinIO path one is using is the 
root of the dataset, one might do `ds <- open_dataset(minio$path(""), 
partitioning = c("year", "month"))`, which is a bit counter-intuitive, so we 
should spell it out in the vignette)
   1. how to open a single Parquet file
   1. a more detailed description of the differences (and similarities) 
between `S3FileSystem$create` and the URI (in other words: show the 
correspondences between the elements of the URI and the arguments of 
`S3FileSystem$create`)
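   A hedged sketch of how those suggestions might fit together, assuming a local MinIO server is running with the default credentials; the `nyc-taxi` bucket name and local paths are illustrative choices, not something the PR defines:

   ```r
   library(arrow)

   # Connect to a local MinIO server started with `minio server {path}`,
   # using the default minioadmin/minioadmin credentials.
   minio <- S3FileSystem$create(
     access_key = "minioadmin",
     secret_key = "minioadmin",
     scheme = "http",
     endpoint_override = "localhost:9000"
   )

   # Copy a small subset of the taxi data (a single month) into a local bucket.
   copy_files("s3://ursa-labs-taxi-data/2019/06", minio$path("nyc-taxi/2019/06"))

   # Open the copied data as a dataset; when the path given is the dataset
   # root, the partitioning argument supplies the year/month structure.
   ds <- open_dataset(minio$path("nyc-taxi"), partitioning = c("year", "month"))

   # Or read a single Parquet file directly.
   df <- read_parquet(minio$path("nyc-taxi/2019/06/data.parquet"))

   # The URI form encodes the same information as the create() arguments:
   #   s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000
   #   access_key:secret_key@               -> access_key, secret_key
   #   ?scheme=http                         -> scheme
   #   endpoint_override=localhost%3A9000   -> endpoint_override (%3A is ":")
   ```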

##########
File path: r/vignettes/fs.Rmd
##########
@@ -128,3 +128,74 @@ 
s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000
 
 Among other applications, this can be useful for testing out code locally
 before running on a remote S3 bucket.
+
+## Non-AWS S3 cloud alternatives (DigitalOcean, IBM, Alibaba, and others)
+
+*This section adapts some elements from [Analyzing Room Temperature
+Data](https://www.jaredlander.com/2021/03/analyzing-room-temperature-data/#getting-the-data)
+by Jared Lander.*
+
+If you are using any S3-compatible storage provider, such as AWS, Alibaba,
+Ceph, DigitalOcean, Dreamhost, IBM COS, MinIO, or others, you can connect to it
+with `arrow` by using the `S3FileSystem` function, just as in the local MinIO
+example above. Note that DigitalOcean is used here only as an example; the same
+approach works with any other S3-compatible service.
+
+At the beginning of this vignette we used:
+
+```r
+june2019 <- SubTreeFileSystem$create("s3://ursa-labs-taxi-data/2019/06")
+```
+
+This connects to AWS; the same code can be adapted for other providers. For
+instructional purposes, we provide
+[nyc-taxi.sfo3.digitaloceanspaces.com](https://nyc-taxi.sfo3.digitaloceanspaces.com),
+a public space containing the NYC taxi data used in
+[Working with Arrow Datasets and dplyr](dataset.html).

Review comment:
       This DigitalOcean bucket is one you created, yeah? I'm not sure that we 
want to create a new storage location that we need to maintain on top of what 
we already have in S3. See below / the comment at the top about how we should 
reorganize this to avoid that.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

