Will Jones created ARROW-17097:
----------------------------------
Summary: [C++] GCS: report common prefixes as directories
Key: ARROW-17097
URL: https://issues.apache.org/jira/browse/ARROW-17097
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Affects Versions: 8.0.0, 9.0.0
Reporter: Will Jones
Fix For: 10.0.0
I was confused by the behavior differences between S3 and GCS, only to realize
that GCSFileSystem reports only special directory markers as "directories", not
the common prefixes. This can make a directory look empty in GCS when it in
fact contains many subdirectories (see example below).
We currently use the
[ListObjects|https://github.com/googleapis/google-cloud-cpp/blob/e0228233949873be9cba87ae4e37554e1ff0474d/google/cloud/storage/client.h#L974]
method, but perhaps it would be more appropriate to use
[ListObjectsAndPrefixes|https://github.com/googleapis/google-cloud-cpp/blob/e0228233949873be9cba87ae4e37554e1ff0474d/google/cloud/storage/client.h#L1006].
Since objects and common prefixes are returned in the [same API
call|https://cloud.google.com/storage/docs/json_api/v1/objects/list], it
shouldn't add much overhead.
{code:r}
library(arrow)
bucket <- gs_bucket("voltrondata-labs-datasets", retry_limit_seconds = 3,
                    anonymous = TRUE)
s3_bucket <- s3_bucket("voltrondata-labs-datasets",
                       endpoint_override = "https://storage.googleapis.com")
# We did not create directory markers when uploading the data
# https://github.com/apache/arrow/pull/11842#discussion_r764204767
# The directory appears empty to GCSFileSystem...
bucket$ls("nyc-taxi")
#> character(0)
# ... but S3FileSystem knows otherwise!
s3_bucket$ls("nyc-taxi")
#> [1] "nyc-taxi/year=2009" "nyc-taxi/year=2010" "nyc-taxi/year=2011"
#> [4] "nyc-taxi/year=2012" "nyc-taxi/year=2013" "nyc-taxi/year=2014"
#> [7] "nyc-taxi/year=2015" "nyc-taxi/year=2016" "nyc-taxi/year=2017"
#> [10] "nyc-taxi/year=2018" "nyc-taxi/year=2019" "nyc-taxi/year=2020"
#> [13] "nyc-taxi/year=2021" "nyc-taxi/year=2022"
# Using GCS API, we only get files!
bucket$ls("nyc-taxi", recursive = TRUE)
#> [1] "nyc-taxi/year=2009/month=1/part-0.parquet"
#> [2] "nyc-taxi/year=2009/month=10/part-0.parquet"
#> ...
#> [157] "nyc-taxi/year=2022/month=1/part-0.parquet"
#> [158] "nyc-taxi/year=2022/month=2/part-0.parquet"
# Using S3 API, we can get directories!
s3_bucket$ls("nyc-taxi", recursive = TRUE)
#> [1] "nyc-taxi/year=2009"
#> [2] "nyc-taxi/year=2009/month=1"
#> [3] "nyc-taxi/year=2009/month=1/part-0.parquet"
#> [4] "nyc-taxi/year=2009/month=10"
#> [5] "nyc-taxi/year=2009/month=10/part-0.parquet"
#> [6] "nyc-taxi/year=2009/month=11"
#> ...
#> [329] "nyc-taxi/year=2022/month=2"
#> [330] "nyc-taxi/year=2022/month=2/part-0.parquet"
{code}
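To make the proposal concrete, here is a minimal sketch of what "report common prefixes as directories" means for a non-recursive listing. It is not the Arrow implementation and does not call google-cloud-cpp; with the real API the server computes the prefixes when a delimiter is passed. The names Listing and ListWithCommonPrefixes are hypothetical, and the function simply emulates the delimiter logic locally over a flat list of object names.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical result type for this sketch; not an Arrow or google-cloud-cpp type.
struct Listing {
  std::vector<std::string> files;        // objects directly under the prefix
  std::vector<std::string> directories;  // common prefixes, trailing delimiter stripped
};

// Emulates a non-recursive, delimiter-based listing over a flat list of
// object names, treating every common prefix as a directory.
Listing ListWithCommonPrefixes(const std::vector<std::string>& object_names,
                               std::string prefix, char delimiter = '/') {
  assert(delimiter != '\0');  // a NUL delimiter cannot appear in object names

  // Normalize "nyc-taxi" to "nyc-taxi/" so that "nyc-taxi-extra/..." does not match.
  if (!prefix.empty() && prefix.back() != delimiter) prefix.push_back(delimiter);

  Listing out;
  std::set<std::string> seen;  // de-duplicates prefixes shared by many objects
  for (const std::string& name : object_names) {
    if (name.compare(0, prefix.size(), prefix) != 0) continue;  // outside the prefix
    if (name.size() == prefix.size()) continue;  // directory marker for the prefix itself
    const auto pos = name.find(delimiter, prefix.size());
    if (pos == std::string::npos) {
      out.files.push_back(name);  // no further delimiter: a direct child object
    } else if (seen.insert(name.substr(0, pos)).second) {
      // First time this common prefix is seen: report it as a directory.
      out.directories.push_back(name.substr(0, pos));
    }
  }
  return out;
}
```

Under these assumptions, listing the object names from the example with the prefix "nyc-taxi" would report "nyc-taxi/year=2009", "nyc-taxi/year=2010", and so on as directories even though no directory markers exist, which is the behavior S3FileSystem already shows. An explicit marker object such as "nyc-taxi/year=2009/" collapses to the same entry, so markers and common prefixes are reported uniformly.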