[
https://issues.apache.org/jira/browse/ARROW-17097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kouhei Sutou updated ARROW-17097:
---------------------------------
Fix Version/s: 11.0.0
(was: 10.0.0)
> [C++] GCS: report common prefixes as directories
> ------------------------------------------------
>
> Key: ARROW-17097
> URL: https://issues.apache.org/jira/browse/ARROW-17097
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 8.0.0, 9.0.0
> Reporter: Will Jones
> Priority: Major
> Fix For: 11.0.0
>
>
> I was confused by the behavior differences between S3 and GCS, only to
> realize that GCS reports only special directory markers as "directories" and
> not the common prefixes. This can make a directory look empty in GCS when it
> in fact contains many folders (see example below).
> We currently use the
> [ListObjects|https://github.com/googleapis/google-cloud-cpp/blob/e0228233949873be9cba87ae4e37554e1ff0474d/google/cloud/storage/client.h#L974]
> method, but perhaps it would be more appropriate to use
> [ListObjectsAndPrefixes|https://github.com/googleapis/google-cloud-cpp/blob/e0228233949873be9cba87ae4e37554e1ff0474d/google/cloud/storage/client.h#L1006].
> Since the common prefixes are returned in the [same API
> call|https://cloud.google.com/storage/docs/json_api/v1/objects/list] as the
> objects, it shouldn't add much overhead.
> {code:r}
> library(arrow)
> bucket <- gs_bucket("voltrondata-labs-datasets", retry_limit_seconds = 3,
>                     anonymous = TRUE)
> s3_bucket <- s3_bucket("voltrondata-labs-datasets",
>                        endpoint_override = "https://storage.googleapis.com")
> # We did not create directory markers when uploading the data
> # https://github.com/apache/arrow/pull/11842#discussion_r764204767
> # The directory appears empty to GCSFileSystem...
> bucket$ls("nyc-taxi")
> #> character(0)
> # ... but S3FileSystem knows otherwise!
> s3_bucket$ls("nyc-taxi")
> #> [1] "nyc-taxi/year=2009" "nyc-taxi/year=2010" "nyc-taxi/year=2011"
> #> [4] "nyc-taxi/year=2012" "nyc-taxi/year=2013" "nyc-taxi/year=2014"
> #> [7] "nyc-taxi/year=2015" "nyc-taxi/year=2016" "nyc-taxi/year=2017"
> #> [10] "nyc-taxi/year=2018" "nyc-taxi/year=2019" "nyc-taxi/year=2020"
> #> [13] "nyc-taxi/year=2021" "nyc-taxi/year=2022"
> # Using GCS API, we only get files!
> bucket$ls("nyc-taxi", recursive = TRUE)
> #> [1] "nyc-taxi/year=2009/month=1/part-0.parquet"
> #> [2] "nyc-taxi/year=2009/month=10/part-0.parquet"
> #> ...
> #> [157] "nyc-taxi/year=2022/month=1/part-0.parquet"
> #> [158] "nyc-taxi/year=2022/month=2/part-0.parquet"
> # Using S3 API, we can get directories!
> s3_bucket$ls("nyc-taxi", recursive = TRUE)
> #> [1] "nyc-taxi/year=2009"
> #> [2] "nyc-taxi/year=2009/month=1"
> #> [3] "nyc-taxi/year=2009/month=1/part-0.parquet"
> #> [4] "nyc-taxi/year=2009/month=10"
> #> [5] "nyc-taxi/year=2009/month=10/part-0.parquet"
> #> [6] "nyc-taxi/year=2009/month=11"
> #> ...
> #> [329] "nyc-taxi/year=2022/month=2"
> #> [330] "nyc-taxi/year=2022/month=2/part-0.parquet"
> {code}
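To illustrate what "reporting common prefixes as directories" means, here is a minimal, self-contained sketch of a delimiter-based listing over a flat list of object names. It does not call Arrow or google-cloud-cpp; the `Listing` struct and the `ListOneLevel` helper are hypothetical names used for illustration only, and the logic is only an approximation of what an object-store list call with a "/" delimiter returns.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Result of a one-level listing: objects directly under the prefix, plus the
// "common prefixes" that stand in for subdirectories.
struct Listing {
  std::vector<std::string> objects;
  std::set<std::string> prefixes;
};

// Hypothetical helper (not an Arrow or google-cloud-cpp API): emulates a
// delimiter-based listing over a flat, in-memory list of object names.
Listing ListOneLevel(const std::vector<std::string>& names,
                     const std::string& prefix, char delimiter = '/') {
  // The prefix must be empty or end with the delimiter, e.g. "nyc-taxi/".
  assert(prefix.empty() || prefix.back() == delimiter);
  Listing out;
  for (const auto& name : names) {
    // Skip names that do not start with the requested prefix.
    if (name.compare(0, prefix.size(), prefix) != 0) continue;
    const auto pos = name.find(delimiter, prefix.size());
    if (pos == std::string::npos) {
      // No further delimiter: the object sits directly under the prefix.
      out.objects.push_back(name);
    } else {
      // Everything up to and including the next delimiter is a common
      // prefix; the set removes duplicates.
      out.prefixes.insert(name.substr(0, pos + 1));
    }
  }
  return out;
}
```

With only the parquet object names from the example above and the prefix "nyc-taxi/", `objects` stays empty while `prefixes` holds one entry per "year=..." folder. A listing that ignores `prefixes` therefore makes the directory look empty even though no directory markers are missing from the data itself.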