[ 
https://issues.apache.org/jira/browse/ARROW-17097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-17097:
---------------------------------
    Fix Version/s: 11.0.0
                       (was: 10.0.0)

> [C++] GCS: report common prefixes as directories
> ------------------------------------------------
>
>                 Key: ARROW-17097
>                 URL: https://issues.apache.org/jira/browse/ARROW-17097
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 8.0.0, 9.0.0
>            Reporter: Will Jones
>            Priority: Major
>             Fix For: 11.0.0
>
>
> I was confused by the behavior differences between S3 and GCS, only to 
> realize that GCS reports only special directory markers as "directories", 
> not the common prefixes. This can make a directory look empty in GCS 
> when it in fact contains many subdirectories (see example below).
> We currently use the 
> [ListObjects|https://github.com/googleapis/google-cloud-cpp/blob/e0228233949873be9cba87ae4e37554e1ff0474d/google/cloud/storage/client.h#L974]
>  method, but perhaps it would be more appropriate to use the 
> [ListObjectsAndPrefixes|https://github.com/googleapis/google-cloud-cpp/blob/e0228233949873be9cba87ae4e37554e1ff0474d/google/cloud/storage/client.h#L1006].
>  Since the common prefixes are returned in the [same API 
> call|https://cloud.google.com/storage/docs/json_api/v1/objects/list] as the 
> objects, this shouldn't add much overhead.
> {code:r}
> library(arrow)
> bucket <- gs_bucket("voltrondata-labs-datasets", retry_limit_seconds = 3, 
> anonymous = TRUE)
> s3_bucket <- s3_bucket("voltrondata-labs-datasets", endpoint_override = 
> "https://storage.googleapis.com")
> # We did not create directory markers when uploading the data
> # https://github.com/apache/arrow/pull/11842#discussion_r764204767
> # The directory appears empty to GCSFileSystem...
> bucket$ls("nyc-taxi")
> #> character(0)
> # ... but S3FileSystem knows otherwise!
> s3_bucket$ls("nyc-taxi")
> #>  [1] "nyc-taxi/year=2009" "nyc-taxi/year=2010" "nyc-taxi/year=2011"
> #>  [4] "nyc-taxi/year=2012" "nyc-taxi/year=2013" "nyc-taxi/year=2014"
> #>  [7] "nyc-taxi/year=2015" "nyc-taxi/year=2016" "nyc-taxi/year=2017"
> #> [10] "nyc-taxi/year=2018" "nyc-taxi/year=2019" "nyc-taxi/year=2020"
> #> [13] "nyc-taxi/year=2021" "nyc-taxi/year=2022"
> # Using GCS API, we only get files!
> bucket$ls("nyc-taxi", recursive = TRUE)
> #>   [1] "nyc-taxi/year=2009/month=1/part-0.parquet" 
> #>   [2] "nyc-taxi/year=2009/month=10/part-0.parquet"
> #> ...
> #> [157] "nyc-taxi/year=2022/month=1/part-0.parquet" 
> #> [158] "nyc-taxi/year=2022/month=2/part-0.parquet"
> # Using S3 API, we can get directories!
> s3_bucket$ls("nyc-taxi", recursive = TRUE)
> #>   [1] "nyc-taxi/year=2009"                        
> #>   [2] "nyc-taxi/year=2009/month=1"                
> #>   [3] "nyc-taxi/year=2009/month=1/part-0.parquet" 
> #>   [4] "nyc-taxi/year=2009/month=10"               
> #>   [5] "nyc-taxi/year=2009/month=10/part-0.parquet"
> #>   [6] "nyc-taxi/year=2009/month=11"               
> #> ...
> #> [329] "nyc-taxi/year=2022/month=2"                
> #> [330] "nyc-taxi/year=2022/month=2/part-0.parquet"
> {code}
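
For readers unfamiliar with "common prefixes", the following is a minimal sketch of how a delimiter-based listing groups flat object keys into pseudo-directories. It is not Arrow's or google-cloud-cpp's implementation; the function name `list_with_delimiter` and its parameters are hypothetical, and the keys are a small subset of those shown in the report.

```python
from typing import Iterable, List, Tuple


def list_with_delimiter(
    keys: Iterable[str], prefix: str, delimiter: str = "/"
) -> Tuple[List[str], List[str]]:
    """Split flat object keys under `prefix` into direct objects and common prefixes.

    Hypothetical helper: it imitates the grouping an object store performs
    when a listing request specifies both a prefix and a delimiter.
    """
    if prefix and not prefix.endswith(delimiter):
        prefix += delimiter
    objects = []
    common_prefixes = set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if not rest:
            # The key equals the prefix itself, i.e. an explicit directory marker.
            continue
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # A delimiter remains after the prefix: the key lives in a "subdirectory".
            common_prefixes.add(prefix + head + delimiter)
        else:
            objects.append(key)
    return sorted(objects), sorted(common_prefixes)


# No directory marker objects exist, only data files.
keys = [
    "nyc-taxi/year=2009/month=1/part-0.parquet",
    "nyc-taxi/year=2009/month=10/part-0.parquet",
    "nyc-taxi/year=2022/month=1/part-0.parquet",
    "nyc-taxi/year=2022/month=2/part-0.parquet",
]
objects, prefixes = list_with_delimiter(keys, "nyc-taxi")
print(objects)   # direct children that are objects
print(prefixes)  # pseudo-directories derived from the keys
```

In this sketch the listing of `nyc-taxi` contains no direct objects, yet the common prefixes still reveal the `year=...` subdirectories. A file system that ignores the prefixes and looks only for marker objects would report the directory as empty, which matches the behavior described above.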



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
