[ 
https://issues.apache.org/jira/browse/ARROW-17097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567490#comment-17567490
 ] 

Carlos O'Ryan commented on ARROW-17097:
---------------------------------------

The details are a bit fuzzy at the moment.  At a high level, every approach to 
simulating folders over GCS fails, just in different ways.  You can 
make something like `ListObjects()` return directory markers for common 
prefixes, but then calling `GetFileInfo()` on those markers will either fail 
or be very expensive.  In hindsight, I should have written a 
design doc outlining the tradeoffs and the decisions, but I did not realize 
when I started the project that the API (and its tests) would involve so many.
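
To make the tradeoff concrete, here is a minimal sketch of why looking up an 
implied directory is costly. It uses a hypothetical in-memory stand-in for a 
bucket; `get_file_info`, the lookup order, and the request counts are all 
assumptions for illustration, not the Arrow or google-cloud-cpp implementation.

```python
# Hypothetical flat namespace standing in for a bucket. GCS has no real
# folders, only object names that happen to contain "/".
OBJECTS = {
    "nyc-taxi/year=2009/month=1/part-0.parquet",
    "nyc-taxi/year=2009/month=10/part-0.parquet",
    "markers/",  # an explicit directory marker object
}


def get_file_info(path, objects=OBJECTS):
    """Return (kind, lookups); lookups counts the simulated requests made."""
    lookups = 1  # metadata lookup for the exact object name
    if path in objects:
        return "file", lookups

    prefix = path.rstrip("/") + "/"
    lookups += 1  # metadata lookup for an explicit marker object
    if prefix in objects:
        return "directory (marker)", lookups

    # Neither an object nor a marker exists, so the only way to tell an
    # implied directory from a missing path is a listing by prefix.
    lookups += 1
    if any(name.startswith(prefix) for name in objects):
        return "directory (implied)", lookups
    return "not found", lookups


print(get_file_info("nyc-taxi/year=2009"))
print(get_file_info("markers"))
```

In this sketch an implied directory and a missing path both pay for the 
extra listing step, which is the "fail or be expensive" choice described above.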

 

> [C++] GCS: report common prefixes as directories
> ------------------------------------------------
>
>                 Key: ARROW-17097
>                 URL: https://issues.apache.org/jira/browse/ARROW-17097
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 8.0.0, 9.0.0
>            Reporter: Will Jones
>            Priority: Major
>             Fix For: 10.0.0
>
>
> I got confused by the behavior differences between S3 and GCS, only to 
> realize GCS only reports special directory markers as "directories" and not 
> the common prefixes. This can have the effect of making a directory look 
> empty in GCS, when it in fact has many folders (see example below).
> We currently use the 
> [ListObjects|https://github.com/googleapis/google-cloud-cpp/blob/e0228233949873be9cba87ae4e37554e1ff0474d/google/cloud/storage/client.h#L974]
>  method, but perhaps it would be more appropriate to use 
> [ListObjectsAndPrefixes|https://github.com/googleapis/google-cloud-cpp/blob/e0228233949873be9cba87ae4e37554e1ff0474d/google/cloud/storage/client.h#L1006].
>  Since common prefixes are returned in the [same API 
> call|https://cloud.google.com/storage/docs/json_api/v1/objects/list], it 
> shouldn't add much overhead.
> {code:r}
> library(arrow)
> bucket <- gs_bucket("voltrondata-labs-datasets", retry_limit_seconds = 3, 
> anonymous = TRUE)
> s3_bucket <- s3_bucket("voltrondata-labs-datasets", endpoint_override = 
> "https://storage.googleapis.com")
> # We did not create directory markers when uploading the data
> # https://github.com/apache/arrow/pull/11842#discussion_r764204767
> # The directory appears empty to GCSFileSystem...
> bucket$ls("nyc-taxi")
> #> character(0)
> # ... but S3FileSystem knows otherwise!
> s3_bucket$ls("nyc-taxi")
> #>  [1] "nyc-taxi/year=2009" "nyc-taxi/year=2010" "nyc-taxi/year=2011"
> #>  [4] "nyc-taxi/year=2012" "nyc-taxi/year=2013" "nyc-taxi/year=2014"
> #>  [7] "nyc-taxi/year=2015" "nyc-taxi/year=2016" "nyc-taxi/year=2017"
> #> [10] "nyc-taxi/year=2018" "nyc-taxi/year=2019" "nyc-taxi/year=2020"
> #> [13] "nyc-taxi/year=2021" "nyc-taxi/year=2022"
> # Using GCS API, we only get files!
> bucket$ls("nyc-taxi", recursive = TRUE)
> #>   [1] "nyc-taxi/year=2009/month=1/part-0.parquet" 
> #>   [2] "nyc-taxi/year=2009/month=10/part-0.parquet"
> #> ...
> #> [157] "nyc-taxi/year=2022/month=1/part-0.parquet" 
> #> [158] "nyc-taxi/year=2022/month=2/part-0.parquet"
> # Using S3 API, we can get directories!
> s3_bucket$ls("nyc-taxi", recursive = TRUE)
> #>   [1] "nyc-taxi/year=2009"                        
> #>   [2] "nyc-taxi/year=2009/month=1"                
> #>   [3] "nyc-taxi/year=2009/month=1/part-0.parquet" 
> #>   [4] "nyc-taxi/year=2009/month=10"               
> #>   [5] "nyc-taxi/year=2009/month=10/part-0.parquet"
> #>   [6] "nyc-taxi/year=2009/month=11"               
> #> ...
> #> [329] "nyc-taxi/year=2022/month=2"                
> #> [330] "nyc-taxi/year=2022/month=2/part-0.parquet"
> {code}
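
The common-prefix behaviour that the S3 listing shows in the issue above can be 
sketched as a delimiter-based listing over flat object names. This is a 
simplified model under stated assumptions: `list_with_delimiter` is a 
hypothetical helper, not a google-cloud-cpp or Arrow function, and the object 
names are a small subset of those in the example.

```python
def list_with_delimiter(names, prefix, delimiter="/"):
    """Split names under `prefix` into direct objects and common prefixes."""
    objects, prefixes = [], set()
    for name in sorted(names):
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # The name continues past a delimiter, so only the first path
            # segment is reported, once, as a common prefix.
            prefixes.add(prefix + head + sep)
        else:
            # No further delimiter: the object sits directly under prefix.
            objects.append(name)
    return objects, sorted(prefixes)


names = [
    "nyc-taxi/year=2009/month=1/part-0.parquet",
    "nyc-taxi/year=2009/month=10/part-0.parquet",
    "nyc-taxi/year=2022/month=1/part-0.parquet",
    "nyc-taxi/year=2022/month=2/part-0.parquet",
]

objects, prefixes = list_with_delimiter(names, "nyc-taxi/")
print(objects)
print(prefixes)
```

With no marker objects in the bucket, `objects` is empty at the top level and 
only `prefixes` reveals the `year=...` folders, which matches the empty 
non-recursive GCS listing in the example.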



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
