[ 
https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17422845#comment-17422845
 ] 

Carlos O'Ryan commented on ARROW-1231:
--------------------------------------

{quote}
* Yes, it sounds good to break this down.
* Hmm, we can help on that once you have the tests working locally 
{quote}

Thanks, I will proceed accordingly.

{quote}
* Which one is the lightest/easiest to deploy and work with? (as an example, we 
use Minio for S3 testing)
{quote}

I wrote {{googleapis/storage-testbench}}, so take what I say with a pinch of 
salt:

* {{googleapis/storage-testbench}} supports both the JSON API and the parts of 
the XML API that the C++ client library uses.
* We (the team that maintains the GCS C++ client library) use 
{{googleapis/storage-testbench}}.
* To use {{fsouza/fake-gcs-server}} we would need to disable the XML API, which 
is a bit uncomfortable.
* {{fsouza/fake-gcs-server}} can persist data, {{googleapis/storage-testbench}} 
keeps all its data in memory.
* I imagine both are easy to install and run.  {{googleapis/storage-testbench}} 
is written in Python and can be installed via pip. {{fsouza/fake-gcs-server}} 
is written in Go; it has binaries for several platforms and Docker images.

I would prefer to use {{googleapis/storage-testbench}}.

{quote}
* Is there a convention for emulating directories? We already do this for S3, 
as other tools also do.
{quote}

I think there is more than one convention.  The most common one is to pretend 
that prefixes are directories, separated by {{/}}. You can ask GCS to list all 
the objects that start with {{foo/bar}}, and to return only the prefix 
{{foo/bar/baz/}} instead of each of the objects named {{foo/bar/baz/*}}.  This 
breaks down in two places.  First, non-recursive listing can be surprisingly 
slow, taking about as long as recursive listing (basically GCS scans its 
metadata and skips over the "subdirectories", but skipping can take a long 
time).  Second, you cannot {{stat(2)}} a prefix that does not have a backing 
object.  Some folks create sentinel objects to represent the directory, but 
there is no guarantee that a dataset created by another tool will contain 
those sentinels.
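
To make the prefix convention concrete, here is a minimal in-memory sketch. It does not call GCS or any client library; {{list_prefix}} and {{stat}} are hypothetical helper names invented for this illustration, and the object names are made up. It only shows how a flat namespace can be presented as directories, and how a "directory" can exist without any backing object.

```python
def list_prefix(object_names, prefix, delimiter="/"):
    """Emulate a non-recursive listing over a flat set of object names.

    Returns (objects, prefixes): objects directly "inside" `prefix`, and the
    "subdirectory" prefixes that stand in for everything nested deeper.
    Hypothetical helper, not part of any GCS API.
    """
    objects = []
    prefixes = set()
    for name in sorted(object_names):
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Something deeper exists: collapse it into a single prefix.
            prefixes.add(prefix + head + delimiter)
        else:
            objects.append(name)
    return objects, sorted(prefixes)


def stat(object_names, path, delimiter="/"):
    """Classify `path` as 'file', 'directory', or 'missing'.

    A 'directory' is any path that is a prefix of at least one object,
    whether or not a sentinel object was ever created for it.
    """
    if path in object_names and not path.endswith(delimiter):
        return "file"
    dir_prefix = path.rstrip(delimiter) + delimiter
    if any(name.startswith(dir_prefix) for name in object_names):
        return "directory"
    return "missing"


# Made-up object names; note there is no sentinel object for "foo/bar/baz/".
names = {
    "foo/bar/a.parquet",
    "foo/bar/baz/1.parquet",
    "foo/bar/baz/2.parquet",
    "foo/qux.txt",
}

objects, prefixes = list_prefix(names, "foo/bar/")
print(objects, prefixes)
print(stat(names, "foo/bar/baz"))
```

In this sketch {{stat}} has to scan the object names to decide whether a prefix is a directory, which mirrors why the lookup is awkward when no sentinel object exists.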


> [C++] Add filesystem / IO implementation for Google Cloud Storage
> -----------------------------------------------------------------
>
>                 Key: ARROW-1231
>                 URL: https://issues.apache.org/jira/browse/ARROW-1231
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Assignee: Carlos O'Ryan
>            Priority: Major
>              Labels: filesystem
>
> See example jumping off point
> https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
