[
https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17422845#comment-17422845
]
Carlos O'Ryan commented on ARROW-1231:
--------------------------------------
{quote}
* Yes, it sounds good to break this down.
* Hmm, we can help on that once you have the tests working locally
{quote}
Thanks, I will proceed accordingly.
{quote}
* Which one is the lightest/easiest to deploy and work with? (as an example, we
use Minio for S3 testing)
{quote}
I wrote {{googleapis/storage-testbench}}, so take what I say with a pinch of
salt:
* {{googleapis/storage-testbench}} supports both the JSON API and the parts of
the XML API that the C++ client library uses.
* We (the team that maintains the GCS C++ client library) use
{{googleapis/storage-testbench}}.
* To use {{fsouza/fake-gcs-server}} we would need to disable the XML API, which
is a bit uncomfortable.
* {{fsouza/fake-gcs-server}} can persist data, {{googleapis/storage-testbench}}
keeps all its data in memory.
* I imagine both are easy to install and run. {{googleapis/storage-testbench}}
is written in Python and can be installed via pip. {{fsouza/fake-gcs-server}}
is written in Go and has binaries for several platforms as well as Docker
images.
I would prefer to use {{googleapis/storage-testbench}}.
{quote}
* Is there a convention for emulating directories? We already do this for S3,
as other tools also do.
{quote}
I think there is more than one convention. The most common one is to pretend
that prefixes are directories, separated by {{/}}. You can ask GCS to list all
the objects whose names start with {{foo/bar}}, and to return only the prefix
{{foo/bar/baz/}}, rather than each object, if there are multiple objects named
{{foo/bar/baz/*}}. The convention has two weak spots. First, non-recursive
listing can be surprisingly slow, taking about as long as a recursive listing
(basically GCS scans its metadata and skips over the "subdirectories", but
skipping can take a long time). Second, there is nothing to {{stat(2)}} when a
prefix does not have a backing object. Some folks create sentinel objects to
represent the directory, but there is no guarantee that a dataset created by
another tool will contain those sentinels.
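To make the convention concrete, here is a minimal sketch in plain Python. It
is an in-memory model of prefix/delimiter listing, not GCS client code, and
the names {{list_objects}}, {{stat}} and the sample bucket contents are
hypothetical, made up for illustration. It shows how a "directory" appears
only as a collapsed prefix, and why a stat on such a prefix has to fall back
to scanning object names when no sentinel object exists.

```python
# Illustrative in-memory model only; this is NOT the GCS API.
# All function names and object names below are hypothetical.

def list_objects(names, prefix="", delimiter=None):
    """Return (objects, prefixes) the way a prefix/delimiter listing would."""
    objects, prefixes = [], set()
    for name in sorted(names):
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if delimiter and delimiter in rest:
            # Collapse everything below the first delimiter into one "directory".
            prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(name)
    return objects, sorted(prefixes)


def stat(names, path):
    """Classify a path as 'file', 'directory', or None (not found)."""
    if path in names:
        return "file"
    # No backing object: the only evidence of a "directory" is some object
    # whose name starts with path + "/", which takes a scan to discover.
    dir_prefix = path.rstrip("/") + "/"
    if any(n.startswith(dir_prefix) for n in names):
        return "directory"
    return None


bucket = {"foo/bar/a.parquet", "foo/bar/baz/b.parquet", "foo/bar/baz/c.parquet"}

# Recursive listing: every object under the prefix, no collapsed prefixes.
flat = list_objects(bucket, prefix="foo/bar/")
# Non-recursive listing: direct children plus collapsed "subdirectories".
shallow = list_objects(bucket, prefix="foo/bar/", delimiter="/")
```

In this model {{shallow}} reports {{foo/bar/baz/}} as a prefix even though no
object with that name exists, and {{stat(bucket, "foo/bar/baz")}} can only
answer "directory" by scanning names. A sentinel object named
{{foo/bar/baz/}} would make that lookup direct, but only for datasets whose
writer created it.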
> [C++] Add filesystem / IO implementation for Google Cloud Storage
> -----------------------------------------------------------------
>
> Key: ARROW-1231
> URL: https://issues.apache.org/jira/browse/ARROW-1231
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
> Assignee: Carlos O'Ryan
> Priority: Major
> Labels: filesystem
>
> See example jumping off point
> https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud