coryan commented on a change in pull request #11842:
URL: https://github.com/apache/arrow/pull/11842#discussion_r764204767
##########
File path: cpp/src/arrow/filesystem/gcsfs.h
##########
@@ -40,24 +40,43 @@ struct ARROW_EXPORT GcsOptions {
bool Equals(const GcsOptions& other) const;
};
+/// - TODO(ARROW-1231) - review this documentation before closing the bug.
/// \brief GCS-backed FileSystem implementation.
///
-/// Some implementation notes:
-/// - TODO(ARROW-1231) - review all the notes once completed.
-/// - buckets are treated as top-level directories on a "root".
-/// - GCS buckets are in a global namespace, only one bucket
-/// named `foo` exists in Google Cloud.
-/// - Creating new top-level directories is implemented by creating
-/// a bucket, this may be a slower operation than usual.
-/// - A principal (service account, user, etc) can only list the
-/// buckets for a single project, but can access the buckets
-/// for many projects. It is possible that listing "all"
-/// the buckets returns fewer buckets than you have access to.
-/// - GCS does not have directories, they are emulated in this
-/// library by listing objects with a common prefix.
-/// - In general, GCS has much higher latency than local filesystems.
-/// The throughput of GCS is comparable to the throughput of
-/// a local file system.
+/// GCS (Google Cloud Storage - https://cloud.google.com/storage) is an scalable object
+/// storage system for any amount of data. The main abstractions in GCS are buckets and
+/// objects. A bucket is a namespace for objects, buckets can store any number of objects,
+/// tens of millions and even billions is not uncommon. Each object contains a single
+/// blob of data, up to 5TiB in size. Buckets are typically configured to keep a single
+/// version of each object, but versioning can be enabled. Versioning is important because
+/// objects are immutable, once created one cannot append data to the object or modify the
+/// object data in any way.
+///
+/// GCS buckets are in a global namespace, if a Google Cloud customer creates a bucket
+/// named `foo` no other customer can create a bucket with the same name. Note that a
+/// principal (a user or service account) may only list the buckets they have entitled to,
+/// and then only within a project. It is not possible to list "all" the buckets.
+///
+/// Within each bucket objects are in flat namespace. GCS does not have folders or
+/// directories. However, following some conventions it is possible to emulate
+/// directories. To this end this class:
+///
+/// - All buckets are treated as directories at the "root"
+/// - Creating a root directory results in a new bucket being created, this may be slower
+/// than most GCS operations.
+/// - Any object with a name ending with a slash (`/`) character is treated as a
+/// directory.
+/// - The class creates marker objects for a directory, using a trailing slash in the
+/// marker names. For debugging purposes, the metadata and contents of these marker
+/// objects indicate that they are markers created by this class. The class does
Review comment:
Kind of? The UI for Google Cloud creates empty objects ending with
`/`. The command-line utility does no such thing (e.g. `gsutil cp -r
deep-directory/ gs://my-bucket/foo` would not create these markers). The
client libraries do not create them either. Whether that amounts to a de facto
standard, I cannot say.
Frankly the use of "folder emulation" causes more harm than good. It
creates the impression that some things should work (e.g. directory renames,
directory permissions, efficient non-recursive listing) when they don't. And
the markers are sort of useless. If you want to list objects non-recursively,
the API has native support for including any matching prefixes in the results,
without the need for these markers :shrug:
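
To illustrate the last point, here is a minimal, self-contained sketch of what a delimiter-based listing does over a flat namespace. It is not the GCS API and not code from this PR: `Listing` and `ListWithDelimiter` are hypothetical names, and the "bucket" is just an in-memory sorted set of object names. The aim is only to show that the "subdirectory" prefixes can be derived from object names alone, with no marker objects present.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical result type, for illustration only.
struct Listing {
  std::vector<std::string> objects;   // names directly "under" the prefix
  std::vector<std::string> prefixes;  // collapsed "subdirectory" prefixes
};

// Emulates a non-recursive listing over a flat, sorted set of object names.
// Names that contain `delimiter` after `prefix` are collapsed into a single
// prefix entry; all other matching names are returned as objects.
Listing ListWithDelimiter(const std::set<std::string>& names,
                          const std::string& prefix, char delimiter = '/') {
  Listing result;
  for (auto it = names.lower_bound(prefix); it != names.end(); ++it) {
    const std::string& name = *it;
    // Names are sorted, so the first non-matching name ends the range.
    if (name.compare(0, prefix.size(), prefix) != 0) break;
    const auto pos = name.find(delimiter, prefix.size());
    if (pos == std::string::npos) {
      result.objects.push_back(name);
      continue;
    }
    const std::string collapsed = name.substr(0, pos + 1);
    // Names sharing a collapsed prefix are contiguous in sorted order, so
    // comparing against the last entry is enough to avoid duplicates.
    if (result.prefixes.empty() || result.prefixes.back() != collapsed) {
      result.prefixes.push_back(collapsed);
    }
  }
  return result;
}
```

With the names `foo/a.txt`, `foo/bar/b.txt`, `foo/bar/c.txt`, and `foo/baz/d.txt`, listing under `foo/` yields the object `foo/a.txt` and the prefixes `foo/bar/` and `foo/baz/`, even though no `foo/bar/` or `foo/baz/` marker exists.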