[
https://issues.apache.org/jira/browse/FLINK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346542#comment-17346542
]
Xintong Song commented on FLINK-19481:
--------------------------------------
{quote}The runtime complexity of having the additional Hadoop layer will likely
be strictly worse. This is because each layer has its own configuration and
things like thread pooling, pool sizes, buffering, and other non-trivial tuning
parameters.
{quote}
I'm not sure about this. Looking into o.a.f.runtime.fs.hdfs.HadoopFileSystem,
the Flink filesystem is practically a layer of API mappings around the Hadoop
filesystem. It might be true that the parameters to be tuned are separated into
different layers, but I wonder how much extra complexity, in terms of
additional parameters, is actually introduced by the extra layer. Shouldn't the
total number of parameters be the same?
{quote}In my experience the more native (fewer layers of abstraction) you can
achieve the better the result.
{quote}
I admit that, if we were building the GCS file system from the ground up, the
fewer layers the better:
# GCS SDK -> Hadoop FileSystem -> Flink FileSystem
# GCS SDK -> Flink FileSystem
However, we don't have to build everything from the ground up. For the first
path above, there are already off-the-shelf solutions for both mappings (the
Google connector for SDK -> Hadoop FS, and o.a.f.runtime.fs.hdfs.HadoopFileSystem
for Hadoop -> Flink). It requires almost no extra effort beyond assembling the
existing artifacts. The second path, on the other hand, requires implementing a
brand-new file system, which seems to be reinventing the wheel.
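The assembly in the first path is essentially the adapter pattern: the Flink
layer delegates every call to the Hadoop layer and owns no tuning parameters of
its own. Below is a minimal, self-contained sketch using hypothetical
simplified interfaces (these are illustrative stand-ins, not Flink's or
Hadoop's real classes):

```java
import java.util.Map;

// Illustrative sketch only: hypothetical simplified interfaces, not the real
// Flink or Hadoop APIs. It shows why path 1 is "just assembly": the outer
// layer is a thin delegating adapter and introduces no tuning parameters.
public class LayeringSketch {

    /** Stand-in for the Hadoop FileSystem API (hypothetical). */
    interface HadoopStyleFs {
        byte[] open(String path);
        Map<String, String> getConf(); // all tuning knobs live here
    }

    /** Stand-in for the Flink FileSystem API (hypothetical). */
    interface FlinkStyleFs {
        byte[] read(String path);
    }

    /** The adapter layer: pure API mapping, no extra state or configuration. */
    static final class HadoopFsAdapter implements FlinkStyleFs {
        private final HadoopStyleFs delegate;

        HadoopFsAdapter(HadoopStyleFs delegate) {
            this.delegate = delegate;
        }

        @Override
        public byte[] read(String path) {
            // Direct delegation: no new buffers, thread pools, or parameters.
            return delegate.open(path);
        }
    }

    public static void main(String[] args) {
        // A fake "GCS connector" standing in for the Hadoop-side implementation.
        HadoopStyleFs gcsConnector = new HadoopStyleFs() {
            @Override
            public byte[] open(String path) {
                return ("contents-of:" + path).getBytes();
            }

            @Override
            public Map<String, String> getConf() {
                // Hypothetical key, shown only to illustrate where tuning lives.
                return Map.of("fs.gs.block.size", "67108864");
            }
        };

        FlinkStyleFs flinkFs = new HadoopFsAdapter(gcsConnector);
        System.out.println(new String(flinkFs.read("gs://bucket/obj")));
    }
}
```

In the real stack, o.a.f.runtime.fs.hdfs.HadoopFileSystem plays the role of
the adapter, and the Google connector's Hadoop FileSystem implementation is
the delegate.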
{quote}It seems from reading the comments here though that a good solution
would be a hybrid of Ben's work on the native GCS Filesystem combined with
Galen's work on the RecoverableWriter.
{quote}
Unless there are further arguments for why we should have a native GCS file
system, I'm leaning towards not introducing such a native implementation,
based on the discussion so far.
> Add support for a flink native GCS FileSystem
> ---------------------------------------------
>
> Key: FLINK-19481
> URL: https://issues.apache.org/jira/browse/FLINK-19481
> Project: Flink
> Issue Type: Improvement
> Components: Connectors / FileSystem, FileSystems
> Affects Versions: 1.12.0
> Reporter: Ben Augarten
> Priority: Minor
> Labels: auto-deprioritized-major
>
> Currently, GCS is supported, but only via the Hadoop connector [1].
>
> The objective of this improvement is to add support for checkpointing to
> Google Cloud Storage with the Flink FileSystem.
>
> This would allow the `gs://` scheme to be used for savepointing and
> checkpointing. Long term, it would be nice if we could use the GCS FileSystem
> as a source and sink in Flink jobs as well.
>
> Long term, I hope that implementing a Flink-native GCS FileSystem will
> simplify usage of GCS, because the Hadoop FileSystem ends up bringing in many
> unshaded dependencies.
>
> [1] [https://github.com/GoogleCloudDataproc/hadoop-connectors]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)