[jira] [Commented] (FLINK-19481) Add support for a flink native GCS FileSystem

Jamie Grier (Jira) Sat, 15 May 2021 06:26:08 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17345032#comment-17345032
 ]


Jamie Grier commented on FLINK-19481:
-------------------------------------

The primary benefits of a native implementation are described earlier in this 
ticket.  That description is based on several years of my own production 
experience with the other Hadoop-based file systems, primarily the S3 one.

 
{noformat}
I think a native GCS filesystem would be a major benefit to Flink users.  The 
only way to support GCS currently is, as stated, through the Hadoop Filesystem 
implementation, which brings several problems along with it.  The two largest 
problems I've experienced are:

1) Hadoop has a huge dependency footprint, which is a significant headache for 
Flink application developers dealing with dependency hell.

2) The total stack of FileSystem abstractions on this path becomes very 
difficult to tune, understand, and support.  By stack I'm referring to Flink's 
own FileSystem abstraction, then the Hadoop layer, then the GCS libraries.  
This is very difficult to work with in production as each layer has its own 
intricacies, connection pools, thread pools, tunable configuration, versions, 
dependency versions, etc.

Having gone down this path with the old-style Hadoop+S3 filesystem approach, I 
know how difficult it can be, and a native implementation should prove to be 
much simpler to support and easier to tune and modify for performance.  This 
is why the presto-s3-fs filesystem was adopted, for example.
{noformat}

> Add support for a flink native GCS FileSystem
> ---------------------------------------------
>
>                 Key: FLINK-19481
>                 URL: https://issues.apache.org/jira/browse/FLINK-19481
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / FileSystem, FileSystems
>    Affects Versions: 1.12.0
>            Reporter: Ben Augarten
>            Priority: Minor
>              Labels: auto-deprioritized-major
>
> Currently, GCS is supported, but only by using the Hadoop connector [1].
>  
> The objective of this improvement is to add support for checkpointing to 
> Google Cloud Storage with the Flink File System.
>  
> This would allow the `gs://` scheme to be used for savepointing and 
> checkpointing. Long term, it would be nice if we could use the GCS FileSystem 
> as a source and sink in flink jobs as well. 
>  
> Long term, I hope that implementing a flink native GCS FileSystem will 
> simplify usage of GCS because the hadoop FileSystem ends up bringing in many 
> unshaded dependencies.
>  
> [1] 
> [https://github.com/GoogleCloudDataproc/hadoop-connectors|https://github.com/GoogleCloudDataproc/hadoop-connectors]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
