[ 
https://issues.apache.org/jira/browse/FLINK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346179#comment-17346179
 ] 

Jamie Grier commented on FLINK-19481:
-------------------------------------

Yes, [~xintongsong], that's my opinion based on experience. The runtime
complexity of having the additional Hadoop layer will likely be strictly
worse. This is because each layer has its own configuration and things like
thread pooling, pool sizes, buffering, and other non-trivial tuning
parameters.

 

It can be very difficult to tune this stuff for production workloads with
non-trivial throughput, and having all of those layers makes it (much) worse.
The configuration makes it a leaky abstraction, so you end up having to
understand, configure, and tune the Flink, Hadoop, and GCS layers anyway.
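
To make that concrete, here is the kind of thing I mean, using S3 since
that's where my experience is. This is a minimal flink-conf.yaml sketch; the
key names are from memory of the Flink and Hadoop s3a docs, so treat them as
illustrative rather than authoritative:

    # Flink layer: flink-s3-fs-hadoop mirrors "s3.*" keys onto the bundled
    # Hadoop s3a client as "fs.s3a.*", so these are really Hadoop knobs
    # surfaced through Flink's configuration.
    s3.connection.maximum: 128   # HTTP connection pool (fs.s3a.connection.maximum)
    s3.threads.max: 64           # upload thread pool (fs.s3a.threads.max)
    s3.multipart.size: 64M       # part size / write buffering (fs.s3a.multipart.size)

Each layer has its own pools and buffers, and none of them know about each
other. A GCS-over-Hadoop setup has the same shape, just with the connector's
"fs.gs.*" namespace on top.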

 

Again, this is based mostly on my experience with the various flavors of the
S3 connector, but it will still apply here. In my experience, the more native
the implementation (the fewer layers of abstraction), the better the result.

 

That said, I have not looked at Galen's PR. From reading the comments here,
though, it seems a good solution would be a hybrid of Ben's work on the
native GCS FileSystem combined with Galen's work on the RecoverableWriter.
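
To sketch what that hybrid would look like at the API level: one FileSystem
implementation that talks to GCS directly and exposes a RecoverableWriter for
exactly-once sinks. The interfaces below are Flink's real
org.apache.flink.core.fs types, but the GCS class is a hypothetical
placeholder I'm using for illustration, not code from either PR:

    import java.io.IOException;
    import java.net.URI;
    import org.apache.flink.core.fs.BlockLocation;
    import org.apache.flink.core.fs.FSDataInputStream;
    import org.apache.flink.core.fs.FSDataOutputStream;
    import org.apache.flink.core.fs.FileStatus;
    import org.apache.flink.core.fs.FileSystem;
    import org.apache.flink.core.fs.FileSystemKind;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.core.fs.RecoverableWriter;

    // Hypothetical native GCS FileSystem: talks to the GCS client library
    // directly, with no Hadoop layer in between.
    public class GcsFileSystem extends FileSystem {

        @Override
        public RecoverableWriter createRecoverableWriter() throws IOException {
            // This is the hook where the RecoverableWriter work plugs in:
            // an implementation that persists in-progress upload state in
            // checkpoints and can resume or commit it after a failure.
            throw todo("GcsRecoverableWriter");
        }

        // The rest of the FileSystem contract would be implemented directly
        // against the GCS client; stubbed here to keep the sketch compilable.
        @Override public Path getWorkingDirectory() { return new Path("gs:/"); }
        @Override public Path getHomeDirectory() { return new Path("gs:/"); }
        @Override public URI getUri() { return URI.create("gs:///"); }
        @Override public boolean isDistributedFS() { return true; }
        @Override public FileSystemKind getKind() { return FileSystemKind.OBJECT_STORE; }
        @Override public FileStatus getFileStatus(Path f) throws IOException { throw todo("getFileStatus"); }
        @Override public BlockLocation[] getFileBlockLocations(FileStatus file, long start, long len) throws IOException { throw todo("getFileBlockLocations"); }
        @Override public FSDataInputStream open(Path f, int bufferSize) throws IOException { throw todo("open"); }
        @Override public FSDataInputStream open(Path f) throws IOException { throw todo("open"); }
        @Override public FileStatus[] listStatus(Path f) throws IOException { throw todo("listStatus"); }
        @Override public boolean delete(Path f, boolean recursive) throws IOException { throw todo("delete"); }
        @Override public boolean mkdirs(Path f) throws IOException { throw todo("mkdirs"); }
        @Override public FSDataOutputStream create(Path f, WriteMode overwrite) throws IOException { throw todo("create"); }
        @Override public boolean rename(Path src, Path dst) throws IOException { throw todo("rename"); }

        private static UnsupportedOperationException todo(String what) {
            return new UnsupportedOperationException(what + ": sketch only");
        }
    }

The point is that with the native approach all of the pooling and buffering
behavior lives in one place that you can actually reason about and tune.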


> Add support for a flink native GCS FileSystem
> ---------------------------------------------
>
>                 Key: FLINK-19481
>                 URL: https://issues.apache.org/jira/browse/FLINK-19481
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / FileSystem, FileSystems
>    Affects Versions: 1.12.0
>            Reporter: Ben Augarten
>            Priority: Minor
>              Labels: auto-deprioritized-major
>
> Currently, GCS is supported, but only by using the Hadoop connector [1].
>  
> The objective of this improvement is to add support for checkpointing to
> Google Cloud Storage with Flink's FileSystem abstraction.
>  
> This would allow the `gs://` scheme to be used for savepointing and
> checkpointing. Long term, it would be nice if we could use the GCS FileSystem
> as a source and sink in Flink jobs as well.
>  
> Long term, I hope that implementing a Flink-native GCS FileSystem will
> simplify usage of GCS, because the Hadoop FileSystem ends up bringing in many
> unshaded dependencies.
>  
> [1] https://github.com/GoogleCloudDataproc/hadoop-connectors


