[
https://issues.apache.org/jira/browse/FLINK-11838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17280454#comment-17280454
]
Xintong Song commented on FLINK-11838:
--------------------------------------
[~galenwarren],
Thanks for explaining the design details. This is exactly what I'm looking for.
I'm trying to understand the reasons behind some of the design choices. Please
confirm whether I understand them correctly, and correct me if otherwise.
* The design proposes to build {{GCSFileSystem}} on top of
{{HadoopFileSystem}}. This is because we can reuse most of the
{{HadoopFileSystem}} implementation, leveraging the GCS-provided
{{GoogleHadoopFileSystem}} (see the first sketch after this list).
* For the {{RecoverableWriter}}, we cannot reuse {{HadoopRecoverableWriter}},
because:
** {{HadoopRecoverableWriter}} checks the URI scheme and the Hadoop version
** {{HadoopRecoverableWriter}} assumes files can be appended, which is true for
files on HDFS but not for immutable objects on GCS.
* The design proposes to leverage the GCS resumable upload feature (see the
second sketch after this list).
** The feature allows capturing the *write state* while the object is being
written, and resuming the write by restoring the captured *write state*.
** Both capturing and restoring must happen before the object is completely
written (and thus not yet visible for reading); once the write is completed,
the object becomes immutable.
** To use this feature, we need to persist an object
({{RestorableState<WriteChannel>}}) generated by {{capture()}}, which will be
used for {{restore()}} later. However, the implementation of
{{RestorableState<WriteChannel>}} is internal to GCS, and we do not have a good
way to serialize/deserialize it.
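For concreteness, here is a minimal sketch of the wrapping as I understand it.
It assumes Flink's {{HadoopFileSystem}} wrapper and its
{{createRecoverableWriter()}} override point; {{GCSRecoverableWriter}} is a
hypothetical placeholder for the writer proposed in the design.
{code:java}
import java.io.IOException;
import java.net.URI;

import org.apache.flink.core.fs.RecoverableWriter;
import org.apache.flink.runtime.fs.hdfs.HadoopFileSystem;
import org.apache.hadoop.conf.Configuration;

import com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem;

// Reuse HadoopFileSystem for all regular operations (list, read, delete, ...)
// and only swap in a GCS-native RecoverableWriter.
public class GCSFileSystem extends HadoopFileSystem {

    public GCSFileSystem(URI fsUri, Configuration hadoopConf) throws IOException {
        super(createGoogleHadoopFileSystem(fsUri, hadoopConf));
    }

    private static GoogleHadoopFileSystem createGoogleHadoopFileSystem(
            URI fsUri, Configuration hadoopConf) throws IOException {
        GoogleHadoopFileSystem ghfs = new GoogleHadoopFileSystem();
        ghfs.initialize(fsUri, hadoopConf); // e.g. gs://my-bucket/
        return ghfs;
    }

    @Override
    public RecoverableWriter createRecoverableWriter() throws IOException {
        // HadoopRecoverableWriter would reject the "gs" scheme and relies on
        // append support; return the GCS-specific writer instead.
        return new GCSRecoverableWriter(this); // hypothetical class
    }
}
{code}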
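And a sketch of the capture/restore flow with the google-cloud-storage Java
client, as I understand it from the design (bucket and object names are
placeholders):
{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import com.google.cloud.RestorableState;
import com.google.cloud.WriteChannel;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class ResumableUploadSketch {
    public static void main(String[] args) throws IOException {
        Storage storage = StorageOptions.getDefaultInstance().getService();
        BlobInfo blob = BlobInfo.newBuilder("my-bucket", "tmp/part-0-0").build();

        // Starts a resumable upload; the object is not visible until close().
        WriteChannel channel = storage.writer(blob);
        channel.write(ByteBuffer.wrap("first chunk".getBytes(StandardCharsets.UTF_8)));

        // Capture the write state, e.g. on checkpoint. This is the
        // RestorableState<WriteChannel> that would have to be persisted --
        // currently only possible via Java serialization.
        RestorableState<WriteChannel> state = channel.capture();

        // On recovery: resume the upload from the captured state.
        WriteChannel restored = state.restore();
        restored.write(ByteBuffer.wrap("second chunk".getBytes(StandardCharsets.UTF_8)));
        restored.close(); // completes the upload; the object becomes visible and immutable
    }
}
{code}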
If my understanding is correct, then I have a few questions.
# Does "writing a blob to a temporary location" mean that the user always needs
to configure a temporary location? How is the temporary location cleaned, say
if they're never moved to the committed location?
# Per [this doc|https://cloud.google.com/storage/docs/resumable-uploads], a
resumable upload must be completed within a week. This could be surprising for
users who try to restore a job from a checkpoint/savepoint after pausing it for
more than a week.
# Relying on Java serialization ties our compatibility to the compatibility of
GCS internals, which should be avoided if possible. Would it be possible to
work directly with the REST API and the session URI (a rough sketch follows
below)? IIUC, this is how the write channel works internally.
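To illustrate what I mean in question 3, a rough sketch of the raw REST flow
from the linked doc. Here only the session URI (a plain string) and the
committed offset would need to be persisted; the method names and token
handling are placeholders, not a proposal for the final code:
{code:java}
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ResumableSessionSketch {

    // Initiate a resumable upload; the session URI comes back in the
    // Location header and is just a string -- trivially persistable.
    static String startSession(String bucket, String object, String token)
            throws IOException {
        // Note: the object name must be URL-encoded in real code.
        URL url = new URL("https://storage.googleapis.com/upload/storage/v1/b/"
                + bucket + "/o?uploadType=resumable&name=" + object);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Authorization", "Bearer " + token);
        conn.setFixedLengthStreamingMode(0); // empty initiation body
        conn.setDoOutput(true);
        conn.getOutputStream().close();
        return conn.getHeaderField("Location");
    }

    // Upload one chunk at the given offset. Chunk sizes must be multiples of
    // 256 KiB, except for the final chunk; "/*" means total size still unknown.
    static int uploadChunk(String sessionUri, long offset, byte[] chunk)
            throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(sessionUri).openConnection();
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Range",
                "bytes " + offset + "-" + (offset + chunk.length - 1) + "/*");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(chunk);
        }
        // 308 = chunk accepted, upload incomplete; 200/201 = upload complete.
        return conn.getResponseCode();
    }
}
{code}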
> Create RecoverableWriter for GCS
> --------------------------------
>
> Key: FLINK-11838
> URL: https://issues.apache.org/jira/browse/FLINK-11838
> Project: Flink
> Issue Type: New Feature
> Components: Connectors / FileSystem
> Affects Versions: 1.8.0
> Reporter: Fokko Driesprong
> Assignee: Galen Warren
> Priority: Major
> Labels: pull-request-available, usability
> Fix For: 1.13.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> GCS supports resumable uploads, which we can use to create a RecoverableWriter
> similar to the S3 implementation:
> https://cloud.google.com/storage/docs/json_api/v1/how-tos/resumable-upload
> After using the Hadoop-compatible interface:
> https://github.com/apache/flink/pull/7519
> we've noticed that the current implementation relies heavily on renaming
> files on commit:
> https://github.com/apache/flink/blob/master/flink-filesystems/flink-hadoop-fs/src/main/java/org/apache/flink/runtime/fs/hdfs/HadoopRecoverableFsDataOutputStream.java#L233-L259
> This is suboptimal on an object store such as GCS. Therefore, we would like to
> implement a more GCS-native RecoverableWriter.