[ 
https://issues.apache.org/jira/browse/FLINK-9113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16440840#comment-16440840
 ] 

ASF GitHub Bot commented on FLINK-9113:
---------------------------------------

GitHub user twalthr opened a pull request:

    https://github.com/apache/flink/pull/5861

    [FLINK-9113] [connectors] Use raw local file system for bucketing sink to 
prevent data loss

    ## What is the purpose of the change
    
    This change replaces Hadoop's LocalFileSystem (a checksumming filesystem) 
with the RawLocalFileSystem implementation. To compute checksums, the default 
filesystem flushes only in 512-byte intervals, which can lead to data loss 
during checkpointing. To guarantee exact results, we skip the checksum 
computation and perform a raw flush.
    
    Negative effect: existing checksums are no longer maintained and thus 
become invalid.
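
The effect of flushing only in 512-byte intervals can be illustrated with a toy
model. The sketch below is not Hadoop code: `ChunkedFlushDemo`, `ChunkedStream`,
and `writeAndFlush` are hypothetical names, and the model simply assumes that a
checksumming stream forwards only complete 512-byte chunks when flushed. It
compares the length the writer believes is valid with the size found on disk.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Toy model only; this is NOT Hadoop's checksumming file system. */
class ChunkedFlushDemo {

    // Interval taken from the description above; everything else is assumed.
    static final int CHUNK_SIZE = 512;

    /** Forwards complete chunks only; a trailing partial chunk stays in memory. */
    static final class ChunkedStream extends OutputStream {
        private final OutputStream out;
        private final byte[] chunk = new byte[CHUNK_SIZE];
        private int filled = 0;

        ChunkedStream(OutputStream out) {
            this.out = out;
        }

        @Override
        public void write(int b) throws IOException {
            chunk[filled++] = (byte) b;
            if (filled == CHUNK_SIZE) {
                out.write(chunk, 0, CHUNK_SIZE);
                filled = 0;
            }
        }

        @Override
        public void flush() throws IOException {
            // The partial chunk is deliberately not written here.
            out.flush();
        }

        @Override
        public void close() throws IOException {
            // Only close() pushes out the remaining bytes.
            out.write(chunk, 0, filled);
            filled = 0;
            out.close();
        }
    }

    /** Writes n bytes, flushes, and returns {valid length, size on disk}. */
    static long[] writeAndFlush(int n, boolean chunked) {
        try {
            Path file = Files.createTempFile("bucket", ".part");
            OutputStream raw = Files.newOutputStream(file);
            OutputStream out = chunked ? new ChunkedStream(raw) : raw;
            try {
                out.write(new byte[n]);
                out.flush();
                // Measured before close(), as if the task failed right after a checkpoint.
                return new long[] {n, Files.size(file)};
            } finally {
                out.close();
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] c = writeAndFlush(1000, true);
        long[] r = writeAndFlush(1000, false);
        System.out.println("chunked: valid=" + c[0] + " onDisk=" + c[1]);
        System.out.println("raw:     valid=" + r[0] + " onDisk=" + r[1]);
    }
}
```

With 1000 bytes written, the chunked variant leaves only the first complete
512-byte chunk on disk after `flush()`, while the raw variant has all 1000
bytes. A checkpoint that recorded 1000 as the valid length would point past
the end of the file in the chunked case.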
    
    ## Brief change log
    
    - Replace the local filesystem with the raw filesystem
    
    
    ## Verifying this change
    
    Added a check for verifying the file length and file size.
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): no
      - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
      - The serializers: no
      - The runtime per-record code paths (performance sensitive): no
      - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: no
      - The S3 file system connector: no
    
    ## Documentation
    
      - Does this pull request introduce a new feature? no
      - If yes, how is the feature documented? not applicable


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/twalthr/flink FLINK-9113

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5861.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5861
    
----
commit 17b85bd5fd65e6ec31374df0ca0af7451881d90a
Author: Timo Walther <twalthr@...>
Date:   2018-04-17T13:12:55Z

    [FLINK-9113] [connectors] Use raw local file system for bucketing sink to 
prevent data loss

----


> Data loss in BucketingSink when writing to local filesystem
> -----------------------------------------------------------
>
>                 Key: FLINK-9113
>                 URL: https://issues.apache.org/jira/browse/FLINK-9113
>             Project: Flink
>          Issue Type: Bug
>          Components: Streaming Connectors
>            Reporter: Timo Walther
>            Assignee: Timo Walther
>            Priority: Blocker
>             Fix For: 1.5.0
>
>
> For local filesystems, it is not guaranteed that the data is flushed to disk 
> during checkpointing. This leads to data loss in cases of TaskManager 
> failures when writing to a local filesystem 
> {{org.apache.hadoop.fs.LocalFileSystem}}. The {{flush()}} method returns a 
> written length but the data is not written into the file (thus the valid 
> length might be greater than the actual file size). {{hsync}} and {{hflush}} 
> have no effect either.
> It seems that this behavior won't be fixed in the near future: 
> https://issues.apache.org/jira/browse/HADOOP-7844
> One solution would be to call {{close()}} at each checkpoint for local 
> filesystems, even though this would degrade performance. If we don't 
> fix this issue, we should at least document it properly.
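
The mismatch described in the issue, a valid length larger than the file on
disk, can be detected with a plain size comparison. The following is a minimal
sketch using only `java.nio.file`; it is not the check added in the pull
request, and `ValidLengthCheck`, `coversValidLength`, and `demo` are
hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper; not part of Flink or Hadoop. */
class ValidLengthCheck {

    /** Returns true if the file on disk holds at least validLength bytes. */
    static boolean coversValidLength(Path file, long validLength) {
        try {
            return Files.size(file) >= validLength;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes `written` bytes to a temp file and checks it against `validLength`. */
    static boolean demo(int written, long validLength) {
        try {
            Path file = Files.createTempFile("part", ".bin");
            try {
                Files.write(file, new byte[written]);
                return coversValidLength(file, validLength);
            } finally {
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(512, 1000));  // file shorter than the valid length
        System.out.println(demo(1000, 1000)); // file covers the valid length
    }
}
```

A file that fails this comparison cannot be restored to the recorded valid
length, which is the data-loss case the issue describes.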



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
