[ 
https://issues.apache.org/jira/browse/FLINK-3637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206174#comment-15206174
 ] 

ASF GitHub Bot commented on FLINK-3637:
---------------------------------------

GitHub user dalegaard opened a pull request:

    https://github.com/apache/flink/pull/1826

    Refactor rolling sink writer

    Implements FLINK-3637 by changing the Writer interface so that the Writer
    handles the output stream itself, instead of the RollingSink handling it.
    This makes it possible to write output formats such as ORC and Parquet.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dalegaard/flink refactor_rolling_sink_writer

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1826.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1826
    
----
commit af7ac4f7ff1713951f2e9e3465761aafd26cf73a
Author: Lasse Dalegaard <[email protected]>
Date:   2016-03-22T09:24:11Z

    [FLINK-3637] Refactored output stream handling from RollingSink.
    
    The Writer interface now deals directly with filesystem and path, rather
    than the raw output stream.
    
    Since the RollingSink no longer has access to the raw output stream, it
    cannot directly determine the current size of the file. A getPos()
    method has been added to the Writer interface so that the RollingSink
    can retrieve the current file size.
    
    Finally, flush() has been extended to return the offset that the file
    must be truncated to at recovery.

commit 0a1373a21b3f6275ca53b28602453f5fd5e81a64
Author: Lasse Dalegaard <[email protected]>
Date:   2016-03-22T09:24:43Z

    [FLINK-3637] Updated writer implementations for new interface.
    
    SequenceFileWriter and StringWriter are both simple outputs that work
    directly on a file in HDFS, using the logic that used to reside in
    RollingSink. This logic has been moved into a new class,
    SimpleBaseWriter, which both StringWriter and SequenceFileWriter extend.

----
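
As a rough illustration of the interface change described in the commit
messages above, here is a minimal sketch. It is not the code from the pull
request: it uses java.nio.file in place of Hadoop's FileSystem and Path so
that it is self-contained, and the names SketchWriter and LineWriter, as well
as every method signature, are assumptions made for this example.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for the refactored Writer interface: the writer is
// handed a path and manages its own output stream.
interface SketchWriter<T> {
    void open(Path path) throws IOException;

    void write(T element) throws IOException;

    // Flushes pending data and returns the offset up to which the file is
    // valid, i.e. the length to truncate to at recovery.
    long flush() throws IOException;

    // Current size of the file, so the sink can decide when to roll.
    long getPos() throws IOException;

    void close() throws IOException;
}

// Simple newline-delimited text writer that tracks its own position.
class LineWriter implements SketchWriter<String> {
    private OutputStream out;
    private long pos;

    @Override
    public void open(Path path) throws IOException {
        // Creates the file, or truncates it if it already exists.
        out = Files.newOutputStream(path);
        pos = 0;
    }

    @Override
    public void write(String element) throws IOException {
        byte[] bytes = (element + "\n").getBytes(StandardCharsets.UTF_8);
        out.write(bytes);
        pos += bytes.length;
    }

    @Override
    public long flush() throws IOException {
        out.flush();
        return pos;
    }

    @Override
    public long getPos() {
        return pos;
    }

    @Override
    public void close() throws IOException {
        if (out != null) {
            out.close();
            out = null;
        }
    }
}
```

In this sketch the sink never sees the stream; it only calls getPos() to
check the file size and keeps the value returned by flush() as the valid
length at checkpoint time.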


> Change RollingSink Writer interface to allow wider range of outputs
> -------------------------------------------------------------------
>
>                 Key: FLINK-3637
>                 URL: https://issues.apache.org/jira/browse/FLINK-3637
>             Project: Flink
>          Issue Type: Improvement
>          Components: Streaming Connectors
>            Reporter: Lasse Dalegaard
>            Assignee: Lasse Dalegaard
>              Labels: features
>
> Currently the RollingSink Writer interface only works with 
> FSDataOutputStreams, which precludes it from being used with some existing 
> libraries like Apache ORC and Parquet.
> To fix this, a new Writer interface can be created, which receives FileSystem 
> and Path objects, instead of FSDataOutputStream.
> To ensure exactly-once semantics, the Writer interface must also be extended 
> so that the current write-offset can be retrieved at checkpointing time. For 
> formats like ORC this requires a footer to be written before the offset is 
> returned. Checkpointing already calls flush on the writer, so either flush 
> needs to return the current length of the output file, or a new 
> method has to be added for this.
> The existing Writer interface can be recreated with a wrapper on top of the 
> new Writer interface. The existing code that manages the FSDataOutputStream 
> can then be moved into this new wrapper.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
