[
https://issues.apache.org/jira/browse/FLINK-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391647#comment-15391647
]
ASF GitHub Bot commented on FLINK-4190:
---------------------------------------
Github user aljoscha commented on the issue:
https://github.com/apache/flink/pull/2269
Not really, the reason for having an ITCase is just that it actually
exercises the sink embedded in a proper Flink job, which might surface
interactions that were overlooked when writing a test case. I think with
proper test cases we could do completely without an ITCase, which would also
improve testing time.
The purpose of the `MiniDFSCluster` is to test the sink against an actual
HDFS cluster, because some functionality, such as truncate support, is only
exercised when running against HDFS.
tl;dr Go for it and move everything to a test case. 😃
> Generalise RollingSink to work with arbitrary buckets
> -----------------------------------------------------
>
> Key: FLINK-4190
> URL: https://issues.apache.org/jira/browse/FLINK-4190
> Project: Flink
> Issue Type: Improvement
> Components: filesystem-connector, Streaming Connectors
> Reporter: Josh Forman-Gornall
> Assignee: Josh Forman-Gornall
> Priority: Minor
>
> The current RollingSink implementation appears to be intended for writing to
> directories that are bucketed by system time (e.g. minutely) and to write to
> only one file within one bucket at any point in time. When the system
> time determines that the current bucket should be changed, the current bucket
> and file are closed and a new bucket and file are created. The sink cannot be
> used for the more general problem of writing to arbitrary buckets, perhaps
> determined by an attribute of the element/tuple being processed.
> There are three limitations which prevent the existing sink from being used
> for more general problems:
> - Only bucketing by the current system time is supported, and not by e.g. an
> attribute of the element being processed by the sink.
> - Whenever the sink sees a change in the bucket being written to, it flushes
> the file and moves on to the new bucket. Therefore the sink cannot have more
> than one bucket/file open at a time. Additionally, the checkpointing mechanics
> only support saving the state of one active bucket and file.
> - The sink determines that it should 'close' an active bucket and file when
> the bucket path changes. We need another way to determine when a bucket has
> become inactive and needs to be closed.
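The generalisation described above could be sketched roughly as follows. This is an illustrative sketch only, not the actual Flink API: `Bucketer`, `UserIdBucketer`, and `BucketTracker` are hypothetical names. The idea is that the bucket path is derived from an attribute of the element rather than from system time, and per-bucket activity timestamps give the sink a way to decide when an inactive bucket should be flushed and closed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical pluggable bucketing strategy: maps an element to a bucket path.
interface Bucketer<T> {
    String getBucketPath(T element);
}

// Example element type with an attribute to bucket by.
class Event {
    final String userId;
    final long timestamp;

    Event(String userId, long timestamp) {
        this.userId = userId;
        this.timestamp = timestamp;
    }
}

// Buckets by an attribute of the element instead of the current system time.
class UserIdBucketer implements Bucketer<Event> {
    public String getBucketPath(Event e) {
        return "/data/user=" + e.userId;
    }
}

// Tracks when each bucket was last written to, so buckets that have been
// inactive for longer than a threshold can be closed, independently of
// whether a newer bucket has since been opened.
class BucketTracker {
    private final Map<String, Long> lastActive = new HashMap<>();
    private final long inactivityThresholdMs;

    BucketTracker(long inactivityThresholdMs) {
        this.inactivityThresholdMs = inactivityThresholdMs;
    }

    void recordWrite(String bucketPath, long nowMs) {
        lastActive.put(bucketPath, nowMs);
    }

    boolean isInactive(String bucketPath, long nowMs) {
        Long last = lastActive.get(bucketPath);
        return last != null && nowMs - last > inactivityThresholdMs;
    }
}
```

With this shape the sink can keep several buckets open at once (one writer per active bucket path) and periodically ask the tracker which of them have gone quiet.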
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)