Re: [DISCUSS] Table API / SQL internal timestamp handling

2017-07-25 Thread Jark Wu
Hi Xingcan,

IMO, I don't think the event-time of join results can be automatically
decided by the system. Considering batch tables, if users want an event-time
window aggregation after a join, they must specify the time field explicitly
(T1.rowtime, T2.rowtime, or a result computed from them). So in the case
of streaming tables, the system also can't automatically decide the time
field for users.

Regarding the question you asked, I think we don't need to change the
watermark no matter whether we choose the left rowtime, the right rowtime, or
a combination of them, because the watermark has already been aligned with
the rowtime at the source. Maybe I'm wrong about this; please correct me if
I'm missing something.

What do you think?

Regards,
Jark

2017-07-26 11:24 GMT+08:00 Xingcan Cui :

> Hi all,
>
> @Fabian, thanks for raising this.
>
> @Radu and Jark, personally I think the timestamp field is critical for
> query processing
> and thus should be declared as (or supposed to be) NOT NULL. In addition, I
> think the
> event-time semantics of the join results should be automatically decided by
> the system,
> i.e., we do not hand it over to users, so as to avoid unpredictable
> assignments.
>
> Generally speaking, consolidating different time fields is possible since
> all of them
> should ideally be monotonically increasing. From my point of view, the
> problem lies in
> (1) what the relationship between the old and new watermarks is. Should they
> be a one-to-one
> mapping, or could the new watermarks skip some timestamps? And (2) who is in
> charge of
> emitting the blocked watermarks, the operator or the process function?
>
> I'd like to hear from you.
>
> Best,
> Xingcan
>
>
>
> On Wed, Jul 26, 2017 at 10:40 AM, Jark Wu  wrote:
>
> > Hi,
> >
> > Radu's concerns make sense to me, especially the null value timestamp and
> > multi-proctime.
> >
> > I also have something in mind. I would like to propose some time
> > indicator built-in functions, e.g. ROW_TIME(Timestamp ts) will generate an
> > event-time logical attribute, PROC_TIME() will generate a processing-time
> > logical attribute. It is similar to TUMBLE_ROWTIME proposed in this PR
> > https://github.com/apache/flink/pull/4199. These can be used in any
> > queries, but there still can't be more than one rowtime attribute or more
> > than one proctime attribute in a table schema.
> >
> > Both selected timestamp fields from a JOIN query will be materialized.
> > If someone needs further downstream computation based on the event time,
> > they need to create a new time attribute using the ROW_TIME(...) function.
> > And this can also solve the null timestamp problem in LEFT JOIN, because
> > we can use a user-defined function to combine the two rowtimes and make
> > the result the event-time attribute, e.g. SELECT ROW_TIME(udf(T1.rowtime,
> > T2.rowtime)) as rowtime FROM T1 JOIN T2 ...
> >
> >
> > What do you think?
> >
> >
> > 2017-07-25 23:48 GMT+08:00 Radu Tudoran :
> >
> > > Hi,
> > >
> > > I think this is an interesting discussion and I would like to add some
> > > issues and give some feedback.
> > >
> > > - For supporting the join we not only need to think of the time but
> > > also of the null values. For example, if you have a LEFT (or RIGHT) JOIN
> > > between items of 2 input streams, and the secondary input is not
> > available
> > > you should still emit Row.of(event1, null)...as far as I know if you
> need
> > > to serialize/deserialize null values to send them they do not work. So
> we
> > > should include this scenario in the discussions
> > > - If we have multiple timestamps in an (output) event, one question
> is
> > > how to select afterwards which is the primary time field on which to
> > > operate. When we describe a query we might be able to specify (or we
> get
> > > this implicitly if we implement the carryon of the 2 timestamps)
> Select
> > > T1.rowtime, T2.rowtime ...but if the output of a query is the input of
> a
> > > new processing pipeline, then, do we support generally also that the
> > input
> > > has 2 time fields? ...how do we deal with the 2 input fields (maybe I
> am
> > > missing something) further in the datastream pipeline that we build
> based
> > > on the output?
> > > - For the case of proctime - do we need to carry 2 proctimes (the
> > > proctimes of the incoming events from each stream), or 1 proctime (as
> we
> > > operate on proctime and the combination of the 2 inputs can be
> considered
> > > as a new event, the current proctime on the machine can be considered
> the
> > > (proc)time reference for output event) or 3 proctimes (the 2 proctimes
> of
> > > the input plus the proctime when the new event was created)?
> > > - Similar to the point above, for event time (which I am understanding
> as
> > > the time when the event was created...or do we understand them as a
> time
> > > carry within the event?) - when we join 2 events and output an event
> that
> > > is the result of the join - isn't this a new event deta

[jira] [Created] (FLINK-7267) Add support for lists of hosts to connect

2017-07-25 Thread Hu Hailin (JIRA)
Hu Hailin created FLINK-7267:


 Summary: Add support for lists of hosts to connect
 Key: FLINK-7267
 URL: https://issues.apache.org/jira/browse/FLINK-7267
 Project: Flink
  Issue Type: Improvement
  Components: RabbitMQ Connector
Affects Versions: 1.3.0
Reporter: Hu Hailin
Priority: Minor


The RMQConnectionConfig can assign only one host:port. I want to connect to a
cluster through whichever node is available.
My workaround is to write my own sink extending RMQSink and override open(),
assigning the node list in it:
{{  connection = factory.newConnection(addrs)}}
I still need to build the RMQConnectionConfig with a dummy host:port or one
node of the list, which is annoying.
I think it would be better to provide a configuration option for this.
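
As a rough illustration of the workaround (a sketch only -- the class name, the
way hosts are passed, and the queue handling below are made up for this ticket,
not existing connector API):

{code}
import com.rabbitmq.client.Address;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import java.nio.charset.StandardCharsets;

// Sketch of a sink that connects to a list of RabbitMQ nodes instead of the
// single host:port of RMQConnectionConfig. Host names and queue name are
// placeholders.
public class MultiHostRMQSink extends RichSinkFunction<String> {

    private final String queueName;
    private final String[] hosts;
    private final int port;

    private transient Connection connection;
    private transient Channel channel;

    public MultiHostRMQSink(String queueName, String[] hosts, int port) {
        this.queueName = queueName;
        this.hosts = hosts;
        this.port = port;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        Address[] addresses = new Address[hosts.length];
        for (int i = 0; i < hosts.length; i++) {
            addresses[i] = new Address(hosts[i], port);
        }
        ConnectionFactory factory = new ConnectionFactory();
        // the client connects to the first reachable node in the list
        connection = factory.newConnection(addresses);
        channel = connection.createChannel();
        channel.queueDeclare(queueName, false, false, false, null);
    }

    @Override
    public void invoke(String value) throws Exception {
        channel.basicPublish("", queueName, null, value.getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public void close() throws Exception {
        if (channel != null) {
            channel.close();
        }
        if (connection != null) {
            connection.close();
        }
    }
}
{code}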



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Re: [DISCUSS] Table API / SQL internal timestamp handling

2017-07-25 Thread Xingcan Cui
Hi all,

@Fabian, thanks for raising this.

@Radu and Jark, personally I think the timestamp field is critical for
query processing
and thus should be declared as (or supposed to be) NOT NULL. In addition, I
think the
event-time semantics of the join results should be automatically decided by
the system,
i.e., we do not hand it over to users, so as to avoid unpredictable
assignments.

Generally speaking, consolidating different time fields is possible since
all of them
should ideally be monotonically increasing. From my point of view, the
problem lies in
(1) what the relationship between the old and new watermarks is. Should they
be a one-to-one
mapping, or could the new watermarks skip some timestamps? And (2) who is in
charge of
emitting the blocked watermarks, the operator or the process function?

I'd like to hear from you.

Best,
Xingcan



On Wed, Jul 26, 2017 at 10:40 AM, Jark Wu  wrote:

> Hi,
>
> Radu's concerns make sense to me, especially the null value timestamp and
> multi-proctime.
>
> I also have something in mind. I would like to propose some time
> indicator built-in functions, e.g. ROW_TIME(Timestamp ts) will generate an
> event-time logical attribute, PROC_TIME() will generate a processing-time
> logical attribute. It is similar to TUMBLE_ROWTIME proposed in this PR
> https://github.com/apache/flink/pull/4199. These can be used in any
> queries, but there still can't be more than one rowtime attribute or more
> than one proctime attribute in a table schema.
>
> Both selected timestamp fields from a JOIN query will be materialized.
> If someone needs further downstream computation based on the event time, they
> need to create a new time attribute using the ROW_TIME(...) function. And
> this can also solve the null timestamp problem in LEFT JOIN, because we can
> use a user-defined function to combine the two rowtimes and make the result
> the event-time attribute, e.g. SELECT ROW_TIME(udf(T1.rowtime,
> T2.rowtime)) as rowtime FROM T1 JOIN T2 ...
>
>
> What do you think?
>
>
> 2017-07-25 23:48 GMT+08:00 Radu Tudoran :
>
> > Hi,
> >
> > I think this is an interesting discussion and I would like to add some
> > issues and give some feedback.
> >
> > - For supporting the join we not only need to think of the time but
> > also of the null values. For example, if you have a LEFT (or RIGHT) JOIN
> > between items of 2 input streams, and the secondary input is not
> available
> > you should still emit Row.of(event1, null)...as far as I know if you need
> > to serialize/deserialize null values to send them they do not work. So we
> > should include this scenario in the discussions
> > - If we have multiple timestamps in an (output) event, one question is
> > how to select afterwards which is the primary time field on which to
> > operate. When we describe a query we might be able to specify (or we get
> > this implicitly if we implement the carryon of the 2 timestamps)  Select
> > T1.rowtime, T2.rowtime ...but if the output of a query is the input of a
> > new processing pipeline, then, do we support generally also that the
> input
> > has 2 time fields? ...how do we deal with the 2 input fields (maybe I am
> > missing something) further in the datastream pipeline that we build based
> > on the output?
> > - For the case of proctime - do we need to carry 2 proctimes (the
> > proctimes of the incoming events from each stream), or 1 proctime (as we
> > operate on proctime and the combination of the 2 inputs can be considered
> > as a new event, the current proctime on the machine can be considered the
> > (proc)time reference for output event) or 3 proctimes (the 2 proctimes of
> > the input plus the proctime when the new event was created)?
> > - Similar to the point above, for event time (which I am understanding as
> > the time when the event was created...or do we understand them as a time
> > carry within the event?) - when we join 2 events and output an event that
> > is the result of the join - isn't this a new event detached from the
> > source\input events? ... I would tend to say it is a new event and then
> as
> > for proctime the event time of the new event is the current time when
> this
> > output event was created. If we would accept this hypothesis then we
> would
> > not need the 2 time input fields to be carried/managed implicitly.  If
> > someone needs further down the computation pipeline, then in the query
> they
> > would be selected explicitly from the input stream and projected in some
> > fields to be carried (Select T1.rowtime as FormerTime1, T2.rowtime as
> > FormerTime2,  JOIN T1, T2...)...but they would not have the timestamp
> > logic
> >
> > ..my 2 cents
> >
> >
> >
> >
> > Dr. Radu Tudoran
> > Staff Research Engineer - Big Data Expert
> > IT R&D Division
> >
> >
> > HUAWEI TECHNOLOGIES Duesseldorf GmbH
> > German Research Center
> > Munich Office
> > Riesstrasse 25, 80992 München
> >
> > E-mail: radu.tudo...@huawei.com
> > Mobile: +49 15209084330
> > Telephon

Re: Make SubmittedJobGraphStore configurable

2017-07-25 Thread Chen Qin
Hi Till,

As far as I know, there is interest in keeping job graphs recoverable across
shared ZK hiccups, or in standalone mode with customized leader election.

I plan to spend a bit of time prototyping a backup to Amazon S3. I will keep
folks updated as soon as I have a happy path working.

Thanks,
Chen

> On Jul 25, 2017, at 6:07 AM, Till Rohrmann  wrote:
> 
> If there is a need for this, then we can definitely make this configurable.
> The interface SubmittedJobGraphStore is already there.
> 
> Cheers,
> Till
> 
> 
>> On Fri, Jul 7, 2017 at 6:32 AM, Chen Qin  wrote:
>> 
>> Sure,
>>  I would imagine a couple of extra lines within flink.conf
>> ...
>> graphstore.type: customized/zookeeper
>> graphstore.class: org.apache.flink.contrib.MyS3SubmittedJobGraphStoreImp
>> graphstore.endpoint: s3.amazonaws.com
>> graphstore.path.root: s3://my root/
>> 
>> which overwrites initiation of
>> 
>> *org.apache.flink.runtime.highavailability.HighAvailabilityServices*
>> 
>> /**
>> * Gets the submitted job graph store for the job manager
>> *
>> * @return Submitted job graph store
>> * @throws Exception if the submitted job graph store could not be created
>> */
>> 
>> SubmittedJobGraphStore *getSubmittedJobGraphStore*() throws Exception;
>> 
>> In this case, user implemented their own s3 backed job graph store and
>> stores job graphs in s3 instead of zookeeper(high availability) or
>> never(nonha)
>> 
>> I find [1] somewhat related, but it focuses more on the life cycle and
>> dependency aspects of the graph store and checkpoint store. FLINK-7106 in
>> this case is limited to enabling users to implement their own job graph
>> store instead of it being hardcoded to ZooKeeper.
>> 
>> Thanks,
>> Chen
>> 
>> 
>> [1] https://issues.apache.org/jira/browse/FLINK-6626
>> 
>> 
>>> On Thu, Jul 6, 2017 at 2:47 AM, Ted Yu  wrote:
>>> 
>>> The sample config entries are broken into multiple lines.
>>> 
>>> Can you send the config again with one config on one line ?
>>> 
>>> Cheers
>>> 
 On Wed, Jul 5, 2017 at 10:19 PM, Chen Qin  wrote:
 
 Hi there,
 
 I would like to propose/discuss a medium-sized refactoring to make
 submittedJobGraphStore configurable and extensible.

 The rationale behind this is to allow users to offload that metadata to
 durable, cross-DC, read-after-write strongly consistent storage and to
 decouple from the ZK quorum.
 
 
 https://issues.apache.org/jira/browse/FLINK-7106
 
 
 New configurable setting in flink.conf
  looks like following
 
 g
 raph
 -s
 tore:
 customized/zookeeper
 g
 raph
 -s
 tore.class: xx.yy.MyS3SubmittedJobGraphStoreImp
 
 g
 raph
 -s
 tore.
 endpoint
 : s3.amazonaws.com
 g
 raph
 -s
 tore.path.root:
 s3:/
 
 /
 my root/
 
 Thanks,
 Chen
 
>>> 
>> 


Re: is flink' states functionality futile?

2017-07-25 Thread Tzu-Li (Gordon) Tai

  
it's a program that sinks messages only after enough items have accumulated in
the buffer
Yes, that’s another important aspect of how Flink’s checkpointing is used to 
achieve at-least-once / exactly-once delivery to external systems.

For more details on how this works, I recommend taking a look at the docs here [1].
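
A condensed sketch in the spirit of that documented example (THRESHOLD and the
sendToExternalSystem() helper below are placeholders, not Flink API): the
in-flight buffer is copied into managed ListState on every checkpoint and
restored after a failure, which is what enables the at-least-once delivery.

{code}
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

import java.util.ArrayList;
import java.util.List;

public class BufferingSink implements SinkFunction<String>, CheckpointedFunction {

    private static final int THRESHOLD = 100;

    private transient ListState<String> checkpointedState;
    private final List<String> buffer = new ArrayList<>();

    @Override
    public void invoke(String value) throws Exception {
        buffer.add(value);
        if (buffer.size() >= THRESHOLD) {
            for (String element : buffer) {
                sendToExternalSystem(element);   // placeholder for the actual delivery
            }
            buffer.clear();
        }
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // copy the current buffer into managed state on every checkpoint
        checkpointedState.clear();
        for (String element : buffer) {
            checkpointedState.add(element);
        }
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        checkpointedState = context.getOperatorStateStore()
            .getListState(new ListStateDescriptor<>("buffered-elements", String.class));
        if (context.isRestored()) {
            // refill the buffer from the last successful checkpoint
            for (String element : checkpointedState.get()) {
                buffer.add(element);
            }
        }
    }

    private void sendToExternalSystem(String element) {
        // placeholder: write the element to the external system here
    }
}
{code}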


Now, assume I'm not bothered of recovering failures and only want the 
simplest way to implement a program that remembers data from the last run in 
the stream

By “the last run in the stream”, you mean the history of the stream so far,
correct?
If that’s the case, and you don’t care about losing state on failures and don’t
care about at-least-once / exactly-once guarantees, then yes, you don’t have to
use the managed state APIs in Flink.
You can just use ordinary fields to achieve this, since Flink’s streaming
operators are basically long-running processes that continuously process events
in the stream and manipulate their state.
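
A minimal sketch of such a plain-field operator (names are illustrative; the
buffer lives in an ordinary field, so it is simply lost on failure or restart,
which is fine if you do not care about recovery):

{code}
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.util.Collector;

import java.util.ArrayList;
import java.util.List;

public class BufferingFlatMap implements FlatMapFunction<String, List<String>> {

    private static final int THRESHOLD = 100;

    // plain field, not managed state: nothing is checkpointed or restored
    private final List<String> buffer = new ArrayList<>();

    @Override
    public void flatMap(String value, Collector<List<String>> out) {
        buffer.add(value);
        if (buffer.size() >= THRESHOLD) {
            out.collect(new ArrayList<>(buffer));
            buffer.clear();
        }
    }
}
{code}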

Cheers,
Gordon

On 23 July 2017 at 10:28:37 PM, ziv (zivm...@gmail.com) wrote:

Ok, let me see if I understand you correctly.
You actually state that Flink's state functionality is introduced only to
handle recovering from failures.
Let's take the example given in the 1.3 documentation -
https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/state.html

It's a program that sinks messages only after enough items have accumulated in
the buffer.
Now, assume I'm not bothered about recovering from failures and only want the
simplest way to implement a program that remembers data from the last run in
the stream. Then, according to you, I may not use any of the elements
associated with Flink's state -
ListState
snapshotState
initializeState
restoreState
and the program would still function correctly?





--  
View this message in context: 
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/is-flink-states-functionality-futile-tp18867p18879.html
  
Sent from the Apache Flink Mailing List archive. mailing list archive at 
Nabble.com.  


Re: [DISCUSS] Table API / SQL internal timestamp handling

2017-07-25 Thread Jark Wu
Hi,

Radu's concerns make sense to me, especially the null value timestamp and
multi-proctime.

I also have something in mind. I would like to propose some time
indicator built-in functions, e.g. ROW_TIME(Timestamp ts) will generate an
event-time logical attribute, PROC_TIME() will generate a processing-time
logical attribute. It is similar to TUMBLE_ROWTIME proposed in this PR
https://github.com/apache/flink/pull/4199. These can be used in any
queries, but there still can't be more than one rowtime attribute or more
than one proctime attribute in a table schema.

Both selected timestamp fields from a JOIN query will be materialized.
If someone needs further downstream computation based on the event time, they
need to create a new time attribute using the ROW_TIME(...) function. And
this can also solve the null timestamp problem in LEFT JOIN, because we can
use a user-defined function to combine the two rowtimes and make the result
the event-time attribute, e.g. SELECT ROW_TIME(udf(T1.rowtime,
T2.rowtime)) as rowtime FROM T1 JOIN T2 ...
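
For illustration, a sketch of how this could look end to end (ROW_TIME is the
built-in proposed above and combineTs is a hypothetical scalar UDF -- neither
exists at this point; T1 and T2 are assumed to be registered tables with
event-time attributes named rowtime):

{code}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.StreamTableEnvironment;

public class RowTimeProposalSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);

        // join that materializes both input rowtimes and derives a new
        // event-time attribute from them via the proposed ROW_TIME(...)
        Table joined = tableEnv.sql(
            "SELECT ROW_TIME(combineTs(T1.rowtime, T2.rowtime)) AS rowtime, T1.id, T2.amount " +
            "FROM T1 JOIN T2 ON T1.id = T2.id");
        tableEnv.registerTable("J", joined);

        // downstream event-time window aggregation on the derived attribute
        Table windowed = tableEnv.sql(
            "SELECT TUMBLE_START(rowtime, INTERVAL '1' HOUR) AS wStart, SUM(amount) AS total " +
            "FROM J GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR)");
    }
}
{code}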


What do you think?


2017-07-25 23:48 GMT+08:00 Radu Tudoran :

> Hi,
>
> I think this is an interesting discussion and I would like to add some
> issues and give some feedback.
>
> - For supporting the join we not only need to think of the time but
> also of the null values. For example, if you have a LEFT (or RIGHT) JOIN
> between items of 2 input streams, and the secondary input is not available
> you should still emit Row.of(event1, null)...as far as I know if you need
> to serialize/deserialize null values to send them they do not work. So we
> should include this scenario in the discussions
> - If we have multiple timestamps in an (output) event, one question is
> how to select afterwards which is the primary time field on which to
> operate. When we describe a query we might be able to specify (or we get
> this implicitly if we implement the carryon of the 2 timestamps)  Select
> T1.rowtime, T2.rowtime ...but if the output of a query is the input of a
> new processing pipeline, then, do we support generally also that the input
> has 2 time fields? ...how do we deal with the 2 input fields (maybe I am
> missing something) further in the datastream pipeline that we build based
> on the output?
> - For the case of proctime - do we need to carry 2 proctimes (the
> proctimes of the incoming events from each stream), or 1 proctime (as we
> operate on proctime and the combination of the 2 inputs can be considered
> as a new event, the current proctime on the machine can be considered the
> (proc)time reference for output event) or 3 proctimes (the 2 proctimes of
> the input plus the proctime when the new event was created)?
> - Similar to the point above, for event time (which I am understanding as
> the time when the event was created...or do we understand them as a time
> carry within the event?) - when we join 2 events and output an event that
> is the result of the join - isn't this a new event detached from the
> source\input events? ... I would tend to say it is a new event and then as
> for proctime the event time of the new event is the current time when this
> output event was created. If we would accept this hypothesis then we would
> not need the 2 time input fields to be carried/managed implicitly.  If
> someone needs further down the computation pipeline, then in the query they
> would be selected explicitly from the input stream and projected in some
> fields to be carried (Select T1.rowtime as FormerTime1, T2.rowtime as
> FormerTime2,  JOIN T1, T2...)...but they would not have the timestamp
> logic
>
> ..my 2 cents
>
>
>
>
> Dr. Radu Tudoran
> Staff Research Engineer - Big Data Expert
> IT R&D Division
>
>
> HUAWEI TECHNOLOGIES Duesseldorf GmbH
> German Research Center
> Munich Office
> Riesstrasse 25, 80992 München
>
> E-mail: radu.tudo...@huawei.com
> Mobile: +49 15209084330
> Telephone: +49 891588344173
>
> HUAWEI TECHNOLOGIES Duesseldorf GmbH
> Hansaallee 205, 40549 Düsseldorf, Germany, www.huawei.com
> Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
> Managing Director: Bo PENG, Qiuen Peng, Shengli Wang
> Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
> Geschäftsführer: Bo PENG, Qiuen Peng, Shengli Wang
> This e-mail and its attachments contain confidential information from
> HUAWEI, which is intended only for the person or entity whose address is
> listed above. Any use of the information contained herein in any way
> (including, but not limited to, total or partial disclosure, reproduction,
> or dissemination) by persons other than the intended recipient(s) is
> prohibited. If you receive this e-mail in error, please notify the sender
> by phone or email immediately and delete it!
>
> -Original Message-
> From: Fabian Hueske [mailto:fhue...@gmail.com]
> Sent: Tuesday, July 25, 2017 4:22 PM
> To: dev@flink.apache.org
> Subject: [DISCUSS] Table API / SQL internal timestamp handling
>
> H

Re: Towards a spec for robust streaming SQL, Part 2

2017-07-25 Thread Pramod Immaneni
Thanks for the invitation, Tyler. I am sure the folks who worked on the Calcite
integration, and others, would be interested.

On Tue, Jul 25, 2017 at 12:12 PM, Tyler Akidau 
wrote:

> +d...@apex.apache.org, since I'm told Apex has a Calcite integration as
> well. If anyone on the Apex side wants to join in on the fun, your input
> would be welcomed!
>
> -Tyler
>
>
> On Mon, Jul 24, 2017 at 4:34 PM Tyler Akidau  wrote:
>
> > Hello Flink, Calcite, and Beam dev lists!
> >
> > Linked below is the second document I promised way back in April
> regarding
> > a collaborative spec for streaming SQL in Beam/Calcite/Flink (& apologies
> > for the delay; I thought I was nearly done a while back and then temporal
> > joins expanded to something much larger than expected).
> >
> > To repeat what it says in the doc, my hope is that it can serve various
> > purposes over its lifetime:
> >
> >- A discussion ground for ironing out any remaining features necessary
> >for supporting robust streaming semantics in Calcite SQL.
> >
> >- A rough, high-level source of truth for tracking efforts underway in
> >support of this, currently spanning the Calcite, Flink, and Beam
> projects.
> >
> >- A written specification of the changes that were made, for the sake
> >of understanding the delta after the fact.
> >
> > The first and third points are, IMO, the most important. AFAIK, there are
> > a few features missing still that need to be defined (e.g., triggers
> > equivalents via EMIT, robust temporal join support). I'm also proposing a
> > clear distinction of streams and tables, which I think is important, but
> > which I believe is not the approach most folks have been taking in this
> > area. Sorting out these open issues and then having a concise record of
> the
> > solutions adopted will be important for providing a solid streaming
> > experience and teaching folks how to use it.
> >
> > At any rate, I would much appreciate it if anyone with an interest in
> this
> > stuff could please take a look and add comments/suggestions/references
> to
> > related work in flight/etc as appropriate. For now please use
> > comments/suggestions, but if you really want to dive in with edit access,
> > let me know.
> >
> > The doc: http://s.apache.org/streaming-sql-spec
> >
> > -Tyler
> >
> >
> >
>


Re: Towards a spec for robust streaming SQL, Part 2

2017-07-25 Thread Tyler Akidau
+d...@apex.apache.org, since I'm told Apex has a Calcite integration as
well. If anyone on the Apex side wants to join in on the fun, your input
would be welcomed!

-Tyler


On Mon, Jul 24, 2017 at 4:34 PM Tyler Akidau  wrote:

> Hello Flink, Calcite, and Beam dev lists!
>
> Linked below is the second document I promised way back in April regarding
> a collaborative spec for streaming SQL in Beam/Calcite/Flink (& apologies
> for the delay; I thought I was nearly done a while back and then temporal
> joins expanded to something much larger than expected).
>
> To repeat what it says in the doc, my hope is that it can serve various
> purposes over its lifetime:
>
>- A discussion ground for ironing out any remaining features necessary
>for supporting robust streaming semantics in Calcite SQL.
>
>- A rough, high-level source of truth for tracking efforts underway in
>support of this, currently spanning the Calcite, Flink, and Beam projects.
>
>- A written specification of the changes that were made, for the sake
>of understanding the delta after the fact.
>
> The first and third points are, IMO, the most important. AFAIK, there are
> a few features missing still that need to be defined (e.g., triggers
> equivalents via EMIT, robust temporal join support). I'm also proposing a
> clear distinction of streams and tables, which I think is important, but
> which I believe is not the approach most folks have been taking in this
> area. Sorting out these open issues and then having a concise record of the
> solutions adopted will be important for providing a solid streaming
> experience and teaching folks how to use it.
>
> At any rate, I would much appreciate it if anyone with an interest in this
> stuff could please take a look and add comments/suggestions/references to
> related work in flight/etc as appropriate. For now please use
> comments/suggestions, but if you really want to dive in with edit access,
> let me know.
>
> The doc: http://s.apache.org/streaming-sql-spec
>
> -Tyler
>
>
>


RE: [DISCUSS] Table API / SQL internal timestamp handling

2017-07-25 Thread Radu Tudoran
Hi,

I think this is an interesting discussion and I would like to add some issues 
and give some feedback.

- For supporting the join we not only need to think of the time but also of
the null values. For example, if you have a LEFT (or RIGHT) JOIN between items
of 2 input streams and the secondary input is not available, you should still
emit Row.of(event1, null)... As far as I know, null values currently do not
work when they need to be serialized/deserialized to be sent. So we should
include this scenario in the discussions (a small sketch of the padding follows
after these points).
- If we have multiple timestamps in an (output) event, one question is how to
select afterwards which is the primary time field to operate on. When we
describe a query we might be able to specify this (or we get it implicitly if
we implement the carry-on of the 2 timestamps): Select T1.rowtime, T2.rowtime
... But if the output of a query is the input of a new processing pipeline, do
we then generally also support that the input has 2 time fields? How do we
deal with the 2 input fields (maybe I am missing something) further down the
datastream pipeline that we build based on the output?
- For the case of proctime - do we need to carry 2 proctimes (the proctimes of
the incoming events from each stream), or 1 proctime (as we operate on proctime
and the combination of the 2 inputs can be considered a new event, the current
proctime on the machine can be considered the (proc)time reference for the
output event), or 3 proctimes (the 2 proctimes of the inputs plus the proctime
when the new event was created)?
- Similar to the point above, for event time (which I understand as the time
when the event was created... or do we understand it as a time carried within
the event?) - when we join 2 events and output an event that is the result of
the join - isn't this a new event, detached from the source/input events? I
would tend to say it is a new event, and then, as for proctime, the event time
of the new event is the current time when this output event was created. If we
accept this hypothesis, then we would not need the 2 time input fields to be
carried/managed implicitly. If someone needs them further down the computation
pipeline, then in the query they would be selected explicitly from the input
stream and projected into some fields to be carried (Select T1.rowtime as
FormerTime1, T2.rowtime as FormerTime2, ... JOIN T1, T2...), but they would not
have the timestamp logic.
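
Coming back to the first point, a rough illustration of the LEFT JOIN padding
(field layout and names are made up; how such null fields survive
serialization is exactly the open question raised above):

{code}
import org.apache.flink.types.Row;

public class JoinPadding {

    // copy the left-hand fields and leave the right-hand fields null
    public static Row padLeftOuter(Row left, int rightArity) {
        Row result = new Row(left.getArity() + rightArity);
        for (int i = 0; i < left.getArity(); i++) {
            result.setField(i, left.getField(i));
        }
        // positions [left.getArity(), left.getArity() + rightArity) stay null
        return result;
    }
}
{code}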

..my 2 cents




Dr. Radu Tudoran
Staff Research Engineer - Big Data Expert
IT R&D Division


HUAWEI TECHNOLOGIES Duesseldorf GmbH
German Research Center
Munich Office
Riesstrasse 25, 80992 München

E-mail: radu.tudo...@huawei.com
Mobile: +49 15209084330
Telephone: +49 891588344173

HUAWEI TECHNOLOGIES Duesseldorf GmbH
Hansaallee 205, 40549 Düsseldorf, Germany, www.huawei.com
Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
Managing Director: Bo PENG, Qiuen Peng, Shengli Wang
Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
Geschäftsführer: Bo PENG, Qiuen Peng, Shengli Wang 
This e-mail and its attachments contain confidential information from HUAWEI, 
which is intended only for the person or entity whose address is listed above. 
Any use of the information contained herein in any way (including, but not 
limited to, total or partial disclosure, reproduction, or dissemination) by 
persons other than the intended recipient(s) is prohibited. If you receive this 
e-mail in error, please notify the sender by phone or email immediately and 
delete it!

-Original Message-
From: Fabian Hueske [mailto:fhue...@gmail.com] 
Sent: Tuesday, July 25, 2017 4:22 PM
To: dev@flink.apache.org
Subject: [DISCUSS] Table API / SQL internal timestamp handling

Hi everybody,

I'd like to propose and discuss some changes in the way how the Table API / SQL 
internally handles timestamps.

The Table API is implemented on top of the DataStream API. The DataStream API 
hides timestamps from users in order to ensure that timestamps and watermarks 
are aligned. Instead users assign timestamps and watermarks once (usually at 
the source or in a subsequent operator) and let the system handle the 
timestamps from there on. Timestamps are stored in the timestamp field of the 
StreamRecord which is a holder for the user record and the timestamp. 
DataStream operators that depend on time (time-windows, process function, ...) 
access the timestamp from the StreamRecord.

In contrast to the DataStream API, the Table API and SQL are aware of the 
semantics of a query. I.e., we can analyze how users access timestamps and 
whether they are modified or not. Another difference is that the timestamp must 
be part of the schema of a table in order to have correct query semantics.

The current design to handle timestamps is as follows. The Table API stores 
timestamps in the timestamp field of the StreamRecord. Therefore, timestamps 
are detached from the remaining data which is stored in Row objec

[jira] [Created] (FLINK-7266) Don't attempt to delete parent directory on S3

2017-07-25 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-7266:
---

 Summary: Don't attempt to delete parent directory on S3
 Key: FLINK-7266
 URL: https://issues.apache.org/jira/browse/FLINK-7266
 Project: Flink
  Issue Type: Bug
  Components: Core
Affects Versions: 1.3.1
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 1.4.0, 1.3.2


Currently, every attempted release of an S3 state object also checks if the 
"parent directory" is empty and then tries to delete it.

Not only is that unnecessary on S3, but it is prohibitively expensive and for 
example causes S3 to throttle calls by the JobManager on checkpoint cleanup.

The {{FileState}} must only attempt parent directory cleanup when operating 
against real file systems, not when operating against object stores.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (FLINK-7265) FileSystems should describe their kind and consistency level

2017-07-25 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-7265:
---

 Summary: FileSystems should describe their kind and consistency 
level
 Key: FLINK-7265
 URL: https://issues.apache.org/jira/browse/FLINK-7265
 Project: Flink
  Issue Type: Improvement
  Components: Core
Affects Versions: 1.3.1
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 1.4.0, 1.3.2


Currently, all {{FileSystem}} types look the same to Flink.

However, certain operations should only be executed on certain kinds of file 
systems.

For example, it makes no sense to attempt to delete an empty parent directory 
on S3, because there are no such things as directories, only hierarchical 
naming in the keys (file names).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Re: [DISCUSS] Release 1.3.2 planning

2017-07-25 Thread Aljoscha Krettek
I did some preliminary cluster testing on 1.3-SNAPSHOT and it seems there is 
another issue in RocksDB with incremental checkpoints. Stefan and I are 
currently investigating. I'll keep you posted and we can hopefully create an RC 
once we hunt this one down.

Best,
Aljoscha

> On 24. Jul 2017, at 14:39, Timo Walther  wrote:
> 
> I just merged the fixes for FLINK-7137 and FLINK-7177. All blockers are 
> resolved from the Table & SQL API side.
> 
> Regards,
> Timo
> 
> Am 24.07.17 um 11:14 schrieb Aljoscha Krettek:
>> @Greg: I merged that.
>> 
>> It seems like all blockers are now either resolved or determined not to be 
>> blocking for this release. I will try and cut the first RC now.
>> 
>> Best,
>> Aljoscha
>> 
>>> On 21. Jul 2017, at 19:07, Greg Hogan  wrote:
>>> 
>>> FLINK-7211 is a trivial change for excluding the gelly examples javadoc 
>>> from the release assembly and would be good to have fixed for 1.3.2.
>>> 
>>> 
 On Jul 13, 2017, at 3:34 AM, Tzu-Li (Gordon) Tai  
 wrote:
 
 I agree that FLINK-6951 should also be a blocker for 1.3.2. I’ll update 
 its priority.
 
 On 13 July 2017 at 4:06:06 PM, Bowen Li (bowen...@offerupnow.com) wrote:
 
 Hi Aljoscha,
 I'd like to see https://issues.apache.org/jira/browse/FLINK-6951 fixed
 in 1.3.2, if it makes sense.
 
 Thanks,
 Bowen
 
 On Wed, Jul 12, 2017 at 3:06 AM, Aljoscha Krettek 
 wrote:
 
> Short update, we resolved some blockers and discovered some new ones.
> There’s this nifty Jira page if you want to keep track:
> https://issues.apache.org/jira/projects/FLINK/versions/12340984 <
> https://issues.apache.org/jira/projects/FLINK/versions/12340984>
> 
> Once again, could everyone please update the Jira issues that they think
> should be release blocking. I would like to start building release
> candidates at the end of this week, if possible.
> 
> And yes, I’m volunteering to be the release manager on this release. ;-)
> 
> Best,
> Aljoscha
> 
>> On 7. Jul 2017, at 16:03, Aljoscha Krettek  wrote:
>> 
>> I think we might have another blocker: https://issues.apache.org/
> jira/browse/FLINK-7133 
>>> On 7. Jul 2017, at 09:18, Haohui Mai  wrote:
>>> 
>>> I think we are pretty close now -- Jira shows that we're down to two
>>> blockers: FLINK-7069 and FLINK-6965.
>>> 
>>> FLINK-7069 is being merged and we have a PR for FLINK-6965.
>>> 
>>> ~Haohui
>>> 
>>> On Thu, Jul 6, 2017 at 1:44 AM Aljoscha Krettek 
> wrote:
 I’m seeing these remaining blockers:
 https://issues.apache.org/jira/browse/FLINK-7069?filter=
> 12334772&jql=project%20%3D%20FLINK%20AND%20priority%20%3D%20Blocker%20AND%
> 20resolution%20%3D%20Unresolved
 <
 https://issues.apache.org/jira/browse/FLINK-7069?filter=
> 12334772&jql=project%20=%20FLINK%20AND%20priority%20=%
> 20Blocker%20AND%20resolution%20=%20Unresolved
 Could everyone please correctly mark as “blocking” those issues that
> they
 consider blocking for 1.3.2 so that we get an accurate overview of
> where we
 are.
 
 @Chesnay, could you maybe check if this one should in fact be
> considered a
 blocker: https://issues.apache.org/jira/browse/FLINK-7034? <
 https://issues.apache.org/jira/browse/FLINK-7034?>
 
 Best,
 Aljoscha
> On 6. Jul 2017, at 07:19, Tzu-Li (Gordon) Tai 
 wrote:
> FLINK-7041 has been merged.
> I’d also like to raise another blocker for 1.3.2:
 https://issues.apache.org/jira/browse/FLINK-6996.
> Cheers,
> Gordon
> On 30 June 2017 at 12:46:07 AM, Aljoscha Krettek (aljos...@apache.org
> )
 wrote:
> Gordon and I found this (in my opinion) blocking issue:
 https://issues.apache.org/jira/browse/FLINK-7041 <
 https://issues.apache.org/jira/browse/FLINK-7041>
> I’m trying to quickly provide a fix.
> 
>> On 26. Jun 2017, at 15:30, Timo Walther  wrote:
>> 
>> I just opened a PR which should be included in the next bug fix
> release
 for the Table API:
>> https://issues.apache.org/jira/browse/FLINK-7005
>> 
>> Timo
>> 
>> Am 23.06.17 um 14:09 schrieb Robert Metzger:
>>> Thanks Haohui.
>>> 
>>> The first main task for the release management is to come up with a
>>> timeline :)
>>> Lets just wait and see which issues get reported. There are
> currently
 no
>>> blockers set for 1.3.1 in JIRA.
>>> 
>>> On Thu, Jun 22, 2017 at 6:47 PM, Haohui Mai 
 wrote:
 Hi,
 
 Release management is though, I'm happy to help. Are there any
 timelines
>>>

[DISCUSS] Table API / SQL internal timestamp handling

2017-07-25 Thread Fabian Hueske
Hi everybody,

I'd like to propose and discuss some changes in the way how the Table API /
SQL internally handles timestamps.

The Table API is implemented on top of the DataStream API. The DataStream
API hides timestamps from users in order to ensure that timestamps and
watermarks are aligned. Instead users assign timestamps and watermarks once
(usually at the source or in a subsequent operator) and let the system
handle the timestamps from there on. Timestamps are stored in the timestamp
field of the StreamRecord which is a holder for the user record and the
timestamp. DataStream operators that depend on time (time-windows, process
function, ...) access the timestamp from the StreamRecord.

In contrast to the DataStream API, the Table API and SQL are aware of the
semantics of a query. I.e., we can analyze how users access timestamps and
whether they are modified or not. Another difference is that the timestamp
must be part of the schema of a table in order to have correct query
semantics.

The current design to handle timestamps is as follows. The Table API stores
timestamps in the timestamp field of the StreamRecord. Therefore,
timestamps are detached from the remaining data which is stored in Row
objects. Hence, the physical representation of a row is different from its
logical representation. We introduced a translation layer (RowSchema) to
convert the logical schema into the physical schema. This is necessary for
serialization or code generation when the logical plan is translated into a
physical execution plan. Processing-time timestamps are similarly handled.
They are not included in the physical schema and looked up when needed.
This design also requires that we need to materialize timestamps when they
are accessed by expressions. Timestamp materialization is done as a
pre-optimization step.

While thinking about the implementation of the event-time windowed
stream-stream join [1] I stumbled over the question which timestamp of both
input tables to forward. With the current design, we could only have a
single timestamp, so keeping both timestamps would not be possible. The
choice of the timestamp would need to be specified by the query, otherwise
it would lack clear semantics. When executing the join, the join operator
would need to make sure that no late data is emitted. This would only work if
the operator was able to hold back watermarks [2].

With this information in mind, I'd like to discuss the following proposal:

- We allow more than one event-time timestamp and store them directly in
the Row
- The query operators ensure that the watermarks are always behind all
event-time timestamps. With additional analysis we will be able to restrict
this to timestamps that are actually used as such.
- When a DataStream operator is time-based (e.g., a DataStream time-window),
we inject an operator that copies the timestamp from the Row into the
StreamRecord (a rough user-level sketch of this idea follows after this list).
- We try to remove the distinction between logical and physical schema. For
event-time timestamps this is because we store them in the Row object, for
processing-time timestamps, we add a dummy byte field. When accessing a
field of this type, the code generator injects the code to fetch the
timestamps.
- We might be able to get around the pre-optimization time materialization
step.
- A join result would be able to keep both timestamps. The watermark would
be held back for both, so both could be used in subsequent operations.
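
As a rough user-level approximation of the timestamp-copying idea above (this
is not the planner-injected operator of the proposal; the rowtime position and
its millisecond encoding are assumptions for illustration):

{code}
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.types.Row;

public class RowTimestampCopy {

    // copy an event-time field stored inside the Row back into the
    // StreamRecord timestamp before a time-based DataStream operation
    public static DataStream<Row> copyRowtimeToStreamRecord(
            DataStream<Row> rows, final int rowtimeIndex) {

        return rows.assignTimestampsAndWatermarks(
            new BoundedOutOfOrdernessTimestampExtractor<Row>(Time.seconds(5)) {
                @Override
                public long extractTimestamp(Row row) {
                    // assumes the rowtime is kept as epoch milliseconds in the Row
                    return (Long) row.getField(rowtimeIndex);
                }
            });
    }
}
{code}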

I admit, I haven't thought this completely through.
However, the benefits of this design from my point of view are:
- encoding of timestamps in Rows means that the logical schema is equal to
the physical schema
- no timestamp materialization
- support for multiple timestamps. Otherwise we would need to expose
internal restrictions to the user which are hard to explain / communicate.
- no need to change any public interfaces at the moment.

The drawbacks as far as I see them are:
- additional payload due to unused timestamp field + possibly the
processing-time dummy field
- complete rework of the internal timestamp logic (again...)

Please let me know what you think,
Fabian

[1] https://issues.apache.org/jira/browse/FLINK-6233
[2] https://issues.apache.org/jira/browse/FLINK-7245


Re: Use Flink's actor system

2017-07-25 Thread Till Rohrmann
Hi Lorenzo,

apart from experimentation, it's not recommended to directly use Flink's
ActorSystem, because it is an implementation detail. With FLIP-6 the
ActorSystem will be hidden further, and in the future we might implement a
different RPC system that does not rely on Akka.

Cheers,
Till

On Sun, Jul 16, 2017 at 8:11 PM, Chesnay Schepler 
wrote:

> Hello,
>
> I don't think the actor system is exposed to operators in any way.
>
> If you want to access it whatever it may take (reflection etc.), you could
> try working your way through the metric group (through the runtime context)
> into the MetricRegistry and from there into the MetricQueryService.
>
> Regards,
> Chesnay
>
>
> On 12.07.2017 17:05, Lorenzo Affetti wrote:
>
>> Hello everybody,
>>
>> I want to do some experiments with communication across operators.
>> Even if it is arguable, I wondered whether it is possible to directly use the
>> clustered actor system used by Flink.
>> If so, how can I get an instance of the actor system in an Operator?
>>
>> Thank you in advance
>>
>> Lorenzo Affetti
>>
>>
>>
>>
>>
>


Re: Ship files to flink on yarn-cluster mode

2017-07-25 Thread Till Rohrmann
Hi Guy,

you can use the -yarnship command line option to specify a directory whose
content will be shipped to the launched yarn application. Thus, all files
contained in this directory will be copied to HDFS from where they are
accessible by Flink.

Cheers,
Till

On Thu, Jul 13, 2017 at 4:40 PM, Guy Harmach  wrote:

> Hi,
>
> Is there a way to ship files to a flink job running on YARN (similar to
> -files flag in Spark)?
> According to the Flink CLI usage there is a -yarnship flag, but no
> documentation on how to use it and retrieve the files in the Flink job.
> If it is possible, I'd appreciate a reference to an example of sending a
> file and reading it.
>
> Thanks, Guy
> This message and the information contained herein is proprietary and
> confidential and subject to the Amdocs policy statement,
>
> you may review at https://www.amdocs.com/about/email-disclaimer <
> https://www.amdocs.com/about/email-disclaimer>
>


[jira] [Created] (FLINK-7264) travis watchdog is killed before tests

2017-07-25 Thread Chesnay Schepler (JIRA)
Chesnay Schepler created FLINK-7264:
---

 Summary: travis watchdog is killed before tests
 Key: FLINK-7264
 URL: https://issues.apache.org/jira/browse/FLINK-7264
 Project: Flink
  Issue Type: Bug
  Components: Travis
Affects Versions: 1.4.0
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
 Fix For: 1.4.0






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Re: Make SubmittedJobGraphStore configurable

2017-07-25 Thread Till Rohrmann
If there is a need for this, then we can definitely make this configurable.
The interface SubmittedJobGraphStore is already there.

Cheers,
Till

On Fri, Jul 7, 2017 at 6:32 AM, Chen Qin  wrote:

> Sure,
> I would imagine a couple of extra lines within flink.conf
> ...
> graphstore.type: customized/zookeeper
> graphstore.class: org.apache.flink.contrib.MyS3SubmittedJobGraphStoreImp
> graphstore.endpoint: s3.amazonaws.com
> graphstore.path.root: s3://my root/
>
> which overwrites initiation of
>
> *org.apache.flink.runtime.highavailability.HighAvailabilityServices*
>
> /**
> * Gets the submitted job graph store for the job manager
> *
> * @return Submitted job graph store
> * @throws Exception if the submitted job graph store could not be created
> */
>
> SubmittedJobGraphStore *getSubmittedJobGraphStore*() throws Exception;
>
> In this case, user implemented their own s3 backed job graph store and
> stores job graphs in s3 instead of zookeeper(high availability) or
> never(nonha)
>
> I find [1] somewhat related, but it focuses more on the life cycle and
> dependency aspects of the graph store and checkpoint store. FLINK-7106 in
> this case is limited to enabling users to implement their own job graph
> store instead of it being hardcoded to ZooKeeper.
>
> Thanks,
> Chen
>
>
> [1] https://issues.apache.org/jira/browse/FLINK-6626
>
>
> On Thu, Jul 6, 2017 at 2:47 AM, Ted Yu  wrote:
>
> > The sample config entries are broken into multiple lines.
> >
> > Can you send the config again with one config on one line ?
> >
> > Cheers
> >
> > On Wed, Jul 5, 2017 at 10:19 PM, Chen Qin  wrote:
> >
> > > Hi there,
> > >
> > > I would like to propose/discuss a medium-sized refactoring to make
> > > submittedJobGraphStore configurable and extensible.
> > >
> > > The rationale behind this is to allow users to offload that metadata to
> > > durable, cross-DC, read-after-write strongly consistent storage and to
> > > decouple from the ZK quorum.
> > >
> > >
> > > https://issues.apache.org/jira/browse/FLINK-7106
> > >
> > > 
> > > New configurable setting in flink.conf
> > > looks like the following:
> > >
> > > g
> > > raph
> > > -s
> > > tore:
> > > customized/zookeeper
> > > g
> > > raph
> > > -s
> > > tore.class: xx.yy.MyS3SubmittedJobGraphStoreImp
> > >
> > > g
> > > raph
> > > -s
> > > tore.
> > > endpoint
> > > : s3.amazonaws.com
> > > g
> > > raph
> > > -s
> > > tore.path.root:
> > > s3:/
> > >
> > > /
> > > my root/
> > >
> > > Thanks,
> > > Chen
> > >
> >
>


[jira] [Created] (FLINK-7263) Improve Pull Request Template

2017-07-25 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-7263:
---

 Summary: Improve Pull Request Template
 Key: FLINK-7263
 URL: https://issues.apache.org/jira/browse/FLINK-7263
 Project: Flink
  Issue Type: Improvement
  Components: Documentation
Reporter: Stephan Ewen
Assignee: Stephan Ewen


As discussed in the mailing list, the suggestion is to update the pull request 
template as follows:


*Thank you very much for contributing to Apache Flink - we are happy that you 
want to help us improve Flink. To help the community review your contribution in 
the best possible way, please go through the checklist below, which will get 
the contribution into a shape in which it can be best reviewed.*

*Please understand that we do not do this to make contributions to Flink a 
hassle. In order to uphold a high standard of quality for code contributions, 
while at the same time managing a large number of contributions, we need 
contributors to prepare the contributions well, and give reviewers enough 
contextual information for the review. Please also understand that 
contributions that do not follow this guide will take longer to review and thus 
typically be picked up with lower priority by the community.*

## Contribution Checklist

  - Make sure that the pull request corresponds to a [JIRA 
issue](https://issues.apache.org/jira/projects/FLINK/issues). Exceptions are 
made for typos in JavaDoc or documentation files, which need no JIRA issue.
  
  - Name the pull request in the form "[FLINK-1234] [component] Title of the 
pull request", where *FLINK-1234* should be replaced by the actual issue 
number. Skip *component* if you are unsure about which is the best component.
  Typo fixes that have no associated JIRA issue should be named following this 
pattern: `[hotfix] [docs] Fix typo in event time introduction` or `[hotfix] 
[javadocs] Expand JavaDoc for PuncuatedWatermarkGenerator`.

  - Fill out the template below to describe the changes contributed by the pull 
request. That will give reviewers the context they need to do the review.
  
  - Make sure that the change passes the automated tests, i.e., `mvn clean 
verify` 

  - Each pull request should address only one issue, not mix up code from 
multiple issues.
  
  - Each commit in the pull request has a meaningful commit message (including 
the JIRA id)

  - Once all items of the checklist are addressed, remove the above text and 
this checklist, leaving only the filled out template below.


**(The sections below can be removed for hotfixes of typos)**

## What is the purpose of the change

*(For example: This pull request makes task deployment go through the blob 
server, rather than through RPC. That way we avoid re-transferring them on each 
deployment (during recovery).)*


## Brief change log

*(for example:)*
  - *The TaskInfo is stored in the blob store on job creation time as a 
persistent artifact*
  - *Deployments RPC transmits only the blob storage reference*
  - *TaskManagers retrieve the TaskInfo from the blob cache*


## Verifying this change

*(Please pick either of the following options)*

This change is a trivial rework / code cleanup without any test coverage.

*(or)*

This change is already covered by existing tests, such as *(please describe 
tests)*.

*(or)*

This change added tests and can be verified as follows:

*(example:)*
  - *Added integration tests for end-to-end deployment with large payloads 
(100MB)*
  - *Extended integration test for recovery after master (JobManager) failure*
  - *Added test that validates that TaskInfo is transferred only once across 
recoveries*
  - *Manually verified the change by running a 4-node cluster with 2 JobManagers 
and 4 TaskManagers, a stateful streaming program, and killing one JobManager 
and two TaskManagers during the execution, verifying that recovery happens 
correctly.*

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): **(yes / no)**
  - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: **(yes / no)**
  - The serializers: **(yes / no / don't know)**
  - The runtime per-record code paths (performance sensitive): **(yes / no / 
don't know)**
  - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: **(yes / no / don't know)**:

## Documentation

  - Does this pull request introduce a new feature? **(yes / no)**
  - If yes, how is the feature documented? **(not applicable / docs / JavaDocs 
/ not documented)**




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (FLINK-7262) remove unused FallbackLibraryCacheManager

2017-07-25 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-7262:
--

 Summary: remove unused FallbackLibraryCacheManager
 Key: FLINK-7262
 URL: https://issues.apache.org/jira/browse/FLINK-7262
 Project: Flink
  Issue Type: Sub-task
  Components: Distributed Coordination, Network
Affects Versions: 1.4.0
Reporter: Nico Kruber
Assignee: Nico Kruber


{{FallbackLibraryCacheManager}} is basically only used in unit tests nowadays 
and should probably be removed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Re: [DISCUSS] A more thorough Pull Request check list and template

2017-07-25 Thread Fabian Hueske
Fair enough.
I still think it's too verbose but most of the feedback was positive, so I
don't want to block this.

I think it would be good to model the form with check boxes where possible.
For example

> The runtime per-record code paths (performance sensitive): *(yes / no /
don't know)*

could be done as:

> [ ] Does not touch the runtime per-record code paths (performance
sensitive).

(assuming that "Don't know" should be treated as "yes").

Also the examples could be a bit more diverse.
They are all targeted towards "engine"-related changes and do not serve as
good examples for APIs, libraries, connectors, etc. changes.

But I think we can discuss the details on the PR.

-- Fabian

2017-07-25 9:13 GMT+02:00 Ufuk Celebi :

> Great! :-) If Fabian is also OK with trying it out, I would ask
> Stephan to open a PR for this against Flink.
>
>
> On Tue, Jul 25, 2017 at 8:50 AM, Chesnay Schepler 
> wrote:
> > I'm still apprehensive about it, but don't mind trying it out. It's not
> like
> > it can break something.
> >
> >
> > On 24.07.2017 18:52, Ufuk Celebi wrote:
> >>
> >> What's the conclusion of last weeks discussion here?
> >>
> >> Fabian and Chesnay raised concerns about the introductory text. Are
> >> you still concerned?
> >>
> >> On Wed, Jul 19, 2017 at 10:04 AM, Stephan Ewen 
> wrote:
> >>>
> >>> @Chesnay:
> >>>
> >>> Put text into template => contributor will have to read it
> >>> Put link to text into template => most contributors will ignore the
> link
> >>>
> >>> Yes, that is pretty much what my observation from the past is.
> >>>
> >>>
> >>>
> >>> On Tue, Jul 18, 2017 at 11:03 PM, Chesnay Schepler  >
> >>> wrote:
> >>>
>  I'm sorry but i can't follow your logic.
> 
>  Put text into template => contributor will definitely read it
>  Put link to text into template => contributor will completely ignore
> the
>  link
> 
>  The advantage of the link is we don't duplicate the contribution guide
>  in
>  the docs and in the template.
>  Furthermore, you don't even need to remove something from the
> template,
>  since it's just a single line.
> 
> 
>  On 18.07.2017 19:25, Stephan Ewen wrote:
> 
> > Concerning moving text to the contributors guide:
> >
> > I can only say it again: I believe the contribution guide is almost
> > dead
> > text. Very few people read it.
> > Before the current template was introduced, new contributors rarely
> > gave
> > the pull request a name with Jira number. That is a good indicator
> > about
> > how many read this guide.
> > Putting the text in the template is a way to make sure everyone reads it.
> >
> >
> > I am also wondering what the concern is.
> > A new contributor should clearly read through a bit of text, to learn
> > what
> > we look for in contributions.
> > A recurring contributor will not have to read it again, simply remove
> > the
> > text from the pull request message and go on.
> >
> > Where is the disadvantage?
> >
> >
> > On Tue, Jul 18, 2017 at 5:35 PM, Nico Kruber  >
> > wrote:
> >
> > I like the new template but also agree with the text being too long
> and
> >>
> >> would
> >> move the intro to the contributors guide with a link in the PR
> >> template.
> >>
> >> Regarding the questions to fill out - I'd like the headings to be
> >> short
> >> and
> >> have the affected components last so that documentation is not lost
> >> (although
> >> being more important than this checklist), e.g.:
> >>
> >> * Purpose of the change
> >> * Brief change log
> >> * Verifying the change
> >> * Documentation
> >> * Affected components
> >>
> >> The verification options in the original template look a bit too
> large
> >> but
> >> it
> >> stresses what tests should be added, especially for bigger changes.
> >> Can't
> >> think of a way to make it shorter though.
> >>
> >>
> >> Nico
> >>
> >> On Tuesday, 18 July 2017 11:20:41 CEST Chesnay Schepler wrote:
> >>
> >>> I fully agree with Fabian.
> >>>
> >>> Multiple-choice questions provide little value to the reviewer,
> since
> >>> the
> >>> validity has to be verified in any case. While text answers have to
> >>> be
> >>> validated as well,
> >>> they give some hint to the reviewer as to how it can be verified
> and
> >>> which steps the
> >>> contributor took to do so.
> >>>
> >>> I also agree that it is too long; IMO this is really intimidating
> to
> >>> new
> >>> contributors to be greeted with this.
> >>>
> >>> Ideally we only link to the contributors guide and ask 3 questions:
> >>>
> >>> * What is the problem?
> >>> * How was it fixed?
> >>> * How can the fix be verified?
> >>>
> >>> On 18.07.2017 10:47, Fabian Hueske wrote:
> >>

[jira] [Created] (FLINK-7261) avoid unnecessary exceptions in the logs in non-HA cases

2017-07-25 Thread Nico Kruber (JIRA)
Nico Kruber created FLINK-7261:
--

 Summary: avoid unnecessary exceptions in the logs in non-HA cases
 Key: FLINK-7261
 URL: https://issues.apache.org/jira/browse/FLINK-7261
 Project: Flink
  Issue Type: Sub-task
  Components: Distributed Coordination, Network
Affects Versions: 1.4.0
Reporter: Nico Kruber
Assignee: Nico Kruber


{{PermanentBlobCache#getHAFileInternal}} first tries to download the file from the 
HA store, but if it does not exist there, it will log an exception from the attempt 
to move the {{incomingFile}} to its destination, which is misleading to the user.

We should extend {{BlobView#get}} to return whether a file was actually copied, 
e.g. so that the {{VoidBlobStore}} can indicate that nothing was copied. This keeps 
the abstraction of the BLOB stores but avoids reporting errors in expected cases. 
(Recall that {{FileSystemBlobStore#get}} already throws an exception if anything 
fails there; if it succeeds but the subsequent move fails, the exception from the 
move should still surface.)
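
A minimal sketch of the idea (illustrative only; written in Scala for brevity, with 
simplified signatures that are assumptions rather than the actual Flink API):

{code}
import java.io.File

// Illustrative sketch: a boolean return value tells the caller whether the BLOB
// store actually copied a file, so the non-HA path can skip the misleading log.
trait BlobView {
  /** Copies the requested BLOB to `localFile`; returns true iff a file was copied. */
  def get(blobKey: String, localFile: File): Boolean
}

// Non-HA case: there is no HA store, so nothing is ever copied and no error
// should be reported when the subsequent local move is skipped.
class VoidBlobStore extends BlobView {
  override def get(blobKey: String, localFile: File): Boolean = false
}
{code}

With such a return value, the cache can distinguish "no HA copy available" from an 
actual failure and only log exceptions in the latter case.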



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (FLINK-7260) Match not exhaustive in WindowJoinUtil.scala

2017-07-25 Thread Ted Yu (JIRA)
Ted Yu created FLINK-7260:
-

 Summary: Match not exhaustive in WindowJoinUtil.scala
 Key: FLINK-7260
 URL: https://issues.apache.org/jira/browse/FLINK-7260
 Project: Flink
  Issue Type: Bug
Reporter: Ted Yu
Priority: Minor


Here is related code:
{code}
[WARNING] 
/mnt/disk2/a/flink/flink-libraries/flink-table/src/main/scala/org/apache/flink/table/runtime/join/WindowJoinUtil.scala:296:
 warning: match may not be exhaustive.
[WARNING] It would fail on the following inputs: BETWEEN, CREATE_VIEW, DEFAULT, 
DESCRIBE_SCHEMA, DOT, EXCEPT, EXTRACT, GREATEST, LAST_VALUE, 
MAP_QUERY_CONSTRUCTOR, NOT, NOT_EQUALS, NULLS_LAST, PATTERN_ALTER, PREV, 
SIMILAR, SUM0, TIMESTAMP_DIFF, UNION, VAR_SAMP
[WARNING]   timePred.pred.getKind match {
{code}
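
One possible shape of a fix (a sketch only, not the actual patch; the supported 
kinds and the exception choice below are assumptions) is to add a catch-all case so 
that an unsupported predicate kind fails with a descriptive error instead of a 
{{scala.MatchError}} at runtime:

{code}
import org.apache.calcite.sql.SqlKind
import org.apache.flink.table.api.TableException

object PredicateKindCheck {
  // Illustrative only: handle the comparison kinds explicitly and turn every
  // other kind into a clear TableException instead of a MatchError.
  def translate(kind: SqlKind): String = kind match {
    case SqlKind.GREATER_THAN          => ">"
    case SqlKind.GREATER_THAN_OR_EQUAL => ">="
    case SqlKind.LESS_THAN             => "<"
    case SqlKind.LESS_THAN_OR_EQUAL    => "<="
    case other =>
      throw new TableException(s"Unsupported predicate kind: $other")
  }
}
{code}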



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (FLINK-7259) Match not exhaustive in TaskMonitor

2017-07-25 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-7259:


 Summary: Match not exhaustive in TaskMonitor
 Key: FLINK-7259
 URL: https://issues.apache.org/jira/browse/FLINK-7259
 Project: Flink
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.3.1, 1.4.0
Reporter: Till Rohrmann


Two matches are not exhaustive in the class {{TaskMonitor}}. This can lead to a 
{{MatchError}}, potentially restarting this actor (depending on the supervision 
strategy).

{code}
 
/Users/uce/Code/flink/flink-mesos/src/main/scala/org/apache/flink/mesos/scheduler/TaskMonitor.scala:157:
 warning: match may not be exhaustive.
[WARNING] It would fail on the following input: TASK_KILLING
[WARNING]   msg.status().getState match {
[WARNING]^
[WARNING] 
/Users/uce/Code/flink/flink-mesos/src/main/scala/org/apache/flink/mesos/scheduler/TaskMonitor.scala:170:
 warning: match may not be exhaustive.
[WARNING] It would fail on the following input: TASK_KILLING
[WARNING]   msg.status().getState match {
{code}
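
A sketch of one possible fix (illustrative only, not the actual {{TaskMonitor}} 
code; the state classification below is an assumption):

{code}
import org.apache.mesos.Protos.TaskState

object TaskStateCheck {
  // Illustrative only: handle TASK_KILLING explicitly and keep a wildcard so that
  // states added in future Mesos versions cannot crash the actor with a MatchError.
  def isTerminal(state: TaskState): Boolean = state match {
    case TaskState.TASK_FINISHED | TaskState.TASK_FAILED |
         TaskState.TASK_KILLED | TaskState.TASK_LOST | TaskState.TASK_ERROR => true
    case TaskState.TASK_STAGING | TaskState.TASK_STARTING |
         TaskState.TASK_RUNNING | TaskState.TASK_KILLING => false
    case _ => false
  }
}
{code}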



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Re: [DISCUSS] A more thorough Pull Request check list and template

2017-07-25 Thread Ufuk Celebi
Great! :-) If Fabian is also OK with trying it out, I would ask
Stephan to open a PR for this against Flink.


On Tue, Jul 25, 2017 at 8:50 AM, Chesnay Schepler  wrote:
> I'm still apprehensive about it, but I don't mind trying it out. It's not like
> it can break anything.
>
>
> On 24.07.2017 18:52, Ufuk Celebi wrote:
>>
>> What's the conclusion of last week's discussion here?
>>
>> Fabian and Chesnay raised concerns about the introductory text. Are
>> you still concerned?
>>
>> On Wed, Jul 19, 2017 at 10:04 AM, Stephan Ewen  wrote:
>>>
>>> @Chesnay:
>>>
>>> Put text into template => contributor will have to read it
>>> Put link to text into template => most contributors will ignore the link
>>>
>>> Yes, that is pretty much what I have observed in the past.
>>>
>>>
>>>
>>> On Tue, Jul 18, 2017 at 11:03 PM, Chesnay Schepler 
>>> wrote:
>>>
 I'm sorry, but I can't follow your logic.

 Put text into template => contributor will definitely read it
 Put link to text into template => contributor will completely ignore the
 link

 The advantage of the link is that we don't duplicate the contribution guide
 in the docs and in the template.
 Furthermore, you don't even need to remove anything from the template,
 since it's just a single line.


 On 18.07.2017 19:25, Stephan Ewen wrote:

> Concerning moving text to the contributors guide:
>
> I can only say it again: I believe the contribution guide is almost
> dead
> text. Very few people read it.
> Before the current template was introduced, new contributors rarely gave
> the pull request a name with a JIRA number. That is a good indicator of
> how many people read this guide.
> Putting the text in the template is a way to make sure that everyone reads it.
>
>
> I am also wondering what the concern is.
> A new contributor should clearly read through a bit of text, to learn
> what
> we look for in contributions.
> A recurring contributor will not have to read it again; they can simply
> remove the text from the pull request message and go on.
>
> Where is the disadvantage?
>
>
> On Tue, Jul 18, 2017 at 5:35 PM, Nico Kruber 
> wrote:
>
>> I like the new template but also agree that the text is too long and
>> would move the intro to the contributors guide with a link in the PR
>> template.
>>
>> Regarding the questions to fill out - I'd like the headings to be short
>> and have the affected components last so that documentation is not lost
>> (although it is more important than this checklist), e.g.:
>>
>> * Purpose of the change
>> * Brief change log
>> * Verifying the change
>> * Documentation
>> * Affected components
>>
>> The verification options in the original template look a bit too large,
>> but they stress what tests should be added, especially for bigger changes.
>> Can't think of a way to make it shorter, though.
>>
>>
>> Nico
>>
>> On Tuesday, 18 July 2017 11:20:41 CEST Chesnay Schepler wrote:
>>
>>> I fully agree with Fabian.
>>>
>>> Multiple-choice questions provide little value to the reviewer, since
>>> the validity has to be verified in any case. While text answers have to
>>> be validated as well, they give the reviewer some hint as to how the
>>> change can be verified and which steps the contributor took to do so.
>>>
>>> I also agree that it is too long; IMO it is really intimidating for new
>>> contributors to be greeted with this.
>>>
>>> Ideally we only link to the contributors guide and ask 3 questions:
>>>
>>> * What is the problem?
>>> * How was it fixed?
>>> * How can the fix be verified?
>>>
>>> On 18.07.2017 10:47, Fabian Hueske wrote:
>>>
 I like the sections about purpose, change log, and verification of the
 changes.

 However, I think the proposed template contains too much text. This is
 probably the reason why the first attempt to establish a PR template failed.

 I would move most of the introduction and explanations, incl. examples,
 to the "Contribution Guidelines" and only include a link.

 IMO, the template should be rather shorter than the current one and only
 have the link, the sections to fill out, and checkboxes.

 I'm also not sure how much the detailed questions will help.
 For example, even if the question about changed dependencies is answered
 with "no", the reviewer still has to check that.

 I think the questions of the current template work differently.
 A question "Does the PR include tests?" suggests to