Jenkins build became unstable: beam_Release_NightlySnapshot #175

2016-09-22 Thread Apache Jenkins Server
See 



Re: Preferred locations (or data locality) for batch pipelines.

2016-09-22 Thread Amit Sela
Not where in the file, but where in the cluster.

Like you said - mapper - in MapReduce the mapper instance will *prefer* to
start on the same machine as the node hosting the data block it reads
(unless that's changed; I've been out of touch with MR for a while...).

And for Spark:
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/data_locality.html

As for Flink, it's a streaming-first engine (sort of the opposite of Spark,
which is batch-first), so I *assume* it doesn't have this notion and simply
"streams" the input.

Dataflow - no idea...
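
For HDFS-backed inputs at least, the block-to-host mapping described above
can be queried directly. Below is a minimal sketch (the path is hypothetical)
using the Hadoop FileSystem API; this is essentially the information
MapReduce and Spark build their preferred locations from. Blob stores like
s3/gs expose no such mapping, which is why locality is usually a no-op there.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocality {
  public static void main(String[] args) throws Exception {
    // Hypothetical HDFS path; any Hadoop-supported filesystem URI works here.
    Path input = new Path("hdfs:///data/events/part-00000");
    FileSystem fs = input.getFileSystem(new Configuration());
    FileStatus status = fs.getFileStatus(input);
    // Ask the NameNode which hosts hold each block of the file; schedulers
    // use exactly this information to place work "close" to the data.
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(), String.join(", ", block.getHosts()));
    }
  }
}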

On Thu, Sep 22, 2016 at 5:45 PM Jesse Anderson 
wrote:

> I've only ever seen that being used to figure out which file the
> runner/mapper/operation is working on. Otherwise, I haven't seen those
> operations care where in the file they're working.
>
> On Thu, Sep 22, 2016 at 5:57 AM Amit Sela  wrote:
>
> > Wouldn't it force all runners to implement this for all distributed
> > filesystems? It's true that each runner has its own "partitioning"
> > mechanism, but I assume (maybe I'm wrong) that open-source runners use the
> > Hadoop InputFormat/InputSplit for that, and the proper connectors for that
> > to run on top of s3/gs.
> >
> > If this is wrong, each runner should take care of its own, but if not, we
> > could have a generic solution for runners, no?
> >
> > Thanks,
> > Amit
> >
> > On Thu, Sep 22, 2016 at 3:30 PM Jean-Baptiste Onofré 
> > wrote:
> >
> > > Hi Amit,
> > >
> > > As the purpose is to remove IOChannelFactory, I would suggest it's a
> > > runner concern. Read.Bounded should "locate" the bundles on an executor
> > > close to the read data (even if that's not always possible, depending on
> > > the source).
> > >
> > > My $0.01
> > >
> > > Regards
> > > JB
> > >
> > > On 09/22/2016 02:26 PM, Amit Sela wrote:
> > > > It's not new that batch pipelines can optimize on data locality; my
> > > > question is about where this responsibility lives in Beam.
> > > > If runners should implement generic Read.Bounded support, should they
> > > > also implement locating the input blocks? Or should it be part of the
> > > > IOChannelFactory implementations? Or is there another way to go at it
> > > > that I'm missing?
> > > >
> > > > Thanks,
> > > > Amit.
> > > >
> > >
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> >
>


Re: Preferred locations (or data locality) for batch pipelines.

2016-09-22 Thread Jesse Anderson
I've only ever seen that being used to figure out which file the
runner/mapper/operation is working on. Otherwise, I haven't seen those
operations care where in the file they're working.

On Thu, Sep 22, 2016 at 5:57 AM Amit Sela  wrote:

> Wouldn't it force all runners to implement this for all distributed
> filesystems? It's true that each runner has its own "partitioning"
> mechanism, but I assume (maybe I'm wrong) that open-source runners use the
> Hadoop InputFormat/InputSplit for that, and the proper connectors for that
> to run on top of s3/gs.
>
> If this is wrong, each runner should take care of its own, but if not, we
> could have a generic solution for runners, no?
>
> Thanks,
> Amit
>
> On Thu, Sep 22, 2016 at 3:30 PM Jean-Baptiste Onofré 
> wrote:
>
> > Hi Amit,
> >
> > As the purpose is to remove IOChannelFactory, I would suggest it's a
> > runner concern. Read.Bounded should "locate" the bundles on an executor
> > close to the read data (even if that's not always possible, depending on
> > the source).
> >
> > My $0.01
> >
> > Regards
> > JB
> >
> > On 09/22/2016 02:26 PM, Amit Sela wrote:
> > > It's not new that batch pipelines can optimize on data locality; my
> > > question is about where this responsibility lives in Beam.
> > > If runners should implement generic Read.Bounded support, should they
> > > also implement locating the input blocks? Or should it be part of the
> > > IOChannelFactory implementations? Or is there another way to go at it
> > > that I'm missing?
> > >
> > > Thanks,
> > > Amit.
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>


Re: Preferred locations (or data locality) for batch pipelines.

2016-09-22 Thread Jean-Baptiste Onofré

Hi Amit,

As the purpose is to remove IOChannelFactory, I would suggest it's a runner
concern. Read.Bounded should "locate" the bundles on an executor close to the
read data (even if that's not always possible, depending on the source).
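
A minimal sketch of what a generic, runner-facing shape for this could look
like; this is not an existing Beam interface, just a hypothetical pairing of
a bundle's source split with the hosts a runner would prefer to schedule it
on:

import java.util.Collections;
import java.util.List;

// Hypothetical runner-side value type (not part of the Beam API): a bundle's
// source split together with the hosts the runner would prefer to run it on.
public class LocatedBundle<SplitT> {
  private final SplitT split;
  private final List<String> preferredHosts;

  public LocatedBundle(SplitT split, List<String> preferredHosts) {
    this.split = split;
    // Empty when locality is unknown, e.g. for blob stores like s3/gs.
    this.preferredHosts = Collections.unmodifiableList(preferredHosts);
  }

  public SplitT getSplit() { return split; }

  public List<String> getPreferredHosts() { return preferredHosts; }
}

Each runner would then translate preferredHosts into its own scheduling
mechanism (MapReduce split locations, Spark preferred locations, and so on),
or ignore it where locality does not apply.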


My $0.01

Regards
JB

On 09/22/2016 02:26 PM, Amit Sela wrote:

It's not new that batch pipelines can optimize on data locality; my question
is about where this responsibility lives in Beam.
If runners should implement generic Read.Bounded support, should they also
implement locating the input blocks? Or should it be part of the
IOChannelFactory implementations? Or is there another way to go at it that
I'm missing?

Thanks,
Amit.



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Preferred locations (or data locality) for batch pipelines.

2016-09-22 Thread Amit Sela
It's not new that batch pipelines can optimize on data locality; my question
is about where this responsibility lives in Beam.
If runners should implement generic Read.Bounded support, should they also
implement locating the input blocks? Or should it be part of the
IOChannelFactory implementations? Or is there another way to go at it that
I'm missing?

Thanks,
Amit.


Re: Runtime Windows/Aggregation Computations in Beam

2016-09-22 Thread Robert Bradshaw
This may be possible with a custom WindowFn. Where is the configuration of
what aggregations to do coming from?

On Wed, Sep 21, 2016 at 11:27 PM, Chawla,Sumit 
wrote:

> Attaching the Image.
>
>
> [inline image omitted]
>
> Regards
> Sumit Chawla
>
>
> On Wed, Sep 21, 2016 at 11:24 PM, Chawla,Sumit 
> wrote:
>
>> Hi All
>>
>> I am trying to code a solution for the following scenarios.
>>
>> 1.  I have a stream of Tuples with multiple numeric fields (e.g. A, B, C,
>> D, E ... etc )
>> 2.  I want the ability to do different Windowing and Aggregation on each
>> field or a group of fields in the Tuple.  e.g. Sum A over a Period of 2
>> minutes, Avg B over a period of 3 minutes,  Sum of C grouped by D over a
>> period of 15 minutes
>> 3.  *These window requirements can be added by the user at runtime*.  My
>> pipeline should be able to compute a new aggregation at runtime.
>> 4.  Plan to support only simple aggregation windows like SUM, AVG, MAX,
>> MIN, COUNT etc.
>>
>>
>> As I understand it, in Beam pipelines (with the Flink runner) the DAG of
>> computations cannot be altered once the pipeline is deployed.  I am trying
>> to see how I can support the above use case.  I would love to hear your
>> feedback on this, and suggestions on doing it in a completely different
>> way.
>>
>> *My Design:*
>>
>> 1.  Create 1-minute buckets per field or group of fields and compute
>> basic aggregations for each bucket.  e.g.  Buckets are highlighted in
>> yellow here.  For each field I calculate [SUM, COUNT, MAX, MIN] in the
>> bucket.  (A bucket size of 1 minute is chosen for simplicity, and would
>> limit the minimum window size to 1 minute.)
>>
>> 2.  Downstream, process these buckets and compute the user-defined
>> aggregations.  The following diagram depicts tumbling-window computations.
>> The aggregation functions in GREEN are just NATIVE functions consuming
>> different buckets, and doing aggregations on top of these buckets.
>>
>>
>>
>>
>> [inline images omitted]
>>
>> *P.S.*
>>
>> * Some of the design choices that I have decided not to go for are:*
>>
>> 1.  Multiple pipelines doing the computation.  One master pipeline does the
>> grouping and sends to a different topic based on the user-configured window
>> size (e.g. topic_window_by_5_min, topic_window_by_15_min), and a different
>> pipeline consumes each topic.
>>
>> 2.  A single pipeline doing all the work, with predefined windows for
>> downstream processing. e.g. 5-, 15-, 30-, and 60-minute windows would be
>> defined, consuming from different side inputs.  The user is only allowed to
>> select these window sizes.  An upstream Group By operator would route the
>> data to a different window function based on the user configuration.
>>
>>
>>
>> Regards
>> Sumit Chawla
>>
>>
>


Runtime Windows/Aggregation Computations in Beam

2016-09-22 Thread Chawla,Sumit
Hi All

I am trying to code a solution for the following scenarios.

1.  I have a stream of Tuples with multiple numeric fields (e.g. A, B, C,
D, E ... etc )
2.  I want the ability to do different windowing and aggregation on each
field or a group of fields in the tuple, e.g. Sum of A over a period of 2
minutes, Avg of B over a period of 3 minutes, Sum of C grouped by D over a
period of 15 minutes.
3.  *These window requirements can be added by the user at runtime*.  My
pipeline should be able to compute a new aggregation at runtime.
4.  I plan to support only simple aggregations like SUM, AVG, MAX, MIN,
COUNT, etc.


As I understand it, in Beam pipelines (with the Flink runner) the DAG of
computations cannot be altered once the pipeline is deployed.  I am trying
to see how I can support the above use case.  I would love to hear your
feedback on this, and suggestions on doing it in a completely different
way.

*My Design:*

1.  Create 1-minute buckets per field or group of fields and compute basic
aggregations for each bucket.  e.g.  Buckets are highlighted in yellow here.
For each field I calculate [SUM, COUNT, MAX, MIN] in the bucket.  (A bucket
size of 1 minute is chosen for simplicity, and would limit the minimum window
size to 1 minute.)  A sketch of this bucketing in code follows after step 2.

2.  Downstream, process these buckets and compute the user-defined
aggregations.  The following diagram depicts tumbling-window computations.
The aggregation functions in GREEN are just NATIVE functions consuming
different buckets, and doing aggregations on top of these buckets.
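
A minimal sketch of step 1 with the Beam Java SDK, assuming a hypothetical
upstream ParDo has already flattened each tuple into (fieldName, value)
pairs; only the SUM bucket is shown for brevity:

import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class MinuteBuckets {
  // Step 1: pre-aggregate each field into fixed 1-minute buckets.
  // `fieldValues` is a hypothetical PCollection of (fieldName, value) pairs
  // produced by an upstream ParDo over the incoming tuples.
  static PCollection<KV<String, Double>> perMinuteSums(
      PCollection<KV<String, Double>> fieldValues) {
    return fieldValues
        .apply("OneMinuteBuckets",
            Window.<KV<String, Double>>into(FixedWindows.of(Duration.standardMinutes(1))))
        .apply("SumPerField", Sum.<String>doublesPerKey());
  }
}

Step 2 can then combine these per-minute partials across however many buckets
a user-requested window spans (sums and counts add, min/max compose), which
keeps the deployed DAG fixed while the set of user aggregations changes at
runtime.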




[inline images omitted]

*P.S.*

* Some of the design choices that I have decided not to go for are:*

1.  Multiple pipelines doing the computation.  One master pipeline does the
grouping and sends to a different topic based on the user-configured window
size (e.g. topic_window_by_5_min, topic_window_by_15_min), and a different
pipeline consumes each topic.

2.  A single pipeline doing all the work, with predefined windows for
downstream processing. e.g. 5-, 15-, 30-, and 60-minute windows would be
defined, consuming from different side inputs.  The user is only allowed to
select these window sizes.  An upstream Group By operator would route the
data to a different window function based on the user configuration.



Regards
Sumit Chawla