Hi Stavros!

I think what Aljoscha wants to say is that the community is a bit
hard-pressed to review new and complex things right now.
There are a lot of threads going on already.

If you want to work on this, why not make your own GitHub project
"Approximate algorithms on Apache Flink" or so?

Greetings,
Stephan



On Wed, May 25, 2016 at 3:02 PM, Aljoscha Krettek <aljos...@apache.org>
wrote:

> Hi,
> that link was interesting, thanks! As I said though, it's probably not a
> good fit for Flink right now.
>
> The things that I feel are important right now are:
>
>  - dynamic scaling: the ability of a streaming pipeline to adapt to changes
> in the amount of incoming data. This is tricky with stateful operations and
> long-running pipelines. For Spark this is easier to do because every
> mini-batch is individually scheduled and they can therefore be scheduled on
> differing numbers of machines.
>
>  - an API for joining static (or slowly evolving) data with streaming data:
> this has been coming up in different forms on the mailing lists and when
> talking with people. Apache Beam solves this with "side inputs". In Flink
> we want to add something as well, maybe along the lines of side inputs or
> maybe something more specific for the case of pure joins.
>
>  - working on managed memory: In Flink we have always been very conscious
> of how memory is used, using our own abstractions for dealing with memory
> and efficient serialization. We call this the "managed memory"
> abstraction. Spark recently also started going in this direction with
> Project Tungsten. For the streaming API there are still some places where
> we could make things more efficient by working more on the managed memory;
> for example, there is no state backend that uses managed memory. We are
> either completely on the Java heap or use RocksDB there.
>
>  - stream SQL: this is obvious and everybody wants it.
>
>  - A generic cross-runner API: This is what Apache Beam (née Google
> Dataflow) does. It is very interesting to write a program once and then be
> able to run it on different runners. This brings more flexibility for
> users. It's not clear how this will play out in the long run but it's very
> interesting to keep an eye on.
>
> For most of these, the Flink community is in various stages of
> implementing them, so that's good. :-)
>
> Cheers,
> Aljoscha
>
> On Mon, 23 May 2016 at 17:48 Stavros Kontopoulos <st.kontopou...@gmail.com>
> wrote:
>
> > Hey Aljoscha,
> >
> > Thanks for the useful comments. I have recently looked at Spark sketches:
> >
> >
> http://www.slideshare.net/databricks/sketching-big-data-with-spark-randomized-algorithms-for-largescale-data-analytics
> > So there must be value in this effort.
> > In my experience, counting in general is a common need for large data
> > sets. For example, people often use Redis in a non-streaming setting for
> > its HyperLogLog algorithm.
> >
> > What are other areas you find more important or of higher priority for
> > the time being?
> >
> > Best,
> > Stavros
> >
> > On Mon, May 23, 2016 at 6:18 PM, Aljoscha Krettek <aljos...@apache.org>
> > wrote:
> >
> > > Hi,
> > > no such changes are planned right now. The separation between the keys
> > > is very strict in order to make the windowing state re-partitionable so
> > > that we can implement dynamic rescaling of the parallelism of a program.
> > >
> > > The WindowAll is only used for specific cases where you need a Trigger
> > > that sees all elements of the stream. I personally don't think it is
> > > very useful because it is not scalable. In theory, for time windows
> > > this can be parallelized, but it is not currently done in Flink.
> > >
> > > Do you have a specific use case for the count-min sketch in mind? If
> > > not, maybe our energy is better spent on producing examples with
> > > real-world applicability. I'm not against having an example for a
> > > count-min sketch, I'm just worried that you might put your energy into
> > > something that is not useful to a lot of people.
> > >
> > > Cheers,
> > > Aljoscha
> > >
> > > On Fri, 20 May 2016 at 20:13 Stavros Kontopoulos <st.kontopou...@gmail.com>
> > > wrote:
> > >
> > > > Hi, thanks for the feedback.
> > > >
> > > > So there is a limitation due to the parallel windows implementation.
> > > > Are there no intentions to change that somehow to accommodate similar
> > > > estimations?
> > > >
> > > > Is WindowAll used in practice as a step in the pipeline? I mean,
> > > > since it is inherently not parallel, it cannot scale, correct?
> > > > Although there is an exception: "Only for special cases, such as
> > > > aligned time windows is it possible to perform this operation in
> > > > parallel"
> > > > Probably I am missing something...
> > > >
> > > > I could try to do the example stuff (and open a new feature on JIRA
> > > > for that). I will also vote for closing the old issue, since there is
> > > > no other way, at least for the time being...
> > > >
> > > > Thanx,
> > > > Stavros
> > > >
> > > > On Fri, May 20, 2016 at 7:02 PM, Aljoscha Krettek <aljos...@apache.org>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > > with how the window API currently works, this can only be done for
> > > > > non-parallel windows. For keyed windows, everything that happens is
> > > > > scoped to the key of the elements: window contents are kept in
> > > > > per-key state, triggers fire on a per-key basis. Therefore a
> > > > > count-min sketch cannot be used, because it would require keeping
> > > > > state across keys.
> > > > >
> > > > > For non-parallel windows a user could do this:
> > > > >
> > > > > DataStream input = ...
> > > > > input
> > > > >   .windowAll(<some window>)
> > > > >   .fold(new MySketch(), new MySketchFoldFunction())
> > > > >
> > > > > with sketch data types and a fold function that is tailored to the
> > > > > user types. Therefore, I would prefer not to add a special API for
> > > > > this and vote to close
> > > > > https://issues.apache.org/jira/browse/FLINK-2147. I already
> > > > > commented on https://issues.apache.org/jira/browse/FLINK-2144,
> > > > > saying a similar thing.
> > > > >
> > > > > What I would welcome very much is adding some well-documented
> > > > > examples to Flink that showcase how some of these operations can be
> > > > > written.
> > > > >
> > > > > Cheers,
> > > > > Aljoscha
> > > > >
> > > > > On Thu, 19 May 2016 at 16:38 Stavros Kontopoulos <st.kontopou...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi guys,
> > > > > >
> > > > > > I would like to push forward the work here:
> > > > > > https://issues.apache.org/jira/browse/FLINK-2147
> > > > > >
> > > > > > Can anyone more familiar with the streaming API verify whether
> > > > > > this could be a mature task?
> > > > > > The intention is to summarize data over a window, as in the case
> > > > > > of StreamGroupedFold.
> > > > > > Specifically, to implement count-min over a window.
> > > > > >
> > > > > > Best,
> > > > > > Stavros
> > > > > >
> > > > >
> > > >
> > >
> >
>
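As a footnote to the thread: Aljoscha's windowAll(...).fold(...) suggestion needs a sketch type to fold over. A minimal, self-contained count-min sketch in plain Java could look like the following (the class name, constructor parameters, and hash-mixing scheme here are illustrative assumptions, not Flink or any library API):

```java
import java.util.Random;

/**
 * A minimal count-min sketch: a depth x width grid of counters, one
 * independently seeded hash function per row. Estimates are upper
 * bounds on the true count (the sketch never under-counts).
 */
public class CountMinSketch {
    private final long[][] table;   // depth rows of width counters
    private final int depth;
    private final int width;
    private final long[] seeds;     // one hash seed per row

    public CountMinSketch(int depth, int width, long seed) {
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
        this.seeds = new long[depth];
        Random rnd = new Random(seed);
        for (int i = 0; i < depth; i++) {
            seeds[i] = rnd.nextLong();
        }
    }

    // Mix the item's hashCode with the row seed and map to [0, width).
    private int bucket(Object item, long seed) {
        long h = item.hashCode() * 0x9E3779B97F4A7C15L ^ seed;
        h ^= h >>> 33;
        h *= 0xFF51AFD7ED558CCDL;
        h ^= h >>> 33;
        return (int) Math.floorMod(h, (long) width);
    }

    /** Record one occurrence of the item in every row. */
    public void add(Object item) {
        for (int i = 0; i < depth; i++) {
            table[i][bucket(item, seeds[i])]++;
        }
    }

    /** Estimated count: the minimum counter across all rows. */
    public long estimate(Object item) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < depth; i++) {
            min = Math.min(min, table[i][bucket(item, seeds[i])]);
        }
        return min;
    }
}
```

In the fold pattern from the thread, an instance of such a class could serve as the initial accumulator, with the fold function calling add(element) for each window element; the over-count error shrinks as width and depth grow.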
