Re: load based stream partitioning

Timothy Farkas Thu, 11 Feb 2016 12:21:30 -0800

+1 for the idea.

Gaurav, this could be done idempotently in the same way that dynamic
repartitioning is done idempotently. All the partitions are rolled back to
a common checkpoint and the new StreamCodec is applied starting then. The
statistics that the Stream Codec are given are the statistics for the
windows computed before the common checkpoint that the partitions are
rolled back to.


In fact I think this feature could be added easily by avoiding buffer
server entirely and by allowing the Partitioner to redefine the StreamCodec
for the operator when define partitions is called.

Thanks,
Tim

On Thu, Feb 11, 2016 at 12:07 PM, Amol Kekre <[email protected]> wrote:

> Gaurav,
> It would not be idempotent per partition, but will be across all partitions
> combined. In this case the user would have explicitly asked for such a
> pattern.
>
> Thks,
> Amol
>
>
> On Thu, Feb 11, 2016 at 12:04 PM, Gaurav Gupta <[email protected]>
> wrote:
>
> > Pramod,
> >
> > How would it work with recovery? There could be cases where a tuple went
> to
> > P1 and post recovery it can go to P2
> >
> > Gaurav
> >
> > On Thu, Feb 11, 2016 at 11:56 AM, Pramod Immaneni <
> [email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > There are scenarios where the downstream partitions of an upstream
> > operator
> > > are generally not performing uniformly resulting in an overall
> > sub-optimal
> > > performance dictated by the slowest partitions. The reasons could be
> data
> > > related such as some partitions are receiving more data to process than
> > the
> > > others or could be environment related such as some partitions are
> > running
> > > slower than others because they are on heavily loaded nodes.
> > >
> > > A solution based on currently available functionality in the engine
> would
> > > be to write a StreamCodec implementation to distribute data among the
> > > partitions such that each partition is receiving similar amount of data
> > to
> > > process. We should consider adding StreamCodecs like these to the
> library
> > > but these however do not solve the problem when it is environment
> > related.
> > >
> > > For that a better and more comprehensive approach would be look at how
> > data
> > > is being consumed by the downstream partitions from the BufferServer
> and
> > > use that information to make decisions on how to send future data. If
> > some
> > > partitions are behind others in consuming data then data can be
> directed
> > to
> > > the other partitions. One way to do this would be to relay this type of
> > > statistical and positional information from BufferServer to the
> upstream
> > > publishers. The publishers can use this information in ways such as
> > making
> > > it available to StreamCodecs to affect destination of future data.
> > >
> > > What do you think.
> > >
> > > Thanks
> > >
> >
>

Re: load based stream partitioning

Reply via email to