+1 for the idea. Gaurav, this could be done idempotently in the same way that dynamic repartitioning is done idempotently. All the partitions are rolled back to a common checkpoint and the new StreamCodec is applied starting then. The statistics that the Stream Codec are given are the statistics for the windows computed before the common checkpoint that the partitions are rolled back to.
In fact I think this feature could be added easily by avoiding buffer server entirely and by allowing the Partitioner to redefine the StreamCodec for the operator when define partitions is called. Thanks, Tim On Thu, Feb 11, 2016 at 12:07 PM, Amol Kekre <[email protected]> wrote: > Gaurav, > It would not be idempotent per partition, but will be across all partitions > combined. In this case the user would have explicitly asked for such a > pattern. > > Thks, > Amol > > > On Thu, Feb 11, 2016 at 12:04 PM, Gaurav Gupta <[email protected]> > wrote: > > > Pramod, > > > > How would it work with recovery? There could be cases where a tuple went > to > > P1 and post recovery it can go to P2 > > > > Gaurav > > > > On Thu, Feb 11, 2016 at 11:56 AM, Pramod Immaneni < > [email protected]> > > wrote: > > > > > Hi, > > > > > > There are scenarios where the downstream partitions of an upstream > > operator > > > are generally not performing uniformly resulting in an overall > > sub-optimal > > > performance dictated by the slowest partitions. The reasons could be > data > > > related such as some partitions are receiving more data to process than > > the > > > others or could be environment related such as some partitions are > > running > > > slower than others because they are on heavily loaded nodes. > > > > > > A solution based on currently available functionality in the engine > would > > > be to write a StreamCodec implementation to distribute data among the > > > partitions such that each partition is receiving similar amount of data > > to > > > process. We should consider adding StreamCodecs like these to the > library > > > but these however do not solve the problem when it is environment > > related. > > > > > > For that a better and more comprehensive approach would be look at how > > data > > > is being consumed by the downstream partitions from the BufferServer > and > > > use that information to make decisions on how to send future data. If > > some > > > partitions are behind others in consuming data then data can be > directed > > to > > > the other partitions. One way to do this would be to relay this type of > > > statistical and positional information from BufferServer to the > upstream > > > publishers. The publishers can use this information in ways such as > > making > > > it available to StreamCodecs to affect destination of future data. > > > > > > What do you think. > > > > > > Thanks > > > > > >
