> > > 2) Along with Dataflow model, I would also borrow some of the
> > > features from MillWheel [1] and FlumeJava [2] (features such as
> > > fault tolerance, running efficient data-parallel pipelines, etc.).
> >
> > Perfect. Do you have something more concrete in mind? Any use cases?
> > Design ideas?
>
> Following is my brief design idea that incorporates features from both
> FlumeJava and MillWheel:
>
> Let us model stream processing as a directed acyclic graph whose
> directed edges carry the outputs of each step. For stream processing
> each output would have three parameters, i.e., key, value, and
> timestamp (the key identifies the processing request, the value is its
> output, and the timestamp is the time we received the request). Each
> user-defined processFunction is a node of the directed acyclic graph,
> and we would send back an ACK (acknowledgment) signal once the data
> from the i-th node has been received by the (i+1)-th node (this
> property ensures data is not lost in the process). Data on a given node
> is stored in a std::map<uint, std::string>. A point worth mentioning
> here is that each map would exist only for a defined time period (a few
> milliseconds); if an ACK is not received within that period, the map
> for that specific node will be cleared. Now, in case there are many
> parallel data pipelines, the directed edges carrying the final output
> of each pipeline will be concatenated using a "join()" function and
> then further processed by the resulting function (analogous to reduce).
> A rough sketch of such a node follows at the end of this mail.
>
> Another point worth mentioning is that instead of the ACK model, we can
> also consider Uber's Ringpop RPC model:
> https://eng.uber.com/intro-to-ringpop/
>
> PS: There is a dataflow-based framework under development known as
> Apache Beam (https://github.com/apache/incubator-beam) which can be
> looked at for inspiration.
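> PPS: To make the node design above more concrete, here is a very
> rough, untested sketch; all names (Record, Node, ack_timeout,
> process_fn) are placeholders of mine, not part of any existing API:
>
>     #include <chrono>
>     #include <functional>
>     #include <map>
>     #include <string>
>
>     using clock_type = std::chrono::steady_clock;
>
>     // one (key, value, timestamp) item travelling along an edge
>     struct Record
>     {
>         unsigned key;                      // the processing request
>         std::string value;                 // output of the previous step
>         clock_type::time_point timestamp;  // time the request arrived
>     };
>
>     class Node
>     {
>         // results held until the downstream node ACKs receipt
>         // ('unsigned' is used here since 'uint' is not standard C++)
>         std::map<unsigned, std::string> pending_;
>         // when each pending entry was produced
>         std::map<unsigned, clock_type::time_point> sent_at_;
>         std::chrono::milliseconds ack_timeout_{5};   // a few ms
>         std::function<std::string(std::string const&)> process_fn_;
>
>     public:
>         explicit Node(std::function<std::string(std::string const&)> f)
>           : process_fn_(std::move(f))
>         {
>         }
>
>         // data arrives from the i-th node: process it and buffer the
>         // result until the (i+1)-th node acknowledges it
>         std::string receive(Record const& r)
>         {
>             std::string out = process_fn_(r.value);
>             pending_[r.key] = out;
>             sent_at_[r.key] = clock_type::now();
>             return out;
>         }
>
>         // ACK from the (i+1)-th node: data arrived, drop our copy
>         void ack(unsigned key)
>         {
>             pending_.erase(key);
>             sent_at_.erase(key);
>         }
>
>         // called periodically: clear entries whose ACK never came
>         void expire()
>         {
>             auto now = clock_type::now();
>             for (auto it = sent_at_.begin(); it != sent_at_.end();)
>             {
>                 if (now - it->second > ack_timeout_)
>                 {
>                     pending_.erase(it->first);
>                     it = sent_at_.erase(it);
>                 }
>                 else
>                 {
>                     ++it;
>                 }
>             }
>         }
>     };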
All of this sounds like a good first step in the right direction. Several
comments:

Do we need explicit graphs representing the execution flow? Wouldn't
something implicit work as well? What I have in mind is to pass along
future objects representing/holding the result of a computational step
and to use those as 'tokens' representing the data dependency on this
result. This would allow us to simply use hpx::dataflow to trigger the
required operation once all futures have become ready (see the small
sketch at the end of this mail).

I am not sure we should build an explicit ACK feedback into the first
step. We run in tightly coupled scenarios where failures usually lead to
full application crashes anyways :) But I agree, having something like
that in the design might be a good idea.

I'd ask you to keep poking at the problem and to keep developing your
design. It might be a good idea if you tried to outline things in a bit
more detail for us to really grasp what you're up to.

> > We definitely will be here to discuss things as you start putting out
> > ideas, questions, suggestions, etc. I think you have already started
> > looking at HPX itself, if not - it might be a good time to start
> > doing so.
>
> Along with adding a framework for Dataflow/Map-Reduce to HPX, I also
> plan to provide a pluggable storage and cache interface so that
> framework users can store data in various storage systems (BigTable,
> Cassandra, etc.); through the pluggable cache interface, users can
> store data in any available cache system (Redis/Memcache). PS: there
> won't be an external plugin needed for either of these functions, just
> two Unix ports for communication.

Please keep in mind that HPX has to be portable, i.e. direct usage of
*nix interfaces is not an option.

Regards
Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu
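PS: Just to illustrate the futures-as-tokens idea, here is an untested
sketch; the two stage functions (parse, enrich) are made up for the
example, and the includes may differ depending on your HPX version:

    #include <hpx/hpx_main.hpp>
    #include <hpx/include/async.hpp>
    #include <hpx/include/lcos.hpp>

    #include <iostream>
    #include <string>

    // two made-up pipeline stages
    std::string parse(std::string const& raw)
    {
        return raw + " [parsed]";
    }

    std::string enrich(std::string const& rec)
    {
        return rec + " [enriched]";
    }

    int main()
    {
        // each step returns a future; the future itself is the 'token'
        // representing the data dependency on this step's result
        hpx::future<std::string> f1 =
            hpx::async(parse, std::string("some input"));

        // hpx::dataflow triggers the next step once all argument
        // futures have become ready - no explicit graph is needed
        hpx::future<std::string> f2 = hpx::dataflow(
            [](hpx::future<std::string> f)
            {
                return enrich(f.get());
            },
            std::move(f1));

        std::cout << f2.get() << std::endl;
        return 0;
    }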
