Hello Sir,

Please find my further improved draft proposal in this document: https://docs.google.com/document/d/1TAq0cfEMhfN53G5hkIym4FScZcHbjSZHTLc2AEHUO7U/edit?usp=sharing
This proposal tries to present a unified batch-processing and stream-processing engine (like Dataflow does). I have included futures as an intermediate step for Dataflow windows (replacing the old ACK model) and used hypertables for storage (to incorporate stream processing into batch processing + Map/Reduce). Please let me know your viewpoint about this approach :)

Thanks,
Aalekh Nigam

On Fri, Mar 11, 2016 at 4:53 PM, Hartmut Kaiser <[email protected]> wrote:

> > > > 2) Along with the Dataflow model, I would also borrow some of the
> > > > features from MillWheel [1] and FlumeJava [2] (features such as
> > > > fault tolerance, running efficient data-parallel pipelines, etc.).
> > >
> > > Perfect. Do you have something more concrete in mind? Any use cases?
> > > Design ideas?
> >
> > Following is my brief design idea that incorporates features from both
> > FlumeJava and MillWheel:
> >
> > Let us assume each step in stream processing to be part of a directed
> > acyclic graph, with outputs represented by directed edges. For stream
> > processing, each output carries three parameters: key, value, and
> > timestamp (the key identifies the processing request, the value its
> > output, and the timestamp the time the request was received). With each
> > user-defined processFunction as a node of the directed acyclic graph,
> > we would send back an ACK (acknowledgment) signal once the data from
> > the i-th node has been received by the (i+1)-th node (this property
> > ensures data is not lost in the process). Data on a given node is
> > stored in a std::map<uint, std::string>. A point worth mentioning here
> > is that each map exists only for a defined time period (a few
> > milliseconds); if no ACK is received within that period, the map for
> > that specific node is cleared.
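To make the ACK model above concrete, here is a minimal sketch of the per-node buffer with a time-limited lifetime. All names (`NodeBuffer`, `emit`, etc.) are illustrative only, not part of HPX or any existing API:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Per-node buffer for the ACK model: each node keeps its (key -> value)
// records until the downstream node acknowledges receipt. If no ACK
// arrives within the timeout, the buffer is cleared, as described above.
class NodeBuffer {
public:
    explicit NodeBuffer(std::chrono::milliseconds timeout)
      : timeout_(timeout), created_(std::chrono::steady_clock::now()) {}

    // Record one output of this node's processFunction.
    void emit(std::uint32_t key, std::string value) {
        data_[key] = std::move(value);
    }

    // Called when the (i+1)-th node confirms receipt of this node's data.
    void ack() {
        data_.clear();
        acked_ = true;
    }

    // Drop the buffer if the ACK window has elapsed without confirmation.
    void expire_if_due(std::chrono::steady_clock::time_point now) {
        if (!acked_ && now - created_ >= timeout_)
            data_.clear();
    }

    std::size_t pending() const { return data_.size(); }

private:
    std::map<std::uint32_t, std::string> data_;
    std::chrono::milliseconds timeout_;
    std::chrono::steady_clock::time_point created_;
    bool acked_ = false;
};
```

The expiry check takes the current time as a parameter so a scheduler (or a test) can drive it deterministically instead of relying on a background timer.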
> > Now, in case there are many parallel data pipelines, the directed edges
> > of the final output from each pipeline will be concatenated using a
> > "join()" function and then further processed by the resulting function
> > (analogous to reduce). A point worth mentioning here is that instead of
> > the ACK model, we could also consider Uber's Ringpop RPC model:
> > https://eng.uber.com/intro-to-ringpop/
> >
> > PS: There is a Dataflow-based framework under development known as
> > Apache Beam: https://github.com/apache/incubator-beam, which can be
> > looked at for inspiration.
>
> All of this sounds like a good first step in the right direction. Several
> comments:
>
> Do we need explicit graphs representing the execution flow? Wouldn't
> something implicit work as well? What I have in mind is to pass along
> future objects representing/holding the result of a computational step and
> using those as 'tokens' representing the data dependency on this result.
> This would allow us to simply use hpx::dataflow to trigger the required
> operation once all futures have become ready.
>
> Not sure if we should build in an explicit ACK feedback in the first step.
> We run in tightly coupled scenarios where failures usually lead to full
> application crashes anyway :) But I agree that having something like that
> in the design might be a good idea.
>
> I'd ask you to keep poking at the problem and to keep developing your
> design. It might be a good idea if you tried to outline things in a bit
> more detail for us to really grasp what you're up to.
>
> > > We definitely will be here to discuss things as you start putting out
> > > ideas, questions, suggestions, etc. I think you have already started
> > > looking at HPX itself; if not - it might be a good time to start doing
> > > so.
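The futures-as-tokens idea, together with the join/reduce of parallel pipelines, can be sketched with plain `std::future` as a stand-in for `hpx::future` (with HPX, `hpx::dataflow` would schedule the final stage once all inputs become ready instead of blocking in `get()`). The stage functions (`parse`, `join_and_reduce`, `run_pipelines`) are hypothetical names for illustration:

```cpp
#include <future>
#include <numeric>
#include <string>
#include <vector>

// Each pipeline stage returns a future; the next stage simply takes the
// futures of its inputs. Waiting on them expresses the data dependency
// implicitly -- no explicit DAG or ACK bookkeeping is needed.

// Stage 1: one parallel pipeline per input record (here just parsing).
int parse(std::string const& record) {
    return std::stoi(record);
}

// "join + reduce": concatenate the results of the parallel pipelines and
// fold them with the resulting function (analogous to reduce).
int join_and_reduce(std::vector<std::future<int>>& parts) {
    std::vector<int> values;
    for (auto& f : parts)
        values.push_back(f.get());  // data dependency: block until ready
    return std::accumulate(values.begin(), values.end(), 0);
}

int run_pipelines(std::vector<std::string> const& records) {
    std::vector<std::future<int>> parts;
    for (auto const& r : records)   // launch the parallel data pipelines
        parts.push_back(std::async(std::launch::async, parse, r));
    return join_and_reduce(parts);
}
```

Because each future already encodes "this result is not ready yet", the explicit per-edge ACK from the earlier design becomes unnecessary in this model.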
> > Along with adding a framework for Dataflow/Map-Reduce to HPX, I also
> > plan to have pluggable storage and cache interfaces, so that a
> > framework user can store data into various storage systems (BigTable,
> > Cassandra, etc.); via the pluggable cache interface the user can store
> > data into any available cache system (Redis/Memcached). PS: there won't
> > be an external plugin needed for either of these functions, just two
> > Unix ports for communication.
>
> Please keep in mind that HPX has to be portable, i.e. direct usage of *nix
> interfaces is not an option.
>
> Regards Hartmut
> ---------------
> http://boost-spirit.com
> http://stellar.cct.lsu.edu
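One portable way to address Hartmut's concern is to hide the backend behind an abstract interface rather than a raw Unix port. The following is only a sketch; none of these names (`StorageBackend`, `InMemoryBackend`) exist in HPX or the proposal:

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical pluggable storage interface: concrete backends for
// BigTable, Cassandra, Redis, Memcached, etc. would implement it,
// keeping the framework free of any platform-specific transport.
struct StorageBackend {
    virtual ~StorageBackend() = default;
    virtual void put(std::string const& key, std::string const& value) = 0;
    virtual std::optional<std::string> get(std::string const& key) const = 0;
};

// Minimal in-memory reference backend, useful for testing the framework
// without a real storage service.
class InMemoryBackend : public StorageBackend {
public:
    void put(std::string const& key, std::string const& value) override {
        store_[key] = value;
    }
    std::optional<std::string> get(std::string const& key) const override {
        auto it = store_.find(key);
        return it == store_.end() ? std::nullopt
                                  : std::optional<std::string>(it->second);
    }
private:
    std::map<std::string, std::string> store_;
};
```

Each real backend would then translate `put`/`get` into its own client protocol, so the core framework never touches sockets or ports directly.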
_______________________________________________
hpx-users mailing list
[email protected]
https://mail.cct.lsu.edu/mailman/listinfo/hpx-users
