> > > 2) Along with Dataflow model, I would also borrow some of the
> > > features from MillWheel [1] and FlumeJava [2] (features such as
> > > fault tolerance, running efficient data-parallel pipelines, etc.).
> >
> > Perfect. Do you have something more concrete in mind? Any use cases?
> > Design ideas?
>
> Following is my brief design idea that incorporates features from both
> FlumeJava and MillWheel:
>
> Let us model stream processing as a directed acyclic graph whose
> directed edges carry the outputs of each step. For stream processing
> each output would have three parameters, i.e., key, value, and
> timestamp (the key identifies the processing request, the value is its
> output, and the timestamp is the time we received the request). Each
> user-defined processFunction is a node of the directed acyclic graph,
> and we would send back an ACK (acknowledgment) signal once the data
> from the i-th node has been received by the (i+1)-th node (this
> property ensures data is not lost in the process). Data on a given node
> is stored in a std::map<uint, std::string>. A point worth mentioning
> here is that each map would exist only for a defined time period (a few
> milliseconds); if an ACK is not received within that period, the map
> for that specific node will be cleared. Now, in case there are many
> parallel data pipelines, the directed edges carrying the final output
> of each pipeline will be concatenated using a "join()" function and
> then further processed by the resulting function (analogous to reduce).
> A rough sketch of such a node follows at the end of this mail.
>
> Another point worth mentioning is that instead of the ACK model, we can
> also consider Uber's Ringpop RPC model:
> https://eng.uber.com/intro-to-ringpop/
>
> PS: There is a dataflow-based framework under development known as
> Apache Beam (https://github.com/apache/incubator-beam) which can be
> looked at for inspiration.
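> PPS: To make the node design above more concrete, here is a very
> rough, untested sketch; all names (Record, Node, ack_timeout,
> process_fn) are placeholders of mine, not part of any existing API:
>
>     #include <chrono>
>     #include <functional>
>     #include <map>
>     #include <string>
>
>     using clock_type = std::chrono::steady_clock;
>
>     // one (key, value, timestamp) item travelling along an edge
>     struct Record
>     {
>         unsigned key;                      // the processing request
>         std::string value;                 // output of the previous step
>         clock_type::time_point timestamp;  // time the request arrived
>     };
>
>     class Node
>     {
>         // results held until the downstream node ACKs receipt
>         // ('unsigned' is used here since 'uint' is not standard C++)
>         std::map<unsigned, std::string> pending_;
>         // when each pending entry was produced
>         std::map<unsigned, clock_type::time_point> sent_at_;
>         std::chrono::milliseconds ack_timeout_{5};   // a few ms
>         std::function<std::string(std::string const&)> process_fn_;
>
>     public:
>         explicit Node(std::function<std::string(std::string const&)> f)
>           : process_fn_(std::move(f))
>         {
>         }
>
>         // data arrives from the i-th node: process it and buffer the
>         // result until the (i+1)-th node acknowledges it
>         std::string receive(Record const& r)
>         {
>             std::string out = process_fn_(r.value);
>             pending_[r.key] = out;
>             sent_at_[r.key] = clock_type::now();
>             return out;
>         }
>
>         // ACK from the (i+1)-th node: data arrived, drop our copy
>         void ack(unsigned key)
>         {
>             pending_.erase(key);
>             sent_at_.erase(key);
>         }
>
>         // called periodically: clear entries whose ACK never came
>         void expire()
>         {
>             auto now = clock_type::now();
>             for (auto it = sent_at_.begin(); it != sent_at_.end();)
>             {
>                 if (now - it->second > ack_timeout_)
>                 {
>                     pending_.erase(it->first);
>                     it = sent_at_.erase(it);
>                 }
>                 else
>                 {
>                     ++it;
>                 }
>             }
>         }
>     };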
All of this sounds like a good first step in the right direction. Several
comments:

Do we need explicit graphs representing the execution flow? Wouldn't
something implicit work as well? What I have in mind is to pass along
future objects representing/holding the result of a computational step
and to use those as 'tokens' representing the data dependency on this
result. This would allow us to simply use hpx::dataflow to trigger the
required operation once all futures have become ready (see the small
sketch at the end of this mail).

I am not sure we should build an explicit ACK feedback into the first
step. We run in tightly coupled scenarios where failures usually lead to
full application crashes anyways :) But I agree, having something like
that in the design might be a good idea.

I'd ask you to keep poking at the problem and to keep developing your
design. It might be a good idea if you tried to outline things in a bit
more detail for us to really grasp what you're up to.

> > We definitely will be here to discuss things as you start putting out
> > ideas, questions, suggestions, etc. I think you have already started
> > looking at HPX itself, if not - it might be a good time to start
> > doing so.
>
> Along with adding a framework for Dataflow/Map-Reduce to HPX, I also
> plan to provide a pluggable storage and cache interface so that
> framework users can store data in various storage systems (BigTable,
> Cassandra, etc.); through the pluggable cache interface, users can
> store data in any available cache system (Redis/Memcache). PS: there
> won't be an external plugin needed for either of these functions, just
> two Unix ports for communication.

Please keep in mind that HPX has to be portable, i.e. direct usage of
*nix interfaces is not an option.

Regards
Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu
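PS: Just to illustrate the futures-as-tokens idea, here is an untested
sketch; the two stage functions (parse, enrich) are made up for the
example, and the includes may differ depending on your HPX version:

    #include <hpx/hpx_main.hpp>
    #include <hpx/include/async.hpp>
    #include <hpx/include/lcos.hpp>

    #include <iostream>
    #include <string>

    // two made-up pipeline stages
    std::string parse(std::string const& raw)
    {
        return raw + " [parsed]";
    }

    std::string enrich(std::string const& rec)
    {
        return rec + " [enriched]";
    }

    int main()
    {
        // each step returns a future; the future itself is the 'token'
        // representing the data dependency on this step's result
        hpx::future<std::string> f1 =
            hpx::async(parse, std::string("some input"));

        // hpx::dataflow triggers the next step once all argument
        // futures have become ready - no explicit graph is needed
        hpx::future<std::string> f2 = hpx::dataflow(
            [](hpx::future<std::string> f)
            {
                return enrich(f.get());
            },
            std::move(f1));

        std::cout << f2.get() << std::endl;
        return 0;
    }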
