Hello Sir,

Please find my further improved draft proposal in this document: https://docs.google.com/document/d/1TAq0cfEMhfN53G5hkIym4FScZcHbjSZHTLc2AEHUO7U/edit?usp=sharing
This proposal tries to present a unified batch-processing and stream-processing engine (like Dataflow does). I have included futures as an intermediate step for Dataflow windows (replacing the old ACK model) and used hypertables for storage (to incorporate stream processing into batch processing + Map/Reduce). Please let me know your viewpoint about this approach :)

Thanks,
Aalekh Nigam

On Fri, Mar 11, 2016 at 4:53 PM, Hartmut Kaiser <[email protected]> wrote:

> > > > 2) Along with the Dataflow model, I would also borrow some of the
> > > > features from MillWheel [1] and FlumeJava [2] (features such as
> > > > fault tolerance, running efficient data-parallel pipelines, etc.).
> > >
> > > Perfect. Do you have something more concrete in mind? Any use cases?
> > > Design ideas?
> >
> > Following is my brief design idea that incorporates features from both
> > FlumeJava and MillWheel:
> >
> > Let us assume each step in stream processing to be part of a directed
> > acyclic graph, with outputs represented by directed edges. For stream
> > processing, each output carries three parameters: key, value, and
> > timestamp (the key identifies the processing request, the value its
> > output, and the timestamp the time the request was received). With each
> > user-defined processFunction as a node of the directed acyclic graph,
> > we would send back an ACK (acknowledgment) signal once the data from
> > the i-th node has been received by the (i+1)-th node (this property
> > ensures data is not lost in the process). Data on a given node is
> > stored in a std::map<uint, std::string>. A point worth mentioning here
> > is that each map exists only for a defined time period (a few
> > milliseconds); if no ACK is received within that period, the map for
> > that specific node is cleared.
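To make the ACK model above concrete, here is a minimal sketch of the per-node buffer with a time-limited lifetime. All names (`NodeBuffer`, `emit`, etc.) are illustrative only, not part of HPX or any existing API:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Per-node buffer for the ACK model: each node keeps its (key -> value)
// records until the downstream node acknowledges receipt. If no ACK
// arrives within the timeout, the buffer is cleared, as described above.
class NodeBuffer {
public:
    explicit NodeBuffer(std::chrono::milliseconds timeout)
      : timeout_(timeout), created_(std::chrono::steady_clock::now()) {}

    // Record one output of this node's processFunction.
    void emit(std::uint32_t key, std::string value) {
        data_[key] = std::move(value);
    }

    // Called when the (i+1)-th node confirms receipt of this node's data.
    void ack() {
        data_.clear();
        acked_ = true;
    }

    // Drop the buffer if the ACK window has elapsed without confirmation.
    void expire_if_due(std::chrono::steady_clock::time_point now) {
        if (!acked_ && now - created_ >= timeout_)
            data_.clear();
    }

    std::size_t pending() const { return data_.size(); }

private:
    std::map<std::uint32_t, std::string> data_;
    std::chrono::milliseconds timeout_;
    std::chrono::steady_clock::time_point created_;
    bool acked_ = false;
};
```

The expiry check takes the current time as a parameter so a scheduler (or a test) can drive it deterministically instead of relying on a background timer.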
> > Now, in case there are many parallel data pipelines, the directed edges
> > of the final output from each pipeline will be concatenated using a
> > "join()" function and then further processed by the resulting function
> > (analogous to reduce). A point worth mentioning here is that instead of
> > the ACK model, we could also consider Uber's Ringpop RPC model:
> > https://eng.uber.com/intro-to-ringpop/
> >
> > PS: There is a Dataflow-based framework under development known as
> > Apache Beam: https://github.com/apache/incubator-beam, which can be
> > looked at for inspiration.
>
> All of this sounds like a good first step in the right direction. Several
> comments:
>
> Do we need explicit graphs representing the execution flow? Wouldn't
> something implicit work as well? What I have in mind is to pass along
> future objects representing/holding the result of a computational step and
> using those as 'tokens' representing the data dependency on this result.
> This would allow us to simply use hpx::dataflow to trigger the required
> operation once all futures have become ready.
>
> Not sure if we should build in an explicit ACK feedback in the first step.
> We run in tightly coupled scenarios where failures usually lead to full
> application crashes anyway :) But I agree that having something like that
> in the design might be a good idea.
>
> I'd ask you to keep poking at the problem and to keep developing your
> design. It might be a good idea if you tried to outline things in a bit
> more detail for us to really grasp what you're up to.
>
> > > We definitely will be here to discuss things as you start putting out
> > > ideas, questions, suggestions, etc. I think you have already started
> > > looking at HPX itself; if not - it might be a good time to start doing
> > > so.
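The futures-as-tokens idea, together with the join/reduce of parallel pipelines, can be sketched with plain `std::future` as a stand-in for `hpx::future` (with HPX, `hpx::dataflow` would schedule the final stage once all inputs become ready instead of blocking in `get()`). The stage functions (`parse`, `join_and_reduce`, `run_pipelines`) are hypothetical names for illustration:

```cpp
#include <future>
#include <numeric>
#include <string>
#include <vector>

// Each pipeline stage returns a future; the next stage simply takes the
// futures of its inputs. Waiting on them expresses the data dependency
// implicitly -- no explicit DAG or ACK bookkeeping is needed.

// Stage 1: one parallel pipeline per input record (here just parsing).
int parse(std::string const& record) {
    return std::stoi(record);
}

// "join + reduce": concatenate the results of the parallel pipelines and
// fold them with the resulting function (analogous to reduce).
int join_and_reduce(std::vector<std::future<int>>& parts) {
    std::vector<int> values;
    for (auto& f : parts)
        values.push_back(f.get());  // data dependency: block until ready
    return std::accumulate(values.begin(), values.end(), 0);
}

int run_pipelines(std::vector<std::string> const& records) {
    std::vector<std::future<int>> parts;
    for (auto const& r : records)   // launch the parallel data pipelines
        parts.push_back(std::async(std::launch::async, parse, r));
    return join_and_reduce(parts);
}
```

Because each future already encodes "this result is not ready yet", the explicit per-edge ACK from the earlier design becomes unnecessary in this model.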
> > Along with adding a framework for Dataflow/Map-Reduce to HPX, I also
> > plan to have pluggable storage and cache interfaces, so that a
> > framework user can store data into various storage systems (BigTable,
> > Cassandra, etc.); via the pluggable cache interface the user can store
> > data into any available cache system (Redis/Memcached). PS: there won't
> > be an external plugin needed for either of these functions, just two
> > Unix ports for communication.
>
> Please keep in mind that HPX has to be portable, i.e. direct usage of *nix
> interfaces is not an option.
>
> Regards Hartmut
> ---------------
> http://boost-spirit.com
> http://stellar.cct.lsu.edu
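One portable way to address Hartmut's concern is to hide the backend behind an abstract interface rather than a raw Unix port. The following is only a sketch; none of these names (`StorageBackend`, `InMemoryBackend`) exist in HPX or the proposal:

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical pluggable storage interface: concrete backends for
// BigTable, Cassandra, Redis, Memcached, etc. would implement it,
// keeping the framework free of any platform-specific transport.
struct StorageBackend {
    virtual ~StorageBackend() = default;
    virtual void put(std::string const& key, std::string const& value) = 0;
    virtual std::optional<std::string> get(std::string const& key) const = 0;
};

// Minimal in-memory reference backend, useful for testing the framework
// without a real storage service.
class InMemoryBackend : public StorageBackend {
public:
    void put(std::string const& key, std::string const& value) override {
        store_[key] = value;
    }
    std::optional<std::string> get(std::string const& key) const override {
        auto it = store_.find(key);
        return it == store_.end() ? std::nullopt
                                  : std::optional<std::string>(it->second);
    }
private:
    std::map<std::string, std::string> store_;
};
```

Each real backend would then translate `put`/`get` into its own client protocol, so the core framework never touches sockets or ports directly.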
_______________________________________________
hpx-users mailing list
[email protected]
https://mail.cct.lsu.edu/mailman/listinfo/hpx-users
