Hey Hitesh,

thanks for your thoughts! If one chooses the multi-vertex approach, I guess there is no simple way to achieve n iterations where n is flexible, based on the output of iteration n-1. So you can't do
- do max 20 iterations
- stop in case certain conditions are met
!?

Johannes

> On 25 Mar 2015, at 18:05, Hitesh Shah <[email protected]> wrote:
>
> Hi Johannes,
>
> You would likely not avoid it if you went with the approach of multiple DAGs.
> For most iterative programs, you do need to checkpoint at some point. The
> checkpoint would likely need to be reliable to reduce the amount of
> re-computation needed if the checkpointed data is lost. An option would be
> to use something like the HDFS in-memory storage tier (which lazily persists
> to disk) to reduce the perf overhead. Also, in terms of loop unrolling, a
> single DAG could be pre-constructed to run multiple iterations using multiple
> vertices and then use the final vertex of the DAG as a checkpointing
> mechanism after N iterations/vertices.
>
> Also, depending on the amount of data being written out, the overhead of
> writing to HDFS may not be too high. Furthermore, with Tez sessions, there is
> no real overhead of launching a new DAG (if some containers are retained)
> as compared to trying to do the same with multiple MR jobs.
>
> -- Hitesh
>
> On Mar 25, 2015, at 2:02 AM, Johannes Zillmann <[email protected]> wrote:
>
>> Hey Gopal,
>>
>>> On 25 Mar 2015, at 05:26, Gopal Vijayaraghavan <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> Iterative algorithms are expressed as DAGs in a loop.
>>>
>>> The acyclic nature of DAGs, whether in Tez or Spark (since you mention the
>>> paper) makes that the natural way to implement that - repeated application
>>> of the same operation over the same data, with a decision condition
>>> determining whether to stay in the loop or not.
>>
>> Can you point to a piece of code which implements this approach?
>> If each loop operation is a single DAG, how would that avoid the HDFS
>> barrier?
>>
>> Johannes
>>
>>>
>>> You might want to look at last year's Hadoop Summit presentations for a
>>> direct example of Iterative algorithms with Tez.
>>>
>>> http://www.slideshare.net/Hadoop_Summit/pig-on-tez-low-latency-etl-with-big-data/25
>>>
>>>
>>> Logistic regression needs you to use a library which implements that
>>> specific algorithm [1].
>>>
>>> On that note, something which needs incremental iteration can probably be
>>> even more efficient in Tez than these approaches if you unroll the
>>> iteration as 1-1 edges, with all of the final tasks ending up generating
>>> outputs.
>>>
>>> Cheers,
>>> Gopal
>>> [1] - https://github.com/myui/hivemall#regression
>>>
>>>
>>> On 3/24/15, 8:43 PM, "Chang Chen" <[email protected]> wrote:
>>>
>>>> Hi
>>>>
>>>> from the PhD Dissertation
>>>> <http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf> of
>>>> Matei Zaharia, there are four computation models in the large scale
>>>> clusters:
>>>>
>>>> 1. *Iterative algorithm*, such as graph processing and machine learning
>>>>    algorithm
>>>> 2. *Relational query*
>>>> 3. *MapReduce*, a general parallel computation model
>>>> 4. *Stream processing*
>>>>
>>>> Obviously, Tez supports #2 and #3, but for #1 and #4, I don't see any
>>>> examples.
>>>>
>>>> As for streaming, I guess if we implement appropriate input, there is no
>>>> reason that Tez can't support it in theory.
>>>>
>>>> But for Machine Learning, how do we use vertex and edge to express
>>>> *Logistic Regression*?
>>>>
>>>> Thanks
>>>> Chang
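
The two ideas discussed in this thread ("DAGs in a loop" with a decision condition, and unrolling several iterations into one DAG that ends in a checkpoint) can be combined in a driver loop. The following is only a minimal sketch in plain Java and deliberately uses no Tez API: the class IterativeDriver, its run method, and the iterationsPerDag parameter are hypothetical names, the step function stands in for one iteration (one vertex of an unrolled DAG), and the Newton iteration in main is a toy stand-in for a real algorithm.

```java
import java.util.function.BiPredicate;
import java.util.function.UnaryOperator;

public class IterativeDriver {

    /** Final state, number of iterations actually run, and whether the stop condition fired. */
    public static final class Result<S> {
        public final S state;
        public final int iterations;
        public final boolean converged;

        Result(S state, int iterations, boolean converged) {
            this.state = state;
            this.iterations = iterations;
            this.converged = converged;
        }
    }

    /**
     * Runs at most maxIterations iterations in batches of iterationsPerDag.
     * Each batch stands in for one pre-constructed (unrolled) DAG whose final
     * vertex writes a checkpoint; the stop condition is only evaluated there.
     */
    public static <S> Result<S> run(S initial,
                                    UnaryOperator<S> step,
                                    BiPredicate<S, S> converged,
                                    int maxIterations,
                                    int iterationsPerDag) {
        if (maxIterations < 1 || iterationsPerDag < 1) {
            throw new IllegalArgumentException("iteration counts must be >= 1");
        }
        S checkpoint = initial; // last reliably stored state
        int done = 0;
        while (done < maxIterations) {
            // Hypothetical stand-in for submitting one unrolled DAG.
            int batch = Math.min(iterationsPerDag, maxIterations - done);
            S next = checkpoint;
            for (int j = 0; j < batch; j++) {
                next = step.apply(next);
            }
            done += batch;
            // Decision condition: compare the new checkpoint with the previous one.
            if (converged.test(checkpoint, next)) {
                return new Result<>(next, done, true);
            }
            checkpoint = next;
        }
        return new Result<>(checkpoint, done, false);
    }

    public static void main(String[] args) {
        // Toy example: Newton iteration for sqrt(2), max 20 iterations, 5 per "DAG".
        Result<Double> r = run(1.0,
                x -> 0.5 * (x + 2.0 / x),
                (a, b) -> Math.abs(a - b) < 1e-9,
                20, 5);
        System.out.println("converged=" + r.converged
                + " iterations=" + r.iterations + " state=" + r.state);
    }
}
```

With iterationsPerDag set to 1 this is the plain "one DAG per iteration" loop, which can stop as soon as the condition is met. With a larger value the loop can only stop at a DAG boundary, so it may run up to iterationsPerDag - 1 iterations more than strictly needed; that is the flexibility trade-off raised at the top of this mail.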
