Hey Hitesh, thanks for you thoughts!

In one chooses the multi-vertex approach, i guess there is no simple thing one 
could do to achieve n-iterations where n is flexible based on the output of the 
n-1 iteration.
So you can’t do 
- do max 20 iterationos
- stop in case certain conditions are met

!?

Johannes


> On 25 Mar 2015, at 18:05, Hitesh Shah <[email protected]> wrote:
> 
> Hi Johannes, 
> 
> You would likely not avoid it if you went with the approach of multiple DAGs. 
> For most iterative programs, you do need to checkpoint at some point. The 
> checkpoint would likely need to be reliable to reduce the amount of 
> re-computation needed if the check pointed data is lost. An option would be 
> to use something like the HDFS in-memory storage tier ( which lazily persists 
> to disk ) to reduce the perf overhead. Also, in terms of loop unrolling, a 
> single DAG could be pre-constructed to run multiple iterations using multiple 
> vertices and then use the final vertex of the DAG as a checkpointing 
> mechanism after N iterations/vertices.
> 
> Also, depending on the amount of data being written out, the overhead of 
> writing to HDFS may not be too high. Furthermore, with Tez sessions, there is 
> no real overhead of launching a new DAG ( if some containers are retained ) 
> as compared to trying to do the same with multiple MR jobs. 
> 
> — Hitesh
> 
> 
> On Mar 25, 2015, at 2:02 AM, Johannes Zillmann <[email protected]> 
> wrote:
> 
>> Hey Gopal,
>> 
>>> On 25 Mar 2015, at 05:26, Gopal Vijayaraghavan <[email protected]> wrote:
>>> 
>>> Hi,
>>> 
>>> Iterative algorithms are expressed as DAGs in a loop.
>>> 
>>> The acyclic nature of DAGs, whether in Tez or Spark (since you mention the
>>> paper) make that the natural way to implement that - repeated application
>>> of the same operation over the same data, with a decision condition
>>> determining whether to stay in the loop or not.
>> 
>> Can you point to a piece of code which implements this approach ?
>> If you each look operation is a single DAG, how would that avoid hdfs 
>> barrier ?
>> 
>> Johannes
>> 
>>> 
>>> You might want to look at last year¹s Hadoop Summit presentations for a
>>> direct example of Iterative algorithms with Tez.
>>> 
>>> http://www.slideshare.net/Hadoop_Summit/pig-on-tez-low-latency-etl-with-big
>>> -data/25
>>> 
>>> 
>>> Logistic regression needs you to use a library which implements that
>>> specific algorithm [1].
>>> 
>>> On that note, something which needs incremental iteration can probably be
>>> even more efficient in Tez than these approaches if you unroll the
>>> iteration as 1-1 edges all of the final tasks ending up generating outputs.
>>> 
>>> Cheers,
>>> Gopal
>>> [1] - https://github.com/myui/hivemall#regression
>>> 
>>> 
>>> On 3/24/15, 8:43 PM, "Chang Chen" <[email protected]> wrote:
>>> 
>>>> Hi
>>>> 
>>>> from the PhD Disseration
>>>> <http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf> of
>>>> Matei
>>>> Zaharia, there are four computation models in the large scale clusters:
>>>> 
>>>> 
>>>> 1. *Iterative algorithm*, such as graph processing and machine leaning
>>>> algorithm
>>>> 2. *Relational query*
>>>> 3. *MapReduce*, a general parallel computation model
>>>> 4. *Stream processing*,
>>>> 
>>>> Obviously, Tez supports #2 and #3, but for #1 and #4, I don't see any
>>>> examples.
>>>> 
>>>> As for streaming, I guess if we implement appropriate input,  there is no
>>>> reason that tez can't support in theory.
>>>> 
>>>> But for Machine Leaning, how do we use vertex and edge to express
>>>> *Logistic
>>>> Regression*?
>>>> 
>>>> Thanks
>>>> Chang
>>> 
>>> 
>> 
> 

Reply via email to