[GitHub] incubator-madlib pull request #105: Graph:

iyerr3 Mon, 27 Feb 2017 17:13:54 -0800

Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/105#discussion_r103327658
  
    --- Diff: doc/design/modules/graph.tex ---
    @@ -86,3 +131,49 @@ \section{Single Source Shortest Path} 
\label{sec:graph:sssp}
     
     This is not a 1-to-1 pseudocode for the implementation since we don't 
compare the `toupdate` table records one by one but calculate the overall 
minimum. In addition, the comparison with `cur` values take place earlier to 
reduce the number of tuples in the `toupdate` table.
     
    +\subsection{Implementation Details}
    +
    +In this section, we discuss the MADlib implementation of the SSSP 
algorithm in depth.
    +
    +\begin{algorithm}[SSSP$(V,E,start)$] \label{alg:sssp:high}
    +\begin{algorithmic}[1]
    +   \Repeat
    +           \State Find Updates
    +           \State Apply updates to the output table
    +   \Until {There are no updates}
    +\end{algorithmic}
    +\end{algorithm}
    +
    +The implementation consists of two SQL blocks that are called sequentially 
inside a loop. We will follow the example graph at Figure~\ref{sssp:example} 
with the starting point as $v_0$. The very first update on the output table is 
the source vertex. Its weight is $0$ and its parent is itself ($v_0$). After 
this initialization step, the loop starts with Find Updates (the individual 
updates will be represented with <dest,value,parent> format). Looking at the 
example, it is clear that the updates should be <1,1,0>, <2,1,0> and <4,10,0>. 
We will assume this iteration is already completed and look how the next 
iteration of the algorithm works to explain the implementation details.
    +
    +\begin{algorithm}[Find Updates$(E,old\_update,out\_table)$]
    +\label{alg:sssp:findu}
    +\begin{lstlisting}
    +INSERT INTO new_update
    +   SELECT DISTINCT ON (y.id) y.id AS id,
    +           y.val AS val,
    +           y.parent AS parent
    +   FROM out_table INNER JOIN (
    +                   SELECT edge_table.dest AS id, x.val AS val, 
old_update.id AS parent
    +                   FROM old_update
    +                           INNER JOIN edge_table
    +                           ON (edge_table.src = old_update.id)
    +                           INNER JOIN (
    +                                   SELECT edge_table.dest AS id,
    +                                           min(old_update.val + 
edge_table.weight) AS val
    +                                   FROM old_update INNER JOIN
    +                                           edge_table AS edge_table ON
    +                                           (edge_table.src=old_update.id)
    +                                   GROUP BY edge_table.dest
    +                           ) x
    +                           ON (edge_table.dest = x.id)
    +                   WHERE ABS(old_update.val + edge_table.weight - x.val) < 
EPSILON
    +           ) AS y ON (y.id = out_table.vertex_id)
    +   WHERE y.val<out_table.weight
    +\end{lstlisting}
    +\end{algorithm}
    +
    +We begin our analysis of Find Updates function from its innermost 
subquery. This subquery (lines 11-16) takes a set of vertices (in the table 
$old\_update$) and finds the reachable vertices. In case a vertex is reachable 
by multiple vertices, only the path that has the minimum cost is considered. 
This means the input vertices need the value of their path as well. In our 
example, both $v_1$ and $v_2$ can reach $v_3$. In this case, we would have to 
use $v_2$ -> $v_3$ edge since that gives the lowest possible path value. Please 
note that we are aggregating the rows using the $min$ operator for each 
destination vertex and we are unable to return the source vertex at the same 
time. This means we know the value of $v_3$ should be $2$ but we cannot know 
its parent ($v_2$) at the same time. To solve this limitation, we combine the 
result with $edge$ and $old\_update$ tables (lines 7-10) and get the rows that 
has the same minimum value. At this point, we would have to tackle the problem 
 of tie-breaking. Vertex $v_5$ has two paths leading into: <5,2,1> and <5,2,2>. 
The inner subquery will return <5,2> and it will match both of these edges. 
However, it is redundant to keep both of them in the update list as that would 
require updating the same vertex multiple times in a given iteration. By using 
the $DISTINCT$ clause at line 2, we allow the underlying system to accept only 
a single one of them. Finally, we want to make sure these updates are actually 
leading us to shortest paths. Line 21 ensures that the values stored in the 
$out\_table$ does not increase and the solution does not regress throughout the 
iterations.
    --- End diff --
    
    Couple of suggestions on simplifying: 
    1. Use meaningful names for the query aliases and refer to those in the 
text explanation. 
    2. Make the explanation a list that gives (shorter) explanation of each set 
of lines. 
    
    Again, hard-wrapping to 80 characters would help.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-madlib pull request #105: Graph:

Reply via email to