Github user iyerr3 commented on a diff in the pull request:
https://github.com/apache/incubator-madlib/pull/105#discussion_r103327658
--- Diff: doc/design/modules/graph.tex ---
@@ -86,3 +131,49 @@ \section{Single Source Shortest Path}
\label{sec:graph:sssp}
This is not a 1-to-1 pseudocode for the implementation since we don't
compare the `toupdate` table records one by one but calculate the overall
minimum. In addition, the comparison with `cur` values take place earlier to
reduce the number of tuples in the `toupdate` table.
+\subsection{Implementation Details}
+
+In this section, we discuss the MADlib implementation of the SSSP
algorithm in depth.
+
+\begin{algorithm}[SSSP$(V,E,start)$] \label{alg:sssp:high}
+\begin{algorithmic}[1]
+ \Repeat
+ \State Find Updates
+ \State Apply updates to the output table
+ \Until {There are no updates}
+\end{algorithmic}
+\end{algorithm}
+
+The implementation consists of two SQL blocks that are called sequentially
inside a loop. We will follow the example graph at Figure~\ref{sssp:example}
with the starting point as $v_0$. The very first update on the output table is
the source vertex. Its weight is $0$ and its parent is itself ($v_0$). After
this initialization step, the loop starts with Find Updates (the individual
updates will be represented with <dest,value,parent> format). Looking at the
example, it is clear that the updates should be <1,1,0>, <2,1,0> and <4,10,0>.
We will assume this iteration is already completed and look how the next
iteration of the algorithm works to explain the implementation details.
+
+\begin{algorithm}[Find Updates$(E,old\_update,out\_table)$]
+\label{alg:sssp:findu}
+\begin{lstlisting}
+INSERT INTO new_update
+ SELECT DISTINCT ON (y.id) y.id AS id,
+ y.val AS val,
+ y.parent AS parent
+ FROM out_table INNER JOIN (
+ SELECT edge_table.dest AS id, x.val AS val,
old_update.id AS parent
+ FROM old_update
+ INNER JOIN edge_table
+ ON (edge_table.src = old_update.id)
+ INNER JOIN (
+ SELECT edge_table.dest AS id,
+ min(old_update.val +
edge_table.weight) AS val
+ FROM old_update INNER JOIN
+ edge_table AS edge_table ON
+ (edge_table.src=old_update.id)
+ GROUP BY edge_table.dest
+ ) x
+ ON (edge_table.dest = x.id)
+ WHERE ABS(old_update.val + edge_table.weight - x.val) <
EPSILON
+ ) AS y ON (y.id = out_table.vertex_id)
+ WHERE y.val<out_table.weight
+\end{lstlisting}
+\end{algorithm}
+
+We begin our analysis of Find Updates function from its innermost
subquery. This subquery (lines 11-16) takes a set of vertices (in the table
$old\_update$) and finds the reachable vertices. In case a vertex is reachable
by multiple vertices, only the path that has the minimum cost is considered.
This means the input vertices need the value of their path as well. In our
example, both $v_1$ and $v_2$ can reach $v_3$. In this case, we would have to
use $v_2$ -> $v_3$ edge since that gives the lowest possible path value. Please
note that we are aggregating the rows using the $min$ operator for each
destination vertex and we are unable to return the source vertex at the same
time. This means we know the value of $v_3$ should be $2$ but we cannot know
its parent ($v_2$) at the same time. To solve this limitation, we combine the
result with $edge$ and $old\_update$ tables (lines 7-10) and get the rows that
has the same minimum value. At this point, we would have to tackle the problem
of tie-breaking. Vertex $v_5$ has two paths leading into: <5,2,1> and <5,2,2>.
The inner subquery will return <5,2> and it will match both of these edges.
However, it is redundant to keep both of them in the update list as that would
require updating the same vertex multiple times in a given iteration. By using
the $DISTINCT$ clause at line 2, we allow the underlying system to accept only
a single one of them. Finally, we want to make sure these updates are actually
leading us to shortest paths. Line 21 ensures that the values stored in the
$out\_table$ does not increase and the solution does not regress throughout the
iterations.
--- End diff --
Couple of suggestions on simplifying:
1. Use meaningful names for the query aliases and refer to those in the
text explanation.
2. Make the explanation a list that gives (shorter) explanation of each set
of lines.
Again, hard-wrapping to 80 characters would help.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---