[madlib] branch master updated: WCC: Update the design doc

okislal Thu, 18 Apr 2019 14:46:29 -0700

This is an automated email from the ASF dual-hosted git repository.

okislal pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/madlib.git



The following commit(s) were added to refs/heads/master by this push:
     new 8feb9bf  WCC: Update the design doc
8feb9bf is described below

commit 8feb9bfdfcd8e1162d08bca9d2199cba39af8247
Author: Orhan Kislal <[email protected]>
AuthorDate: Mon Apr 15 14:53:58 2019 -0700

    WCC: Update the design doc
    
    Closes #369
---
 doc/design/modules/graph.tex | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/doc/design/modules/graph.tex b/doc/design/modules/graph.tex
index 18c8ba7..7df2bc6 100644
--- a/doc/design/modules/graph.tex
+++ b/doc/design/modules/graph.tex
@@ -600,7 +600,7 @@ WHERE newupdate.id = toupdate.id
 Finally, the $message$ table is updated with potential new
 component IDs for active vertices using the following query:
 
-\begin{algorithm}[Update oldupdate table]
+\begin{algorithm}[Update oldupdate table] \label{wcc:message}
 \begin{lstlisting}
 CREATE TEMP TABLE message AS
 SELECT id, MIN(component_id) AS component_id
@@ -624,6 +624,20 @@ associated with each vertex in $G$. The component ID of 
all the vertices
 in a component is equal to the smallest vertex ID in the corresponding
 subgraph.
 
+\subsection{Edge Table Duplication} \label{sec:wcc:duplication}
+
+The queries explained in the Section~\ref{sec:wcc:implementation} expose a
+potential performance drawback in Greenplum systems. In general, we advise
+that the edge tables should be distributed by their source columns. However,
+in WCC, we use both source and destination columns of the edge table in JOIN
+clauses. In addition, we employ a GROUP BY clause using the column that did
+not serve as the join key. Algorithm~\ref{wcc:message} shows that when $dest$
+is used for the JOIN clause, $src$ is renamed to $id$ to be used for GROUP BY
+and vice versa. This query forces multiple redistribute motions in the
+database which might cause performance degradation. To address this issue, we
+create a duplicate of the edge table and distribute on the destination column
+(only for Greenplum systems).
+
 \section{Breadth-first Search} \label{sec:graph:bfs}
 
 Given a graph $G$ and a user-specified origin vertex $src$, this algorithm

[madlib] branch master updated: WCC: Update the design doc

Reply via email to