This is an automated email from the ASF dual-hosted git repository.

baunsgaard pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/systemds.git


The following commit(s) were added to refs/heads/master by this push:
     new fc0e210  [SYSTEMDS-2621] DBScan and dist builtin
fc0e210 is described below

commit fc0e21059e06d1928020657b3882e986363f7fda
Author: Olga Ovcharenko <[email protected]>
AuthorDate: Sat Aug 15 14:56:44 2020 +0200

    [SYSTEMDS-2621] DBScan and dist builtin
    
    - Implemented dist for euclidean distance matrix.
    - Implemented DBSCAN
    - Added R dbscan package
    - Fixed error for similar points and all noise points
    - Fixed DBSCAN
    - Updated DBSCAN and added documentation
    - Added tests for dist and DBScan.
    
    Closes #1003.
---
 docs/site/builtins-reference.md                    | 61 ++++++++++++++
 scripts/builtin/dbscan.dml                         | 72 +++++++++++++++++
 scripts/builtin/dist.dml                           | 29 +++++++
 .../java/org/apache/sysds/common/Builtins.java     |  4 +-
 .../test/functions/builtin/BuiltinDBSCANTest.java  | 93 ++++++++++++++++++++++
 .../test/functions/builtin/BuiltinDistTest.java    | 85 ++++++++++++++++++++
 src/test/scripts/functions/builtin/dbscan.R        | 31 ++++++++
 src/test/scripts/functions/builtin/dbscan.dml      | 26 ++++++
 src/test/scripts/functions/builtin/dist.R          | 28 +++++++
 src/test/scripts/functions/builtin/dist.dml        | 24 ++++++
 10 files changed, 452 insertions(+), 1 deletion(-)

diff --git a/docs/site/builtins-reference.md b/docs/site/builtins-reference.md
index 0ad39c4..6d772f3 100644
--- a/docs/site/builtins-reference.md
+++ b/docs/site/builtins-reference.md
@@ -29,7 +29,9 @@ limitations under the License.
   * [DML-Bodied Built-In functions](#dml-bodied-built-in-functions)
     * [`confusionMatrix`-Function](#confusionmatrix-function)
     * [`cvlm`-Function](#cvlm-function)
+    * [`DBSCAN`-Function](#DBSCAN-function)
     * [`discoverFD`-Function](#discoverFD-function)
+    * [`dist`-Function](#dist-function)
     * [`glm`-Function](#glm-function)
     * [`gridSearch`-Function](#gridSearch-function)
     * [`hyperband`-Function](#hyperband-function)
@@ -212,6 +214,37 @@ y = X %*% rand(rows = ncol(X), cols = 1)
 [predict, beta] = cvlm(X = X, y = y, k = 4)
 ```
 
+## `DBSCAN`-Function
+
+The dbscan() implements the DBSCAN Clustering algorithm using Euclidian 
distance.
+
+### Usage
+
+```r
+Y = dbscan(X = X, eps = 2.5, minPts = 5)
+```
+
+### Arguments
+
+| Name       | Type            | Default    | Description |
+| :--------- | :-------------- | :--------- | :---------- |
+| X          | Matrix[Double]  | required   | The input Matrix to do DBSCAN 
on. |
+| eps        | Double          | `0.5`      | Maximum distance between two 
points for one to be considered reachable for the other. |
+| minPts     | Int             | `5`        | Number of points in a 
neighborhood for a point to be considered as a core point (includes the point 
itself). |
+
+### Returns
+
+| Type        | Description |
+| :-----------| :---------- |
+| Matrix[Integer] | The mapping of records to clusters |
+
+### Example
+
+```r
+X = rand(rows=1780, cols=180, min=1, max=20) 
+dbscan(X = X, eps = 2.5, minPts = 360)
+```
+
 ## `discoverFD`-Function
 
 The `discoverFD`-function finds the functional dependencies.
@@ -236,6 +269,34 @@ discoverFD(X, Mask, threshold)
 | :----- | :---------- |
 | Double | matrix of functional dependencies |
 
+## `dist`-Function
+
+The `dist`-function is used to compute Euclidian distances between N 
d-dimensional points.
+
+### Usage
+
+```r
+dist(X)
+```
+
+### Arguments
+
+| Name | Type           | Default  | Description |
+| :--- | :------------- | :------- | :---------- |
+| X    | Matrix[Double] | required | (n x d) matrix of d-dimensional points  |
+
+### Returns
+
+| Type           | Description |
+| :------------- | :---------- |
+| Matrix[Double] | (n x n) symmetric matrix of Euclidian distances |
+
+### Example
+
+```r
+X = rand (rows = 5, cols = 5)
+Y = dist(X)
+```
 
 ## `glm`-Function
 
diff --git a/scripts/builtin/dbscan.dml b/scripts/builtin/dbscan.dml
new file mode 100644
index 0000000..74d4040
--- /dev/null
+++ b/scripts/builtin/dbscan.dml
@@ -0,0 +1,72 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+#
+# Implements the DBSCAN clustering algorithm using Euclidian distance matrix
+#
+# INPUT PARAMETERS:
+# ----------------------------------------------------------------------------
+# NAME      TYPE             DEFAULT    MEANING
+# ----------------------------------------------------------------------------
+# X         Matrix[Double]   ---        The input Matrix to do DBSCAN on.
+# eps       Double           0.5        Maximum distance between two points 
for one to be considered reachable for the other.
+# minPts    Int              5          Number of points in a neighborhood for 
a point to be considered as a core point (includes the point itself).
+#
+
+m_dbscan = function (Matrix[double] X, Double eps = 0.5, Integer minPts = 5)
+    return (Matrix[double] clusterMembers)
+{
+    #check input parameter assertions
+    if(minPts < 0) { stop("DBSCAN: Stopping due to invalid inputs: minPts 
should be greater than 0"); }
+    if(eps < 0) { stop("DBSCAN: Stopping due to invalid inputs: Epsilon (eps) 
should be greater than 0"); }
+
+    UNASSIGNED = 0;
+
+    num_records = nrow(X);
+    num_features = ncol(X);
+
+    neighbors = dist(X);
+
+    #find same pts and set their distance to the smallest double representation
+    neighbors = replace(target = neighbors, pattern = 0, replacement = 
2.225e-307)
+    neighbors = neighbors - diag(diag(neighbors));
+
+    # neighbors within eps
+    withinEps = ((neighbors <= eps) * (0 < neighbors));
+    corePts = rowSums(withinEps) + 1 >= minPts;
+
+    clusterMembers = matrix(UNASSIGNED, num_records, 1);
+
+    if (sum(corePts) != 0) {
+        # leave only density reachable pts
+        neighbors = (neighbors * corePts * withinEps) > 0;
+
+        # border pts of multiple clusters
+        border = neighbors * (t(corePts) == 0 & colSums(neighbors) > 1) * 
seq(num_records, 1);
+        border = (border - colMaxs(border)) == 0;
+        neighbors = neighbors * border;
+
+        adjacency = (neighbors + t(neighbors)) > 0;
+
+        clusterMembers = components(G=adjacency, verbose=FALSE);
+        # noise to 0
+        clusterMembers = clusterMembers * (rowSums(adjacency) > 0);
+    }
+}
\ No newline at end of file
diff --git a/scripts/builtin/dist.dml b/scripts/builtin/dist.dml
new file mode 100644
index 0000000..9c473d8
--- /dev/null
+++ b/scripts/builtin/dist.dml
@@ -0,0 +1,29 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+
+# Returns Euclidian distance matrix (distances between N n-dimensional points)
+
+m_dist = function(Matrix[Double] X) return (Matrix[Double] Y) {
+    G = X %*% t(X);
+    I = matrix(1, rows = nrow(G), cols = ncol(G));
+    Y = -2 * (G) + t(I %*% diag(diag(G))) + t(diag(diag(G)) %*% I);
+    Y = sqrt(Y);
+    Y = replace(target = Y, pattern=0/0, replacement = 0);
+}
\ No newline at end of file
diff --git a/src/main/java/org/apache/sysds/common/Builtins.java 
b/src/main/java/org/apache/sysds/common/Builtins.java
index 1cd430c..0a05d15 100644
--- a/src/main/java/org/apache/sysds/common/Builtins.java
+++ b/src/main/java/org/apache/sysds/common/Builtins.java
@@ -84,9 +84,11 @@ public enum Builtins {
        CUMSUMPROD("cumsumprod", false),
        CONFUSIONMATRIX("confusionMatrix", true),
        COR("cor", true),
+       DBSCAN("dbscan", true),
        DETECTSCHEMA("detectSchema", false),
        DIAG("diag", false),
        DISCOVER_FD("discoverFD", true),
+       DIST("dist", true),
        DROP_INVALID_TYPE("dropInvalidType", false),
        DROP_INVALID_LENGTH("dropInvalidLength", false),
        EIGEN("eigen", false, ReturnType.MULTI_RETURN),
@@ -291,7 +293,7 @@ public enum Builtins {
        public static boolean contains(String name, boolean script, boolean 
parameterized) {
                Builtins tmp = get(name);
                return tmp != null && script == tmp.isScript()
-                       && parameterized == tmp.isParameterized();
+                               && parameterized == tmp.isParameterized();
        }
 
        public static Builtins get(String name) {
diff --git 
a/src/test/java/org/apache/sysds/test/functions/builtin/BuiltinDBSCANTest.java 
b/src/test/java/org/apache/sysds/test/functions/builtin/BuiltinDBSCANTest.java
new file mode 100644
index 0000000..401229a
--- /dev/null
+++ 
b/src/test/java/org/apache/sysds/test/functions/builtin/BuiltinDBSCANTest.java
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.test.functions.builtin;
+
+import com.google.common.collect.BiMap;
+import com.google.common.collect.HashBiMap;
+import org.apache.sysds.common.Types.ExecMode;
+import org.apache.sysds.lops.LopProperties.ExecType;
+import org.apache.sysds.runtime.matrix.data.MatrixValue.CellIndex;
+import org.apache.sysds.test.AutomatedTestBase;
+import org.apache.sysds.test.TestConfiguration;
+import org.apache.sysds.test.TestUtils;
+import org.junit.Test;
+import java.util.HashMap;
+
+public class BuiltinDBSCANTest extends AutomatedTestBase
+{
+       private final static String TEST_NAME = "dbscan";
+       private final static String TEST_DIR = "functions/builtin/";
+       private static final String TEST_CLASS_DIR = TEST_DIR + 
BuiltinDBSCANTest.class.getSimpleName() + "/";
+
+       private final static double eps = 1e-3;
+       private final static int rows = 1700;
+       private final static double spDense = 0.99;
+
+       private final static double epsDBSCAN = 1;
+       private final static int minPts = 5;
+
+       @Override
+       public void setUp() { addTestConfiguration(TEST_NAME,new 
TestConfiguration(TEST_CLASS_DIR, TEST_NAME,new String[]{"B"})); }
+
+       @Test
+       public void testDBSCANDefaultCP() { runDBSCAN(true, ExecType.CP); }
+
+       @Test
+       public void testDBSCANDefaultSP() { runDBSCAN(true, ExecType.SPARK); }
+
+       private void runDBSCAN(boolean defaultProb, ExecType instType)
+       {
+               ExecMode platformOld = setExecMode(instType);
+
+               try
+               {
+                       loadTestConfiguration(getTestConfiguration(TEST_NAME));
+                       String HOME = SCRIPT_DIR + TEST_DIR;
+
+                       fullDMLScriptName = HOME + TEST_NAME + ".dml";
+                       programArgs = new String[]{"-nvargs", "X=" + 
input("A"), "Y=" + output("B"), "eps=" + epsDBSCAN, "minPts=" + minPts};
+                       fullRScriptName = HOME + TEST_NAME + ".R";
+                       rCmd = getRCmd(inputDir(), Double.toString(epsDBSCAN), 
Integer.toString(minPts), expectedDir());
+
+                       //generate actual dataset
+                       double[][] A = getNonZeroRandomMatrix(rows, 3, -10, 10, 
7);
+                       writeInputMatrixWithMTD("A", A, true);
+
+                       runTest(true, false, null, -1);
+                       runRScript(true);
+
+                       //compare matrices
+                       HashMap<CellIndex, Double> dmlfile = 
readDMLMatrixFromHDFS("B");
+                       HashMap<CellIndex, Double> rfile  = 
readRMatrixFromFS("B");
+
+                       //map cluster ids
+                       //NOTE: border points that are reachable from more than 
1 cluster
+                       // are assigned to lowest point id, not cluster id -> 
can fail in this case, but it's still correct
+                       BiMap<Double, Double> merged = HashBiMap.create();
+                       rfile.forEach((key, value) -> merged.put(value, 
dmlfile.get(key)));
+                       dmlfile.replaceAll((k, v) -> merged.inverse().get(v));
+
+                       TestUtils.compareMatrices(dmlfile, rfile, eps, 
"Stat-DML", "Stat-R");
+               }
+               finally {
+                       rtplatform = platformOld;
+               }
+       }
+}
\ No newline at end of file
diff --git 
a/src/test/java/org/apache/sysds/test/functions/builtin/BuiltinDistTest.java 
b/src/test/java/org/apache/sysds/test/functions/builtin/BuiltinDistTest.java
new file mode 100644
index 0000000..26970f6
--- /dev/null
+++ b/src/test/java/org/apache/sysds/test/functions/builtin/BuiltinDistTest.java
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.test.functions.builtin;
+
+import org.apache.sysds.common.Types.ExecMode;
+import org.apache.sysds.lops.LopProperties.ExecType;
+import org.apache.sysds.runtime.matrix.data.MatrixValue.CellIndex;
+import org.apache.sysds.test.AutomatedTestBase;
+import org.apache.sysds.test.TestConfiguration;
+import org.apache.sysds.test.TestUtils;
+import org.junit.Test;
+
+import java.util.HashMap;
+
+public class BuiltinDistTest extends AutomatedTestBase
+{
+       private final static String TEST_NAME = "dist";
+       private final static String TEST_DIR = "functions/builtin/";
+       private static final String TEST_CLASS_DIR = TEST_DIR + 
BuiltinDistTest.class.getSimpleName() + "/";
+       
+       private final static double eps = 1e-3;
+       private final static int rows = 1765;
+       private final static double spDense = 0.99;
+       
+       @Override
+       public void setUp() {
+               addTestConfiguration(TEST_NAME,new 
TestConfiguration(TEST_CLASS_DIR, TEST_NAME,new String[]{"B"})); 
+       }
+
+       @Test
+       public void testDistDefaultCP() { runDist(true, ExecType.CP); }
+       
+       @Test
+       public void testDistSP() {
+               runDist(true, ExecType.SPARK);
+       }
+
+       private void runDist(boolean defaultProb, ExecType instType)
+       {
+               ExecMode platformOld = setExecMode(instType);
+               
+               try
+               {
+                       loadTestConfiguration(getTestConfiguration(TEST_NAME));
+                       
+                       String HOME = SCRIPT_DIR + TEST_DIR;
+                       fullDMLScriptName = HOME + TEST_NAME + ".dml";
+                       programArgs = new String[]{"-args", input("A"), 
output("B") };
+                       fullRScriptName = HOME + TEST_NAME + ".R";
+                       rCmd = "Rscript" + " " + fullRScriptName + " " + 
inputDir() + " " + expectedDir();
+                       
+                       //generate actual dataset 
+                       double[][] A = getRandomMatrix(rows, 10, -1, 1, 
spDense, 7);
+                       writeInputMatrixWithMTD("A", A, true);
+                       
+                       runTest(true, false, null, -1);
+                       runRScript(true);
+                       
+                       //compare matrices
+                       HashMap<CellIndex, Double> dmlfile = 
readDMLMatrixFromHDFS("B");
+                       HashMap<CellIndex, Double> rfile  = 
readRMatrixFromFS("B");
+                       TestUtils.compareMatrices(dmlfile, rfile, eps, 
"Stat-DML", "Stat-R");
+               }
+               finally {
+                       rtplatform = platformOld;
+               }
+       }
+}
diff --git a/src/test/scripts/functions/builtin/dbscan.R 
b/src/test/scripts/functions/builtin/dbscan.R
new file mode 100644
index 0000000..e332528
--- /dev/null
+++ b/src/test/scripts/functions/builtin/dbscan.R
@@ -0,0 +1,31 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+args<-commandArgs(TRUE)
+library("Matrix")
+options(digits=22)
+library("dbscan")
+
+X = as.matrix(readMM(paste(args[1], "A.mtx", sep="")));
+eps = as.double(args[2]);
+minPts = as.integer(args[3]);
+Ys = dbscan(X, eps, minPts);
+Y = as.matrix(Ys$cluster, FALSE);
+writeMM(as(Y, "CsparseMatrix"), paste(args[4], "B", sep=""));
\ No newline at end of file
diff --git a/src/test/scripts/functions/builtin/dbscan.dml 
b/src/test/scripts/functions/builtin/dbscan.dml
new file mode 100644
index 0000000..6d5e1eb
--- /dev/null
+++ b/src/test/scripts/functions/builtin/dbscan.dml
@@ -0,0 +1,26 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+X = read($X);
+eps = as.double($eps);
+minPts = as.integer($minPts);
+Y = dbscan(X, eps, minPts);
+write(Y, $Y);
\ No newline at end of file
diff --git a/src/test/scripts/functions/builtin/dist.R 
b/src/test/scripts/functions/builtin/dist.R
new file mode 100644
index 0000000..dde54f4
--- /dev/null
+++ b/src/test/scripts/functions/builtin/dist.R
@@ -0,0 +1,28 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+args<-commandArgs(TRUE)
+options(digits=22)
+library("Matrix")
+
+X = as.matrix(readMM(paste(args[1], "A.mtx", sep="")));
+R = round(as.matrix(dist(X)), 3);
+diag(R) = 0;
+writeMM(as(R, "CsparseMatrix"), paste(args[2], "B", sep=""));
\ No newline at end of file
diff --git a/src/test/scripts/functions/builtin/dist.dml 
b/src/test/scripts/functions/builtin/dist.dml
new file mode 100644
index 0000000..9039a97
--- /dev/null
+++ b/src/test/scripts/functions/builtin/dist.dml
@@ -0,0 +1,24 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+X = read($1);
+Y = dist(X);
+write(Y, $2);

Reply via email to