[GitHub] [systemds] Baunsgaard commented on a change in pull request #993: [SYSTEMDS-265] Entity resolution pipelines and primitives.

GitBox Mon, 20 Jul 2020 03:25:01 -0700


Baunsgaard commented on a change in pull request #993:
URL: https://github.com/apache/systemds/pull/993#discussion_r457212235




##########
File path: scripts/staging/entity-resolution/README.md
##########
@@ -0,0 +1,99 @@
+# Entity Resolution
+
+## Pipeline design and primitives
+
+We provide two example scripts, `entity-clustering.dml` and `binary-entity-resolution.dml`. These handle reading input
+files and writing output files and call functions provided in `primitives/pipeline.dml`.
+
+### Input files
+
+The provided scripts can read two types of input files. The token file is mandatory since it contains the row identifiers,
+but the embedding file is optional. The actual use of tokens and/or embeddings can be configured via command line parameters
+to the scripts.
+
+##### Token files
+
+This file type is a CSV file with 3 columns. The first column is the string or integer row identifier, the second is the
+string token, and the third is the number of occurences. This simple format is used as a bag-of-words representation.
+
+##### Embedding files
+
+This file type is a CSV matrix file with each row containing arbitrary-dimensional embeddings. The order of row identifiers
+is assumed to be the same as in the token file. This saves some computation and storage time, but could be changed with
+some modifications to the example scripts.
+
+### Primitives
+
+While the example scripts may be sufficient for many simple use cases, we aim to provide a toolkit of composable functions
+to facilitate more complex tasks. The top-level pipelines are defined as a couple of functions in `primitives/pipeline.dml`.
+The goal is that it should be relatively easy to copy one of these pipelines and swap out the primitive functions used
+to create a custom pipeline.
+
+To convert the input token file into a bag-of-words contingency table representation, we provide the functions
+`convert_frame_tokens_to_matrix_bow` and `convert_frame_tokens_to_matrix_bow_2` in  `primitives/preprocessing.dml`.
+The latter is used to compute a compatible contigency table with matching vocabulary for binary entity resolution.
+
+We provide naive, constant-size blocking and locality-sensitive hashing (LSH) as functions in `primitives/blocking.dml`.
+
+For entity clustering, we only provide a simple clustering approach which makes all connected components in an adjacency
+matrix fully connected. This function is located in `primitives/clustering.dml`.
+
+To restore an adjacency matrix to a list of pairs, we provide the functions `untable` and `untable_offset` in
+`primitives/postprocessing.dml`.
+
+Finally, `primitives/evaluation.dml` defines some metrics that can be used to evaluate the performance of the entity
+resolution pipelines. They are used in the script `eval-entity-resolution.dml`.
+
+## Testing and Examples
+
+There is a test data repository that was used to develop these scripts at
+[repo](https://github.com/skogler/systemds-amls-project-data). In the examples below, it is assumed that this repo is
+cloned as `data` in the SystemDS root folder. The data in that repository is sourced from the Uni Leipzig entity resolution
+[benchmark](https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution).
+
+### Preprocessing
+
+Since there is no tokenization functionality in SystemDS yet, we provide a Python preprocessing script in the data repository
+that tokenizes the text columns and performs some simple embedding lookup using Glove embeddings.
+
+The tokens are written as CSV files to enable Bag-of-Words representations as well as matrices with combined embeddings. D
+epending on the type of data, one or the other or a combination of both may be better. The SystemDS DML scripts can be
+called with different parameters to experiment with this.

Review comment:
       I believe tokenization should be possible in SystemDS itself, but currently it is a bit of a hassle.

##########
File path: scripts/staging/entity-resolution/README.md
##########
@@ -0,0 +1,99 @@
+# Entity Resolution
+
+## Testing and Examples
+
+There is a test data repository that was used to develop these scripts at
+[repo](https://github.com/skogler/systemds-amls-project-data). In the examples below, it is assumed that this repo is
+cloned as `data` in the SystemDS root folder. The data in that repository is sourced from the Uni Leipzig entity resolution
+[benchmark](https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution).

Review comment:
       How did this implementation compare to this benchmark?

##########
File path: pom.xml
##########
@@ -257,12 +257,6 @@
                                <version>3.0.0-M4</version><!--$NO-MVN-MAN-VER$ -->
                                <configuration>
                                        <skipTests>${skipTests}</skipTests>
-                                       <parallel>classes</parallel>
-                                       <!-- <useUnlimitedThreads>true</useUnlimitedThreads> -->
-                                       <threadCount>12</threadCount>
-                                       <!-- 1C means the number of threads times 1 possible maximum forks for testing-->
-                                       <forkCount>1C</forkCount>
-                                       <reuseForks>false</reuseForks>

Review comment:
       I would really appreciate it if this were switched back.
   
   Running the tests in parallel is a major time saver (going from ~3 hours to 20 minutes on my laptop).
   
   In your fix you combine the `notThreadsafe` flag and `parameterized tests`, which should be sufficient for executing the tests.
   
   Most of the time, it crashes for me because our test setup is not very nice: the TestName and TestDir folders get removed by other tests if the name is equal. Since we parallelize on classes, a parameterized test launches multiple `classes` that then delete each other's test folders.
   Another case is when the parallel tests do not run in their own forks and therefore share static resources, resulting in equal temporary directories.
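
   One way to rule out such collisions is to give every test invocation its own scratch directory. The following is only a minimal sketch under that assumption; `UniqueTestDirSketch`, `createScratchDir`, and `deleteScratchDir` are hypothetical names and not part of `AutomatedTestBase` or the existing test utilities:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not part of the SystemDS test utilities.
class UniqueTestDirSketch {

    // Creates a fresh scratch directory whose name starts with the test name.
    // Files.createTempDirectory appends a random suffix, so two tests that share
    // a name (for example two parameterized runs) still get distinct folders.
    static Path createScratchDir(String testName) {
        try {
            return Files.createTempDirectory(testName + "-");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Removes an empty scratch directory again; does nothing if it is already gone.
    static void deleteScratchDir(Path dir) {
        try {
            Files.deleteIfExists(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path first = createScratchDir("EntityResolutionBinaryTest");
        Path second = createScratchDir("EntityResolutionBinaryTest");
        System.out.println("distinct directories: " + !first.equals(second));
        deleteScratchDir(first);
        deleteScratchDir(second);
    }
}
```

   With directories created this way, a test can only delete its own folder, regardless of whether the parallel tests share a fork.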
   

##########
File path: scripts/staging/entity-resolution/README.md
##########
@@ -0,0 +1,99 @@
+# Entity Resolution

Review comment:
       I like this README, but I would suggest moving it to our docs folder under `docs/site/.` and adding a link to it from `/docs/index.md`. Once you do that, you need to add a header:
   
   ```code
   ---
   layout: site
   title: Entity Resolution
   ---
   <!--
   {% comment %}
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to you under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at
   
   http://www.apache.org/licenses/LICENSE-2.0
   
   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
   {% endcomment %}
   -->
   
   ```

##########
File path: scripts/staging/entity-resolution/README.md
##########
@@ -0,0 +1,99 @@
+# Entity Resolution
+
+## Testing and Examples
+
+There is a test data repository that was used to develop these scripts at
+[repo](https://github.com/skogler/systemds-amls-project-data). In the examples below, it is assumed that this repo is

Review comment:
       I'm not a huge fan of the data being located in this repository, but then again I don't know where we would store such a thing in Apache, or if it is even possible.
   
   Opinions, @mboehm7?
   

##########
File path: scripts/staging/entity-resolution/README.md
##########
@@ -0,0 +1,99 @@
+# Entity Resolution

Review comment:
       Also, somewhere here in the documentation, please mention the paper it is based on; I noticed it in `pipeline.dml` only:
   ```
   Distributed representations of tuples for entity resolution.  Proceedings of the VLDB Endowment
   ```
   
   Also, no LSTM has been used here.

##########
File path: src/test/java/org/apache/sysds/test/applications/EntityResolutionBinaryTest.java
##########
@@ -0,0 +1,95 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sysds.test.applications;
+
+import org.apache.sysds.test.AutomatedTestBase;
+import org.apache.sysds.test.TestConfiguration;
+import org.junit.Assert;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+
+import java.nio.file.Files;
+import java.nio.file.Paths;
+import java.util.Arrays;
+import java.util.Collection;
+
+@RunWith(value = Parameterized.class)
+@net.jcip.annotations.NotThreadSafe

Review comment:
       If you want to avoid the NotThreadSafe, I suggest making multiple tests that call the script like the example below.
   I know it is not as beautiful and concise as the previous version, but it would make the test non-blocking for the other tests.
   
   ```java
   ...
   
   @Test
   public void test1(){
     testScriptEndToEnd(1,1);
   }
   
   public void testScriptEndToEnd(int numLshHashTables, int numLshHyperplanes)
   ...
   ```

##########
File path: scripts/staging/entity-resolution/README.md
##########
@@ -0,0 +1,99 @@
+# Entity Resolution
+
+### Entity clustering
+
+In this case we detect duplicates within one database. As an example, we use the benchmark dataset Affiliations from Uni Leipzig.
+For this dataset, embeddings do not work well since the data is mostly just names. Therefore, we encode it as Bag-of-Words vectors
+in the example below. This dataset would benefit from more preprocessing, as simply matching words for all the different kinds of
+abbreviations does not work particularly well.
+
+Example command to run on Affiliations dataset:
+```
+./bin/systemds ./scripts/algorithms/entity-resolution/entity-clustering.dml -nvargs FX=data/affiliationstrings/affiliationstrings_tokens.csv OUT=data/affiliationstrings/affiliationstrings_res.csv store_mapping=FALSE MX=data/affiliationstrings/affiliationstrings_MX.csv use_embeddings=FALSE XE=data/affiliationstrings/affiliationstrings_embeddings.csv
+```
+Evaluation:
+```
+./bin/systemds ./scripts/algorithms/entity-resolution/eval-entity-resolution.dml -nvargs FX=data/affiliationstrings/affiliationstrings_res.csv FY=data/affiliationstrings/affiliationstrings_mapping_fixed.csv
+```
+
+### Binary entity resolution
+
+In this case we detect duplicate pairs of rows between two databases. As an example, we use the benchmark dataset DBLP-ACM from Uni Leipzig.
+Embeddings work really well for this dataset, so the results are quite good with an F1 score of 0.89.
+
+Example command to run on DBLP-ACM dataset with embeddings:
+```
+./bin/systemds ./scripts/algorithms/entity-resolution/binary-entity-resolution.dml -nvargs FY=data/DBLP-ACM/ACM_tokens.csv FX=data/DBLP-ACM/DBLP2_tokens.csv MX=data/DBLP-ACM_MX.csv OUT=data/DBLP-ACM/DBLP-ACM_res.csv XE=data/DBLP-ACM/DBLP2_embeddings.csv YE=data/DBLP-ACM/ACM_embeddings.csv use_embeddings=TRUE
+```
+Evaluation:
+```
+./bin/systemds ./scripts/algorithms/entity-resolution/eval-entity-resolution.dml -nvargs FX=data/DBLP-ACM/DBLP-ACM_res.csv FY=data/DBLP-ACM/DBLP-ACM_perfectMapping.csv
+```
+
+## Further Work
+
+1. Better clustering algorithms.
+2. Multi-Probe LSH.
+3. Classifier-based matching.
+4. Better/built-in tokenization
+5. Better/built-in embeddings.

Review comment:
       I think it is great that you include this further work. :star: 
   
   Could you clarify for me:
   
   1. Better clustering algorithms: are you thinking of something specific?
   2. Multi-Probe LSH could be cool. I think the direction should be a dedicated operator inside the system, since in this implementation there is currently an overhead from converting to double all the time. Then different LSH techniques could easily be compared.
   3. I don't know what you intend here; could you elaborate?
   4. & 5. Are you referring to the GloVe embeddings or to something else specific?
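
   To make the two LSH parameters concrete for this discussion, here is a minimal sketch of random-hyperplane LSH, one common LSH scheme. That `primitives/blocking.dml` implements exactly this scheme is an assumption based on the `num_hashtables` and `num_hyperplanes` parameters; the class and method names below are hypothetical:

```java
import java.util.Random;

// Illustrative sketch of random-hyperplane LSH; not the code from the PR.
class HyperplaneLshSketch {

    // Draws numTables * numPlanes random hyperplane normals of dimension dim.
    static double[][][] randomPlanes(int numTables, int numPlanes, int dim, long seed) {
        Random rnd = new Random(seed);
        double[][][] planes = new double[numTables][numPlanes][dim];
        for (double[][] table : planes)
            for (double[] plane : table)
                for (int d = 0; d < dim; d++)
                    plane[d] = rnd.nextGaussian();
        return planes;
    }

    // Bucket id of x in one hash table: one bit per hyperplane, set when x lies
    // on the non-negative side of that hyperplane.
    static int bucket(double[] x, double[][] tablePlanes) {
        int code = 0;
        for (double[] plane : tablePlanes) {
            double dot = 0.0;
            for (int d = 0; d < x.length; d++)
                dot += plane[d] * x[d];
            code = (code << 1) | (dot >= 0.0 ? 1 : 0);
        }
        return code;
    }

    // Two rows become a candidate pair if they share a bucket in at least one table.
    static boolean isCandidatePair(double[] a, double[] b, double[][][] planes) {
        for (double[][] tablePlanes : planes)
            if (bucket(a, tablePlanes) == bucket(b, tablePlanes))
                return true;
        return false;
    }

    public static void main(String[] args) {
        double[][][] planes = randomPlanes(6, 4, 3, 42L);
        double[] row = {1.0, 2.0, 0.5};
        double[] scaled = {2.0, 4.0, 1.0}; // same direction as row, twice the length
        // Scaling by a positive factor keeps a vector on the same side of every hyperplane.
        System.out.println(isCandidatePair(row, scaled, planes));
    }
}
```

   In this scheme, more hyperplanes per table make the buckets smaller (fewer candidates per row), while more tables give each pair more chances to collide.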

##########
File path: src/test/java/org/apache/sysds/test/TestUtils.java
##########
@@ -62,9 +62,9 @@
  * <li>clean up</li>
  * </ul>
  */
-public class TestUtils 
+public class TestUtils

Review comment:
       I think you might have accidentally reformatted TestUtils.
   This needs to be done, but I would rather do it in a single PR/commit.

##########
File path: scripts/staging/entity-resolution/entity-clustering.dml
##########
@@ -0,0 +1,119 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+#
+# THIS SCRIPT PERFORMS AN ENTITY RESOLUTION PIPELINE FOR CLUSTERING ON A SINGLE FILE
+# CONSISTS OF BLOCKING, MATCHING, AND CLUSTERING
+#
+# INPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME           TYPE   DEFAULT  MEANING
+# ---------------------------------------------------------------------------------------------
+# FX              String  ---     Location to read the frame of tokens in bow format
+#                                 Each line contains comma separated list of id, token and value
+# OUT             String  ---     Location to save the output of maching pairs
+#                                 Each line contains comma separated ids of one matched pair
+#                                 Third column provides the similarity score
+# threshold       Double  0.9     Threshold to be considered as a match
+# blocking_method String  naive   Possible values: ["naive", "lsh"].
+# num_blocks      Int     1       Number of blocks for naive blocking
+# num_hashtables  Int     6       Number of hashtables for LSH blocking.
+# num_hyperplanes Int     4       Number of hyperplanes for LSH blocking.
+
+# use_tokens      Boolean TRUE    Whether to use the tokens of FX to generate predictions
+# use_embeddings  Boolean FALSE   Whether to use the embeddings of XE to generate predictions
+# XE              String  ---     Location to read the frame of embedding matrix
+#                                 Required if use_embeddings is set to TRUE
+# store_mapping   Boolean FALSE   Whether to store the mapping of transformencode
+# MX              String  ---     Location to write the frame of mapping
+#                                Required if store_mapping is set to TRUE
+# ---------------------------------------------------------------------------------------------
+# OUTPUT: frame of maching pairs
+# ---------------------------------------------------------------------------------------------
+
+source("./scripts/staging/entity-resolution/primitives/preprocessing.dml") as pre;
+source("./scripts/staging/entity-resolution/primitives/postprocessing.dml") as post;
+source("./scripts/staging/entity-resolution/primitives/pipeline.dml") as pipe;
+
+# Command Line Arguments
+fileFX = $FX;
+fileOUT = $OUT;
+
+threshold = ifdef($threshold, 0.9);
+blocking_method = ifdef($blocking_method, "lsh");
+num_blocks = ifdef($num_blocks, 1);
+num_hyperplanes = ifdef($num_hyperplanes, 4);
+num_hashtables = ifdef($num_hashtables, 6);
+use_tokens = ifdef($use_tokens, TRUE);
+use_embeddings = ifdef($use_embeddings, FALSE);
+# file XE is only required if using embeddings
+fileXE = ifdef($XE, "");
+# mapping file is required for evaluation
+store_mapping = ifdef($store_mapping, FALSE);
+fileMX = ifdef($MX, "");
+
+if (!(blocking_method == "naive" | blocking_method == "lsh")) {
+  print("ERROR: blocking method must be in ['naive', 'lsh']");
+}
+
+# Read data
+FX = read(fileFX);
+if (use_embeddings) {
+  if (fileXE == "") {
+    print("You need to specify file XE when use_embeddings is set to TRUE");
+  } else {
+    X_embeddings = read(fileXE);
+  }
+}
+
+# Convert data
+[X, MX] = pre::convert_frame_tokens_to_matrix_bow(FX);
+if (use_tokens & use_embeddings) {
+  X = cbind(X, X_embeddings);
+} else if (use_tokens) {
+  # Nothing to do in this case, since X already contains tokens
+} else if (use_embeddings) {
+  X = X_embeddings;
+} else {
+  print("Either use_tokens or use_embeddings needs to be TRUE, using tokens only as default.");
+}
+
+if (store_mapping) {
+  if (fileMX == "") {
+    print("You need to specify file MX when store_mapping is set to TRUE.");
+  } else {
+    write(MX, fileMX);
+  }
+}
+
+# Perform clustering
+if (blocking_method == "naive") {
+  CLUSTER = pipe::entity_clustering_pipeline(X, num_blocks, threshold);
+} else if (blocking_method == "lsh") {
+  CLUSTER = pipe::entity_clustering_pipeline_lsh(X, num_hashtables, num_hyperplanes, threshold);
+}

Review comment:
       Can you maybe elaborate on the performance of these two techniques?
   
   If possible, could you try adding another case using k-means (only if applicable and you don't have to change much)?
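
   As a rough back-of-the-envelope for the naive case: assuming candidate pairs are only compared within a block and that blocks are as evenly sized as possible (both are simplifying assumptions, not measurements of this PR), the number of pairwise comparisons can be estimated as in the sketch below. The class and method names are hypothetical:

```java
// Illustrative cost estimate for blocking; not code from the PR.
class BlockingCostSketch {

    // Number of unordered pairs among n rows: n * (n - 1) / 2.
    static long pairs(long n) {
        return n * (n - 1) / 2;
    }

    // Pairwise comparisons when rows are split into numBlocks blocks of
    // near-equal size and only rows inside the same block are compared.
    static long blockedPairs(long rows, long numBlocks) {
        long base = rows / numBlocks;   // size of the smaller blocks
        long larger = rows % numBlocks; // this many blocks hold one extra row
        return larger * pairs(base + 1) + (numBlocks - larger) * pairs(base);
    }

    public static void main(String[] args) {
        System.out.println(pairs(1000));            // 499500 comparisons without blocking
        System.out.println(blockedPairs(1000, 10)); // 49500 comparisons with 10 blocks
    }
}
```

   With `num_blocks = 1` this degenerates to comparing all pairs. The saving from more blocks comes at the cost of missing matches whose rows fall into different blocks.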
   
   

##########
File path: scripts/staging/entity-resolution/primitives/blocking.dml
##########
@@ -0,0 +1,292 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# Splits the rows of X into num_blocks non-overlapping regions.
+# May produce less than num_blocks regions.
+#
+# INPUT PARAMETERS:
+# --------------------------------------------------------------------------------------------
+# NAME          TYPE    DEFAULT   MEANING
+# --------------------------------------------------------------------------------------------
+# X             matrix  ---       A dataset with rows to split into blocks.
+# num_blocks    Integer ---       How many blocks to produce.
+#
+# Output:
+# --------------------------------------------------------------------------------------------
+# NAME          TYPE     MEANING
+# --------------------------------------------------------------------------------------------
+# BLOCKS         Double   A column vector with start indices for each block. Has one more row
+#                         than blocks such that the last row is the end index of the last block.
+# --------------------------------------------------------------------------------------------
+naive_blocking = function(Matrix[Double] X, Integer num_blocks) return (Matrix[Double] BLOCKS) {
+  block_size_flt = nrow(X) / num_blocks;
+  if (block_size_flt < 1.0) {
+    BLOCKS= seq(1, nrow(X) + 1);
+  } else {
+    block_size = ceil(block_size_flt);
+    BLOCKS = rbind(as.matrix(1), 1 +  block_size * seq(1, num_blocks));
+    BLOCKS[num_blocks+1,] = nrow(X) + 1
+  }
+}
+
+# Sorts the rows in dataset X by the vector v. The order is defined by the ascending order of v.
+# Can return the vector v as first column of X, depending on parameter prepend_v.
+#
+# INPUT PARAMETERS:
+# --------------------------------------------------------------------------------------------
+# NAME          TYPE    DEFAULT   MEANING
+# --------------------------------------------------------------------------------------------
+# X             matrix  ---       Any matrix with rows to be sorted.
+# v             matrix  ---       A vector with the same number of rows as X. Defines ordering.
+# prepend_v     boolean ---       Whether to return v as first column of X.
+#
+# Output:
+# --------------------------------------------------------------------------------------------
+# NAME          TYPE     MEANING
+# --------------------------------------------------------------------------------------------
+# X_index       matrix  A vector with the original row number for each row in X.
+# X_sorted      matrix  The sorted input matrix X.
+# --------------------------------------------------------------------------------------------
+sort_by_vector = function(Matrix[Double] X, Matrix[Double] v, Boolean prepend_v) return (Matrix[Double] X_index, Matrix[Double] X_sorted) {
+  X_index = order(target=v, by=1, decreasing = FALSE, index.return=TRUE);
+  X_sorted = order(target=cbind(v, X), by=1, decreasing = FALSE, index.return=FALSE);
+  if (!prepend_v) {
+    X_sorted = X_sorted[,2:ncol(X_sorted)];
+  }
+}
+
+# Sorts the rows in dataset X by their sum.
+#
+# INPUT PARAMETERS:
+# --------------------------------------------------------------------------------------------
+# NAME          TYPE    DEFAULT   MEANING
+# --------------------------------------------------------------------------------------------
+# X             matrix  ---       Any matrix with rows to be sorted.
+#
+# Output:
+# --------------------------------------------------------------------------------------------
+# NAME          TYPE     MEANING
+# --------------------------------------------------------------------------------------------
+# X_index       matrix  A vector with the original row number for each row in X.
+# X_sorted      matrix  The sorted input matrix X.
+# --------------------------------------------------------------------------------------------
+row_sum_sorting = function(Matrix[Double] X) return (Matrix[Double] X_index, Matrix[Double] X_sorted) {
+  X_rowSum = rowSums(X);
+  [X_index, X_sorted] = sort_by_vector(X, X_rowSum, FALSE);
+}
+
+# Sorts the rows in dataset X by their original row numbers (index).
+#
+# INPUT PARAMETERS:
+# --------------------------------------------------------------------------------------------
+# NAME          TYPE    DEFAULT   MEANING
+# --------------------------------------------------------------------------------------------
+# X             matrix  ---       Any matrix with rows to be sorted.
+# X_index       matrix  ---       The orginal row numbers to be restored.
+#
+# Output:
+# --------------------------------------------------------------------------------------------
+# NAME          TYPE     MEANING
+# --------------------------------------------------------------------------------------------
+# X_reindex     matrix  The reindexed matrix X.
+# --------------------------------------------------------------------------------------------
+reindex_rowwise = function(Matrix[Double] X, Matrix[Double] X_index) return (Matrix[Double] X_reindex) {
+  X_conc = cbind(X_index, X);
+  X_reindex = order(target=X_conc, by=1, decreasing = FALSE, index.return=FALSE);
+  # Remove index column
+  X_reindex = X_reindex[,2:ncol(X_reindex)];
+}
+
+# Sorts both rows and columns in dataset X by their original row numbers (index).
+#
+# INPUT PARAMETERS:
+# --------------------------------------------------------------------------------------------
+# NAME          TYPE    DEFAULT   MEANING
+# --------------------------------------------------------------------------------------------
+# X             matrix  ---       Any matrix that should be resorted.
+# X_index       matrix  ---       The orginal row numbers to be restored.
+#
+# Output:
+# --------------------------------------------------------------------------------------------
+# NAME          TYPE     MEANING
+# --------------------------------------------------------------------------------------------
+# X_reindex     matrix  The reindexed matrix X.
+# --------------------------------------------------------------------------------------------
+reindex_rows_and_cols = function(Matrix[Double] X, Matrix[Double] X_index) return (Matrix[Double] X_reindex) {
+  # First reindex rows
+  X_reindex = X;
+  X_reindex = reindex_rowwise(X_reindex, X_index);
+  # Then transpose and repeat
+  X_reindex = t(X_reindex);
+  X_reindex = reindex_rowwise(X_reindex, X_index);
+}
+
+# Generates a random matrix of hyperplane parameters.
+#
+# INPUT PARAMETERS:
+# --------------------------------------------------------------------------------------------
+# NAME            TYPE    DEFAULT   MEANING
+# --------------------------------------------------------------------------------------------
+# num_hyperplanes  Integer ---       How many hyperplanes to generate.
+# dimension        Integer ---       The number of parameters per hyperplane.
+#
+# Output:
+# --------------------------------------------------------------------------------------------
+# NAME          TYPE     MEANING
+# --------------------------------------------------------------------------------------------
+# H              matrix   A num_hyperplanes x dimension matrix of hyperplane parameters.
+# --------------------------------------------------------------------------------------------
+gen_rand_hyperplanes = function(Integer num_hyperplanes, Integer dimension) return (Matrix[Double] H) {
+  H = rand(rows=num_hyperplanes, cols=dimension, min=-1, max=1);
+}
+
+# Creates a scalar hash code for each row in X_plane by interpreting positive values as a binary
+# binary number.

Review comment:
       binary twice
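
For context, the quoted comment describes turning each row of `X_plane` into one scalar by reading its positive entries as bits. The function body is not part of this hunk, so the following is only a rough Python sketch of that idea under stated assumptions: the helper name `rows_to_hash_codes` is hypothetical, and the bit order (first column as most significant bit) is assumed and may differ from what the PR actually does.

```python
from typing import List, Sequence


def rows_to_hash_codes(x_plane: Sequence[Sequence[float]]) -> List[int]:
    """Hypothetical helper: map each row to one integer hash code.

    A positive entry contributes a 1 bit, anything else a 0 bit.
    Assumption: the first column is the most significant bit.
    """
    codes = []
    for row in x_plane:
        code = 0
        for value in row:
            # Shift the bits collected so far and append the new bit.
            code = (code << 1) | (1 if value > 0 else 0)
        codes.append(code)
    return codes


# Row 1 has the bit pattern 1 0 1, row 2 has the bit pattern 0 1 0.
example = [[0.5, -1.0, 2.0], [-0.3, 0.1, -0.2]]
print(rows_to_hash_codes(example))
```

Rows that fall on the same side of every hyperplane get the same code, which is what makes the code usable as a blocking key.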




