Repository: systemml
Updated Branches:
  refs/heads/master 91f6fb572 -> 0ba165cdd


[MINOR] Data preparation script for recoding user/product ratings

This utility script allows the conversion of user ratings in triple
representation of alphanumeric user names, alphanumeric product names,
and numeric ratings into an n x m ratings matrix where n is the number
of unique users, and m is the number of unique products. The output
ratings matrix can be directly fed into algorithm scripts such as ALS.


Project: http://git-wip-us.apache.org/repos/asf/systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/systemml/commit/0ba165cd
Tree: http://git-wip-us.apache.org/repos/asf/systemml/tree/0ba165cd
Diff: http://git-wip-us.apache.org/repos/asf/systemml/diff/0ba165cd

Branch: refs/heads/master
Commit: 0ba165cddf38eecd1318c7ce59141c6fb793b1a7
Parents: 91f6fb5
Author: Matthias Boehm <[email protected]>
Authored: Mon Jan 29 22:45:35 2018 -0800
Committer: Matthias Boehm <[email protected]>
Committed: Mon Jan 29 22:47:14 2018 -0800

----------------------------------------------------------------------
 scripts/utils/dataprep.dml | 59 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/systemml/blob/0ba165cd/scripts/utils/dataprep.dml
----------------------------------------------------------------------
diff --git a/scripts/utils/dataprep.dml b/scripts/utils/dataprep.dml
new file mode 100644
index 0000000..cc95958
--- /dev/null
+++ b/scripts/utils/dataprep.dml
@@ -0,0 +1,59 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+# 
+#   http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# Utility functions for common data preparation tasks
+
+/*
+ * Function to convert a frame of user ratings in triple representation of
+ * alphanumeric user names, alphanumeric product names, and numeric ratings
+ * into an n x m ratings matrix where n is the number of unique users, and
+ * m is the number of unique products.
+ * 
+ * Inputs:
+ *  - R: ratings in triple representation.
+ *  - thrU: minimum number of ratings per user (<=0 for unconstrained).
+ *  - thrP: minimum number of ratings per product (<=0 for unconstrained).
+ *
+ * Outputs:
+ *  - X: ratings matrix.
+ *  - M: encoding meta data (for user and product name reconstruction).
+ * 
+ * Example: amazon product ratings (http://jmcauley.ucsd.edu/data/amazon/)
+ *   F = read("tmp/ratings_books.csv", data_type="frame", format="csv");
+ *   [X, M] = convertToRatingsMatrix(F[, 1:3], 5, 5);
+ */
+convertToRatingsMatrix = function(frame[string] R, int thrU, int thrP)
+  return (matrix[double] X, frame[string] M) 
+{
+  # recode users and products into continuous numeric ids
+  jspec = "{ids:true, recode:[1,2]}";
+  [TX,M] = transformencode( target = R, spec = jspec );
+  
+  # convert triples of user-product-rating into ratings matrix
+  X_full = table( TX[,1], TX[,2], TX[,3] );
+
+  # filter users and products if necessary (filter applies to original counts)
+  X = X_full;
+  if( thrU > 0 )
+    X = removeEmpty(target=X, margin="rows", select=rowSums(X_full!=0)>=thrU);
+  if( thrP > 0 )
+    X = removeEmpty(target=X, margin="cols", select=colSums(X_full!=0)>=thrP);
+}
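
For readers without a SystemML setup, the three steps of convertToRatingsMatrix (recode, table, filter) can be mirrored in plain Python. This is only an illustrative sketch, not part of the commit: the function name convert_to_ratings_matrix, the sample triples, and the first-appearance id assignment are assumptions made here, and the ids that transformencode actually assigns may differ.

```python
def convert_to_ratings_matrix(triples, thr_u=0, thr_p=0):
    """Plain-Python sketch of the recode / table / filter steps."""
    # Step 1: recode user and product names into 1-based integer ids.
    # Assumption: ids follow order of first appearance.
    users, products = {}, {}
    for u, p, _ in triples:
        users.setdefault(u, len(users) + 1)
        products.setdefault(p, len(products) + 1)

    # Step 2: build the dense n x m matrix. As with a weighted table(),
    # duplicate (user, product) pairs are summed.
    n, m = len(users), len(products)
    full = [[0.0] * m for _ in range(n)]
    for u, p, r in triples:
        full[users[u] - 1][products[p] - 1] += float(r)

    # Step 3: count non-zero ratings on the unfiltered matrix, so both
    # thresholds apply to the original counts.
    row_cnt = [sum(v != 0 for v in row) for row in full]
    col_cnt = [sum(full[i][j] != 0 for i in range(n)) for j in range(m)]

    # A threshold <= 0 means "unconstrained".
    keep_rows = [i for i in range(n) if thr_u <= 0 or row_cnt[i] >= thr_u]
    keep_cols = [j for j in range(m) if thr_p <= 0 or col_cnt[j] >= thr_p]

    X = [[full[i][j] for j in keep_cols] for i in keep_rows]
    return X, users, products


# Hypothetical sample data: (user name, product name, rating).
triples = [
    ("alice", "book1", 5),
    ("alice", "book2", 3),
    ("bob", "book1", 4),
    ("carol", "book3", 2),
]
X, users, products = convert_to_ratings_matrix(triples, thr_u=2, thr_p=2)
```

Because both filters use counts from the unfiltered matrix, a user who passes thrU can end up with fewer than thrU ratings once sparse products are dropped. In this sketch the returned dictionaries still hold the unfiltered ids, so the rows and columns of a filtered X no longer line up with them one to one.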
