Repository: systemml
Updated Branches:
  refs/heads/master 91f6fb572 -> 0ba165cdd

[MINOR] Data preparation script for recoding user/product ratings

This utility script allows the conversion of user ratings in triple
representation of alphanumeric user names, alphanumeric product names,
and numeric ratings into an n x m ratings matrix where n is the number
of unique users, and m is the number of unique products. The output
ratings matrix can be directly fed into algorithm scripts such as ALS.

Project: http://git-wip-us.apache.org/repos/asf/systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/systemml/commit/0ba165cd
Tree: http://git-wip-us.apache.org/repos/asf/systemml/tree/0ba165cd
Diff: http://git-wip-us.apache.org/repos/asf/systemml/diff/0ba165cd

Branch: refs/heads/master
Commit: 0ba165cddf38eecd1318c7ce59141c6fb793b1a7
Parents: 91f6fb5
Author: Matthias Boehm <[email protected]>
Authored: Mon Jan 29 22:45:35 2018 -0800
Committer: Matthias Boehm <[email protected]>
Committed: Mon Jan 29 22:47:14 2018 -0800

----------------------------------------------------------------------
 scripts/utils/dataprep.dml | 59 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/systemml/blob/0ba165cd/scripts/utils/dataprep.dml
----------------------------------------------------------------------
diff --git a/scripts/utils/dataprep.dml b/scripts/utils/dataprep.dml
new file mode 100644
index 0000000..cc95958
--- /dev/null
+++ b/scripts/utils/dataprep.dml
@@ -0,0 +1,59 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# Utility functions for common data preparation tasks
+
+/*
+ * Function to convert a frame of user ratings in triple representation of
+ * alphanumeric user names, alphanumeric product names, and numeric ratings
+ * into an n x m ratings matrix where n is the number of unique users, and
+ * m is the number of unique products.
+ *
+ * Inputs:
+ *  - R: ratings in triple representation.
+ *  - thrU: minimum number of ratings per user (<=0 for unconstrained).
+ *  - thrP: minimum number of ratings per product (<=0 for unconstrained).
+ *
+ * Outputs:
+ *  - X: ratings matrix.
+ *  - M: encoding meta data (for user and product name reconstruction).
+ *
+ * Example: amazon product ratings (http://jmcauley.ucsd.edu/data/amazon/)
+ *   F = read("tmp/ratings_books.csv", data_type="frame", format="csv");
+ *   [X, M] = convertToRatingsMatrix(F[, 1:3], 5, 5);
+ */
+convertToRatingsMatrix = function(frame[string] R, int thrU, int thrP)
+  return (matrix[double] X, frame[string] M)
+{
+  # recode users and products into continuous numeric ids
+  jspec = "{ids:true, recode:[1,2]}";
+  [TX,M] = transformencode( target = R, spec = jspec );
+
+  # convert triples of user-product-rating into ratings matrix
+  X_full = table( TX[,1], TX[,2], TX[,3] );
+
+  # filter users and products if necessary (filter applies to original counts)
+  X = X_full;
+  if( thrU > 0 )
+    X = removeEmpty(target=X, margin="rows", select=rowSums(X_full!=0)>=thrU);
+  if( thrP > 0 )
+    X = removeEmpty(target=X, margin="cols", select=colSums(X_full!=0)>=thrP);
+}
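
For readers unfamiliar with DML, the three steps of convertToRatingsMatrix
(recode names to ids, build the matrix from triples, filter by the original
per-user and per-product counts) can be sketched in plain Python. This is
only an illustrative analogue, not SystemML code: the function name
convert_to_ratings_matrix, the first-appearance id ordering, the summing of
duplicate user/product pairs, and the returned name lists (in place of the
encoding meta data frame M) are assumptions made for the sketch, and edge
cases such as a filter that removes everything are not mirrored.

```python
from collections import Counter


def convert_to_ratings_matrix(triples, thr_u=0, thr_p=0):
    """Build a dense user x product matrix from (user, product, rating) triples."""
    users, products = {}, {}  # name -> 0-based id, in order of first appearance
    cells = {}                # (user id, product id) -> rating

    # Step 1 and 2: recode names to consecutive ids and fill the matrix cells.
    # Assumption: ratings of duplicate user/product pairs are summed.
    for user, product, rating in triples:
        i = users.setdefault(user, len(users))
        j = products.setdefault(product, len(products))
        cells[(i, j)] = cells.get((i, j), 0.0) + float(rating)

    # Step 3: count nonzero ratings on the FULL matrix, so that both filters
    # refer to the original counts rather than to already filtered data.
    row_nnz = Counter(i for (i, _), v in cells.items() if v != 0)
    col_nnz = Counter(j for (_, j), v in cells.items() if v != 0)

    # A threshold <= 0 means "unconstrained", as in the DML function.
    keep_rows = [i for i in range(len(users))
                 if thr_u <= 0 or row_nnz[i] >= thr_u]
    keep_cols = [j for j in range(len(products))
                 if thr_p <= 0 or col_nnz[j] >= thr_p]

    user_names, product_names = list(users), list(products)
    X = [[cells.get((i, j), 0.0) for j in keep_cols] for i in keep_rows]
    return (X,
            [user_names[i] for i in keep_rows],
            [product_names[j] for j in keep_cols])


# Hypothetical sample data, for illustration only.
ratings = [("alice", "book1", 5), ("alice", "book2", 3),
           ("bob", "book1", 4), ("carol", "book3", 2)]
X, kept_users, kept_products = convert_to_ratings_matrix(ratings, thr_u=2, thr_p=2)
```

Because both filters use counts from the unfiltered matrix, a product is kept
if it had enough ratings originally, even when some of the users who rated it
are dropped by the user threshold.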
