[ https://issues.apache.org/jira/browse/SYSTEMML-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511329#comment-15511329 ]
Mike Dusenberry edited comment on SYSTEMML-946 at 9/21/16 10:04 PM:
--------------------------------------------------------------------

[~mboehm7] This is using a slightly modified version of the LeNet example I wrote (using the library), plus some logic to perform a hyperparameter search over it. It has been working on small samples of the overall data (although today I am running into null pointer issues in SYSTEMML-948 with the small data that may be related to the updates for this JIRA). Here's the code, using DML and the Python MLContext API. Basically, I have a **conversion** script that scales the features and one-hot encodes the labels for both the training and validation splits of the data. Then I have a **train** script that performs a hyperparameter search by repeatedly sampling random values for the various hyperparameters, training a LeNet-like neural net with them, and saving the resulting accuracies to a file. I want to be able to let this run for the next week without any errors.

Conversion code:
{code}
# Assumes a SparkContext `sc` and Spark DataFrames `X_df`, `X_val_df`, `Y_df`,
# and `Y_val_df` created elsewhere.
from systemml import MLContext, dml

ml = MLContext(sc)

script = """
# Rescale images from [0, 255] to [-1, 1]
X = (X / 255) * 2 - 1
X_val = (X_val / 255) * 2 - 1

# One-hot encode the labels
num_tumor_classes = 3
n = nrow(Y)
n_val = nrow(Y_val)
Y = table(seq(1, n), Y, n, num_tumor_classes)
Y_val = table(seq(1, n_val), Y_val, n_val, num_tumor_classes)
"""
outputs = ("X", "X_val", "Y", "Y_val")
script = dml(script).input(X=X_df, X_val=X_val_df, Y=Y_df, Y_val=Y_val_df).output(*outputs)
X, X_val, Y, Y_val = ml.execute(script).get(*outputs)
{code}

Training:
{code}
script = """
source("mnist_lenet.dml") as clf

i = 0
run = TRUE
# Intentionally loop forever, sampling a new hyperparameter setting each iteration.
while (run) {
  # Hyperparameters & Settings
  lr = 10 ^ as.scalar(rand(rows=1, cols=1, min=-7, max=-1))      # learning rate, log-uniform
  mu = as.scalar(rand(rows=1, cols=1, min=0.5, max=0.9))         # momentum
  decay = as.scalar(rand(rows=1, cols=1, min=0.9, max=1))        # learning rate decay
  lambda = 10 ^ as.scalar(rand(rows=1, cols=1, min=-7, max=-1))  # regularization strength, log-uniform
  batch_size = 50
  epochs = 1
  iters = ceil(nrow(Y) / batch_size)

  # Train
  [W1, b1, W2, b2, W3, b3, W4, b4] = clf::train(X, Y, X_val, Y_val, C, Hin, Win, lr, mu, decay, lambda, batch_size, epochs, iters)

  # Eval
  probs = clf::predict(X, C, Hin, Win, W1, b1, W2, b2, W3, b3, W4, b4)
  [loss, accuracy] = clf::eval(probs, Y)
  probs_val = clf::predict(X_val, C, Hin, Win, W1, b1, W2, b2, W3, b3, W4, b4)
  [loss_val, accuracy_val] = clf::eval(probs_val, Y_val)

  # Save hyperparams to a file named by the accuracies & run counter
  str = "lr: " + lr + ", mu: " + mu + ", decay: " + decay + ", lambda: " + lambda
  name = "models/" + accuracy_val + "," + accuracy + "," + i
  write(str, name)
  i = i + 1
}
"""
script = dml(script).input(X=X, X_val=X_val, Y=Y, Y_val=Y_val, C=3, Hin=256, Win=256)
ml.execute(script)
{code}

{{mnist_lenet.dml}}: Attached
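For a sense of why this dataset stresses the dataframe-matrix converters discussed in this issue, here is a rough back-of-the-envelope sketch (my illustrative numbers, assuming dense FP64 cells and 128 MB input partitions; the 1000-row dense block allocation comes from the issue description below):

{code}
# Hypothetical memory estimate for converting this data to binary blocks.
C, Hin, Win = 3, 256, 256          # channels, height, width (from the script above)
cols = C * Hin * Win               # 196,608 columns per example
bytes_per_cell = 8                 # dense FP64

row_bytes = cols * bytes_per_cell                   # ~1.57 MB per row
block_row_bytes = 1000 * row_bytes                  # ~1.57 GB to allocate one 1000-row block row
partition_bytes = 128 * 1024 ** 2                   # assumed 128 MB input partition
rows_per_partition = partition_bytes // row_bytes   # ~85 rows actually present per partition

print("columns per example: %d" % cols)
print("one 1000-row block row: %.2f GB" % (block_row_bytes / 1e9))
print("rows per 128 MB partition: %d" % rows_per_partition)
print("temporary-to-partition blowup: ~%dx" % (block_row_bytes // partition_bytes))
{code}

Under these assumptions, each partition carries only ~85 of the 1000 rows needed to fill a block row, so the temporary allocation is roughly an order of magnitude larger than the partition itself, and it approaches the 1000x worst case described below as partitions get thinner.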
> OOM on spark dataframe-matrix / csv-matrix conversion
> -----------------------------------------------------
>
>                 Key: SYSTEMML-946
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-946
>             Project: SystemML
>          Issue Type: Bug
>          Components: Runtime
>            Reporter: Matthias Boehm
>         Attachments: mnist_lenet.dml
>
>
> The decision on dense/sparse block allocation in our dataframeToBinaryBlock and csvToBinaryBlock data converters is based purely on the sparsity. This works very well for the common case of tall & skinny matrices. However, for scenarios with dense data but a huge number of columns, a single partition will rarely have the 1000 rows needed to fill an entire row of blocks. This leads to unnecessary allocation and dense-sparse conversion, as well as potential out-of-memory errors, because the temporary memory requirement can be up to 1000x larger than the input partition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)