[ https://issues.apache.org/jira/browse/SYSTEMML-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511329#comment-15511329 ]
Mike Dusenberry edited comment on SYSTEMML-946 at 9/21/16 10:04 PM:
--------------------------------------------------------------------

[~mboehm7] This is using a slightly modified version of the LeNet example I wrote (using the library), plus some logic to perform a hyperparameter search over it. It has been working on small samples of the overall data (although today I am running into null pointer issues in SYSTEMML-948 with the small data that may be related to the updates for this JIRA). Here's the code, using DML and the Python MLContext API. Basically, I have a **conversion** script that scales the features and one-hot encodes the labels for both the training and validation splits of the data. Then I have a **train** script that performs a hyperparameter search by repeatedly sampling random values for the various hyperparameters, training a LeNet-like neural net with them, and saving the resulting accuracies to a file. I want to be able to let this run for the next week without any errors.

Conversion code:
{code}
# Assumes a SparkContext `sc` and Spark DataFrames `X_df`, `X_val_df`, `Y_df`,
# and `Y_val_df` created elsewhere.
from systemml import MLContext, dml

ml = MLContext(sc)

script = """
# Rescale images from [0, 255] to [-1, 1]
X = (X / 255) * 2 - 1
X_val = (X_val / 255) * 2 - 1

# One-hot encode the labels
num_tumor_classes = 3
n = nrow(Y)
n_val = nrow(Y_val)
Y = table(seq(1, n), Y, n, num_tumor_classes)
Y_val = table(seq(1, n_val), Y_val, n_val, num_tumor_classes)
"""
outputs = ("X", "X_val", "Y", "Y_val")
script = dml(script).input(X=X_df, X_val=X_val_df, Y=Y_df, Y_val=Y_val_df).output(*outputs)
X, X_val, Y, Y_val = ml.execute(script).get(*outputs)
{code}

Training:
{code}
script = """
source("mnist_lenet.dml") as clf

i = 0
run = TRUE
# Intentionally loop forever, sampling a new hyperparameter setting each iteration.
while (run) {
  # Hyperparameters & Settings
  lr = 10 ^ as.scalar(rand(rows=1, cols=1, min=-7, max=-1))      # learning rate, log-uniform
  mu = as.scalar(rand(rows=1, cols=1, min=0.5, max=0.9))         # momentum
  decay = as.scalar(rand(rows=1, cols=1, min=0.9, max=1))        # learning rate decay
  lambda = 10 ^ as.scalar(rand(rows=1, cols=1, min=-7, max=-1))  # regularization strength, log-uniform
  batch_size = 50
  epochs = 1
  iters = ceil(nrow(Y) / batch_size)

  # Train
  [W1, b1, W2, b2, W3, b3, W4, b4] = clf::train(X, Y, X_val, Y_val, C, Hin, Win, lr, mu, decay, lambda, batch_size, epochs, iters)

  # Eval
  probs = clf::predict(X, C, Hin, Win, W1, b1, W2, b2, W3, b3, W4, b4)
  [loss, accuracy] = clf::eval(probs, Y)
  probs_val = clf::predict(X_val, C, Hin, Win, W1, b1, W2, b2, W3, b3, W4, b4)
  [loss_val, accuracy_val] = clf::eval(probs_val, Y_val)

  # Save hyperparams to a file named by the accuracies & run counter
  str = "lr: " + lr + ", mu: " + mu + ", decay: " + decay + ", lambda: " + lambda
  name = "models/" + accuracy_val + "," + accuracy + "," + i
  write(str, name)
  i = i + 1
}
"""
script = dml(script).input(X=X, X_val=X_val, Y=Y, Y_val=Y_val, C=3, Hin=256, Win=256)
ml.execute(script)
{code}

{{mnist_lenet.dml}}: Attached
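For a sense of why this dataset stresses the dataframe-matrix converters discussed in this issue, here is a rough back-of-the-envelope sketch (my illustrative numbers, assuming dense FP64 cells and 128 MB input partitions; the 1000-row dense block allocation comes from the issue description below):

{code}
# Hypothetical memory estimate for converting this data to binary blocks.
C, Hin, Win = 3, 256, 256          # channels, height, width (from the script above)
cols = C * Hin * Win               # 196,608 columns per example
bytes_per_cell = 8                 # dense FP64

row_bytes = cols * bytes_per_cell                   # ~1.57 MB per row
block_row_bytes = 1000 * row_bytes                  # ~1.57 GB to allocate one 1000-row block row
partition_bytes = 128 * 1024 ** 2                   # assumed 128 MB input partition
rows_per_partition = partition_bytes // row_bytes   # ~85 rows actually present per partition

print("columns per example: %d" % cols)
print("one 1000-row block row: %.2f GB" % (block_row_bytes / 1e9))
print("rows per 128 MB partition: %d" % rows_per_partition)
print("temporary-to-partition blowup: ~%dx" % (block_row_bytes // partition_bytes))
{code}

Under these assumptions, each partition carries only ~85 of the 1000 rows needed to fill a block row, so the temporary allocation is roughly an order of magnitude larger than the partition itself, and it approaches the 1000x worst case described below as partitions get thinner.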
> OOM on spark dataframe-matrix / csv-matrix conversion
> -----------------------------------------------------
>
>                 Key: SYSTEMML-946
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-946
>             Project: SystemML
>          Issue Type: Bug
>          Components: Runtime
>            Reporter: Matthias Boehm
>         Attachments: mnist_lenet.dml
>
>
> The decision on dense/sparse block allocation in our dataframeToBinaryBlock and csvToBinaryBlock data converters is based purely on the sparsity. This works very well for the common case of tall & skinny matrices. However, for scenarios with dense data but a huge number of columns, a single partition will rarely have the 1000 rows needed to fill an entire row of blocks. This leads to unnecessary allocation and dense-sparse conversion, as well as potential out-of-memory errors, because the temporary memory requirement can be up to 1000x larger than the input partition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)