[ https://issues.apache.org/jira/browse/SYSTEMML-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15511329#comment-15511329 ]
Mike Dusenberry commented on SYSTEMML-946:
------------------------------------------
[~mboehm7] This uses a slightly modified version of the LeNet example I
wrote (using the library), plus some logic to perform a hyperparameter
search over it. Here's the code, using DML and the Python MLContext API.
Basically, I have a **conversion** script that scales the features and
one-hot encodes the labels for both the training and validation splits of
the data. Then I have a **train** script that performs a hyperparameter
search: it repeatedly samples random values for the various
hyperparameters, trains a LeNet-like neural net with each setting, and
saves the hyperparameters to a file named with the resulting accuracies.
I want to let this run for the next week without any errors. It has been
working on small samples of the overall data (although today I am running
into the null pointer issues in SYSTEMML-948 with the small data, which
may be related to the updates for this JIRA).
Conversion code:
{code}
script = """
# Scale images to [0,1]
X = (X / 255) * 2 - 1
X_val = (X_val / 255) * 2 - 1
# One-hot encode the labels
num_tumor_classes = 3
n = nrow(Y)
n_val = nrow(Y_val)
Y = table(seq(1, n), Y, n, num_tumor_classes)
Y_val = table(seq(1, n_val), Y_val, n_val, num_tumor_classes)
"""
outputs = ("X", "X_val", "Y", "Y_val")
script = dml(script).input(X=X_df, X_val=X_val_df, Y=Y_df,
                           Y_val=Y_val_df).output(*outputs)
X, X_val, Y, Y_val = ml.execute(script).get(*outputs)
{code}
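For anyone following along, the {{table(seq(1, n), Y, n, num_tumor_classes)}}
call builds the one-hot matrix as a contingency table: row i gets a 1 in
column Y[i]. A tiny standalone sketch (the toy labels are made up purely for
illustration):
{code}
script = """
# Toy labels in {1, 2, 3}: table(seq(1, n), Y, n, K) puts a 1 at (i, Y[i]).
Y_toy = matrix("2 1 3", rows=3, cols=1)
Y_onehot = table(seq(1, 3), Y_toy, 3, 3)
print(toString(Y_onehot))  # rows: [0 1 0], [1 0 0], [0 0 1]
"""
ml.execute(dml(script))
{code}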
Training:
{code}
script = """
source("mnist_lenet.dml") as clf
i = 0
run = TRUE
while(run) {
# Hyperparameters & Settings
lr = 10 ^ as.scalar(rand(rows=1, cols=1, min=-7, max=-1))
mu = as.scalar(rand(rows=1, cols=1, min=0.5, max=0.9))
decay = as.scalar(rand(rows=1, cols=1, min=0.9, max=1))
lambda = 10 ^ as.scalar(rand(rows=1, cols=1, min=-7, max=-1))
batch_size = 50
epochs = 1
iters = ceil(nrow(Y) / batch_size)
# Train
[W1, b1, W2, b2, W3, b3, W4, b4] = clf::train(X, Y, X_val, Y_val, C, Hin,
Win, lr, mu, decay, lambda, batch_size, epochs, iters)
# Eval
probs = clf::predict(X, C, Hin, Win, W1, b1, W2, b2, W3, b3, W4, b4)
[loss, accuracy] = clf::eval(probs, Y)
probs_val = clf::predict(X_val, C, Hin, Win, W1, b1, W2, b2, W3, b3, W4, b4)
[loss_val, accuracy_val] = clf::eval(probs_val, Y_val)
# Save hyperparams
str = "lr: " + lr + ", mu: " + mu + ", decay: " + decay + ", lambda: " +
lambda
name = "models/"+accuracy_val+","+accuracy+","+i
write(str, name)
i = i + 1
}
"""
script = dml(script).input(X=X, X_val=X_val, Y=Y, Y_val=Y_val,
                           C=3, Hin=256, Win=256)
ml.execute(script)
{code}
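As an aside, the {{"models/<val_acc>,<train_acc>,<i>"}} naming scheme above
makes it easy to recover the best run afterwards. A rough post-processing
sketch (hypothetical; it assumes the {{write}} calls land on the local
filesystem rather than HDFS, and skips the {{.mtd}} metadata files that
{{write}} also emits):
{code}
import os

# Hypothetical sketch: each result file is named "val_acc,train_acc,i", so
# sorting on the first comma-separated field surfaces the best run.
files = [f for f in os.listdir("models") if not f.endswith(".mtd")]
best = max(files, key=lambda f: float(f.split(",")[0]))
with open(os.path.join("models", best)) as f:
    print(best)      # e.g. "0.97,0.98,42"
    print(f.read())  # e.g. "lr: 0.0012, mu: 0.85, decay: 0.95, lambda: 1e-05"
{code}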
{{mnist_lenet.dml}}: Attached
> OOM on spark dataframe-matrix / csv-matrix conversion
> -----------------------------------------------------
>
> Key: SYSTEMML-946
> URL: https://issues.apache.org/jira/browse/SYSTEMML-946
> Project: SystemML
> Issue Type: Bug
> Components: Runtime
> Reporter: Matthias Boehm
>
> The decision on dense/sparse block allocation in our dataframeToBinaryBlock
> and csvToBinaryBlock data converters is purely based on the sparsity. This
> works very well for the common case of tall & skinny matrices. However, in
> scenarios with dense data but a huge number of columns, a single partition
> will rarely have 1000 rows to fill an entire row of blocks. This leads to
> unnecessary allocation and dense-sparse conversion as well as potential
> out-of-memory errors, because the temporary memory requirement can be up to
> 1000x larger than the input partition.
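To make the up-to-1000x figure concrete, a back-of-envelope sketch (assuming
SystemML's default 1000 x 1000 binary blocks and the 3 x 256 x 256 images
from the repro above, i.e. 196,608 columns):
{code}
import math

cols = 3 * 256 * 256                 # 196,608 features per 3 x 256 x 256 image
partition_bytes = cols * 8           # a 1-row dense partition: ~1.5 MB
num_blocks = math.ceil(cols / 1000)  # 197 column blocks in one row of blocks
temp_bytes = num_blocks * 1000 * 1000 * 8  # ~1.5 GB of dense blocks allocated
print(temp_bytes / partition_bytes)  # ~1000x the input partition
{code}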