I've been training a lot of random forests on a really big dataset, and while saving my transformations of the data in JLD files has been a breeze, saving the models and their respective details is not going smoothly. I'm experimenting with different forest sizes and different numbers of features per tree, so I have a 10 × 7 grid of forests, and since each takes about an hour to train I'd like to save them every 7 iterations (i.e. after each pass of the inner loop) in case I have to shut down a machine. My code for the process is the following:
using HDF5, JLD, DataFrames, Distributions, DecisionTree, MLBase, StatsBase
...
num_of_trees = collect(10:10:100);
num_of_features = collect(20:5:50);
Models = Array{DecisionTree.Ensemble}(length(num_of_trees), length(num_of_features));
Predictions = Array{Array{Float64,1}}(length(num_of_trees), length(num_of_features));
RMSEs = Array{Float64}(length(num_of_trees), length(num_of_features));
train = rand(Bernoulli(0.8), size(Y)) .== 1;
for i in 1:length(num_of_trees)
    for j in 1:length(num_of_features)
        Models[i,j] = build_forest(Y[train], DataSTD[train,:], num_of_features[j], num_of_trees[i]);
        Predictions[i,j] = apply_forest(Models[i,j], DataSTD[!train,:]);
        RMSEs[i,j] = root_mean_squared_error(Y[!train], Predictions[i,j]);
        println("\n", Models[i,j])
        println("Features: ", num_of_features[j])
        println("RMSE: ", RMSEs[i,j])
        display(confusion_matrix_regression(Y[!train], Predictions[i,j], 10))
    end
    save("Models_run1.jld", "Models", Models, "Features", num_of_features,
         "Predictions", Predictions, "RMSEs", RMSEs, "Bernoulli", train);
end
Finishing the inner for loop takes around 7 hours, which is no surprise, but the save call also runs for hours. The file keeps slowly growing in size, so I think something is happening, but I'm not sure what. Three hours after the inner loop finished, I still haven't reached the second iteration of the outer loop. I plan to leave it running overnight to see whether it fails or finishes. Any idea why this is happening?
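For what it's worth, a fallback I've been sketching (untested with the real forests) is to checkpoint each grid cell to its own small file instead of re-saving the whole grid every pass, so each save only writes one model's worth of data. The filenames and the `rand(100)` placeholder standing in for a trained forest are my own inventions:

```julia
using JLD

# Same grid dimensions as in the real run.
num_of_trees = collect(10:10:100);
num_of_features = collect(20:5:50);

for i in 1:length(num_of_trees)
    for j in 1:length(num_of_features)
        result = rand(100)  # placeholder for a trained model / its predictions
        # One small file per (i, j) cell: a crash costs at most the cell
        # currently training, and each save stays cheap because it never
        # rewrites the previously saved cells.
        save("checkpoint_$(i)_$(j).jld", "result", result)
    end
end
```

Whether this actually sidesteps the slowdown would depend on why serializing the full `Models` array is so slow in the first place.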