I did see that other post, but I really thought that this could be a different problem. The save function has been running for the past 20 hours without terminating. I am inexperienced with serializers, but I will see what I can make of the code you posted. Thank you very much.
On Sunday, January 24, 2016 at 10:29:25 AM UTC-8, Tim Holy wrote:
>
> Similar question here, asked just a couple of days ago (please do search the
> archives first):
> https://groups.google.com/d/msg/julia-users/VInJ4M-yNUY/Z6N8wCCfAwAJ
>
> Someone should just add a serializer to the relevant random forest/decision
> tree packages. These aren't hard to write, and there's an example in the
> linked docs.
>
> For reference, here's a more complicated example: in my own lab's code, we use
> "tile trees" to represent sums over little pieces of images. They combine
> QuadTrees/OctTrees (depending on spatial dimensionality) with spatio-temporal
> factorizations. The main point being that these might seem like fairly
> complicated data structures, yet the serializer and deserializer can each be
> written in ~10 lines of code, and gave me an orders-of-magnitude performance
> improvement when saving/loading.
>
> For reference, I've pasted the code below: it's not self-contained, but it
> should give you the idea.
>
> Best,
> --Tim
>
> # This contains info needed to reconstruct the BoxTree, but does not store the
> # BoxTree itself
> type TileTreeSerializer{TT<:Tile}
>     tiles::Vector{TT}
>     ids::Vector{Int}
>     ntiles::Int
>     dims::Dims
>     Ts::Type
>     Tel::Type
>     K::Int
>     W::Tuple
> end
> TileTrees.tiletype{TT}(::Type{TileTreeSerializer{TT}}) = TT
> TileTrees.tiletype{TT}(::TileTreeSerializer{TT}) = TT
>
> function JLD.readas(serdata::TileTreeSerializer)
>     bt = boxtree(serdata.Ts, serdata.Tel, serdata.K, serdata.W,
>                  dimspans(serdata.dims[1:end-1]))
>     TT = tiletype(serdata)
>     tiles = Array(TT, serdata.ntiles)
>     for i = 1:length(serdata.tiles)
>         id = serdata.ids[i]
>         tile = serdata.tiles[i]
>         tiles[id] = tile
>         roi = boxroi(tile.spans, id)
>         push!(bt, roi)
>     end
>     ttree = TileTree(tiles, bt, serdata.dims)
> end
>
> function JLD.writeas(ttree::TileTree)
>     tiles = Array(tiletype(ttree), 0)
>     ids = Int[]
>     for (id, tile) in ttree
>         push!(tiles, tile)
>         push!(ids, id)
>     end
>     BT = boxtreetype(ttree)
>     ST = splittype(BT)
>     TileTreeSerializer{tiletype(ttree)}(
>         tiles,
>         ids,
>         length(ttree.tiles),
>         ttree.dims,
>         ST,
>         eltype(BT),
>         splitk(BT),
>         (splitwidth(BT)...))
> end
>
>
> On Sunday, January 24, 2016 02:15:50 AM Pedro Silva wrote:
> > I've been training a lot of random forests in a really big dataset and while
> > saving my transformations of the data in JLD files has been a breeze saving
> > the Models and their respective details is not going smoothly. I'm
> > experimenting with different sizes of trees and different number of
> > parameters per tree, so I have 10 forests total and since they take about 1
> > hour to train each I'd like to save them every 7 iterations in case I have
> > to shut down a machine. My code for the process is the following:
> >
> > using HDF5, JLD, DataFrames, Distributions, DecisionTree, MLBase, StatsBase
> >
> > ...
> >
> > num_of_trees = collect(10:10:100);
> > num_of_features = collect(20:5:50);
> > Models = Array{DecisionTree.Ensemble}(length(num_of_trees),length(num_of_features));
> > Predictions = Array{Array{Float64,1}}(length(num_of_trees),length(num_of_features));
> > RMSEs = Array{Float64}(length(num_of_trees),length(num_of_features));
> > train = rand(Bernoulli(0.8), size(Y)) .== 1;
> >
> > for i in 1:length(num_of_trees)
> >     for j in 1:length(num_of_features)
> >         Models[i,j] = build_forest(Y[train],DataSTD[train,:],num_of_features[j],num_of_trees[i]);
> >         Predictions[i,j] = apply_forest(Models[i,j], DataSTD[!train,:]);
> >         RMSEs[i,j] = root_mean_squared_error(Y[!train], Predictions[i,j]);
> >         println("\n", Models[i,j])
> >         println("Features: ",num_of_features[j])
> >         println("RMSE: ",RMSEs[i,j])
> >
> >         display(confusion_matrix_regression(Y[!train],Predictions[i,j],10))
> >     end
> >     save("Models_run1.jld", "Models", Models, "Features", num_of_features,
> >          "Predictions", Predictions, "RMSEs", RMSEs, "Bernoulli", train);
> > end
> >
> > Finishing the internal for loop takes around 7 hours, which is not a
> > surprise, but the save function runs for hours as well. The file keeps
> > slowly increasing in size, so I think something is happening but I'm not
> > sure what. I'm still unable to get to a second iteration of my outer loop
> > after 3 hours of the intern loop has finished. I plan to leave it running
> > over night to see whether it fails or finishes. Any idea on why this is
> > happening?
