Do you have cycles in the objects you're trying to save (like A->B->A)? I'm not sure JLD handles cycles. If it doesn't, breaking the cycle with a custom serializer will also solve the problem. (More ambitiously, one could also solve the general cycle problem.)
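As a rough illustration of what "breaking the cycle" could look like, here is a minimal sketch. The types `Node` and `NodeListSerializer` and the functions `to_serializer`/`from_serializer` are hypothetical names made up for this example, and it is written in Julia 1.x syntax rather than the older syntax used in the code quoted in this thread. Object references are replaced by integer indices, so the stored form has no cycle.

```julia
# Hypothetical types for illustration; not part of JLD or any package.
mutable struct Node
    name::String
    next::Union{Node,Nothing}   # may point back to an earlier node, forming a cycle
end

# Acyclic stand-in: names are stored by value, links as integer indices.
struct NodeListSerializer
    names::Vector{String}
    next_index::Vector{Int}     # 0 means "no successor"
end

# Walk the chain from `start`, assigning each distinct node an index.
function to_serializer(start::Node)
    index = IdDict{Node,Int}()  # keyed on object identity, so revisits are detected
    names = String[]
    node = start
    while node !== nothing && !haskey(index, node)
        push!(names, node.name)
        index[node] = length(names)
        node = node.next
    end
    next_index = zeros(Int, length(names))
    for (n, i) in index
        n.next === nothing || (next_index[i] = index[n.next])
    end
    return NodeListSerializer(names, next_index)
end

# Rebuild the nodes first, then restore the links (including the cycle).
function from_serializer(s::NodeListSerializer)
    nodes = [Node(name, nothing) for name in s.names]
    for (i, j) in enumerate(s.next_index)
        j == 0 || (nodes[i].next = nodes[j])
    end
    return nodes[1]
end

a = Node("A", nothing)
b = Node("B", a)
a.next = b                      # A -> B -> A
s = to_serializer(a)
a2 = from_serializer(s)
```

With JLD, the two functions would presumably be hooked in through `JLD.writeas(x::Node) = to_serializer(x)` and `JLD.readas(s::NodeListSerializer) = from_serializer(s)`, following the pattern of the TileTree code quoted in this thread; that wiring is left out here so the sketch runs without any packages.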
Best,
--Tim

On Sunday, January 24, 2016 12:31:33 PM Pedro Silva wrote:
> I did see that other post, but I really thought that this could be a
> different problem. The save function has been running for the past 20 hours
> without terminating. I am inexperienced with serializers but I will see
> what I can make from the code you posted. Thank you very much.
>
> On Sunday, January 24, 2016 at 10:29:25 AM UTC-8, Tim Holy wrote:
> > Similar question here, asked just a couple of days ago (please do search
> > the archives first):
> > https://groups.google.com/d/msg/julia-users/VInJ4M-yNUY/Z6N8wCCfAwAJ
> >
> > Someone should just add a serializer to the relevant random
> > forest/decision tree packages. These aren't hard to write, and there's an
> > example in the linked docs.
> >
> > For reference, here's a more complicated example: in my own lab's code, we
> > use "tile trees" to represent sums over little pieces of images. They
> > combine QuadTrees/OctTrees (depending on spatial dimensionality) with
> > spatio-temporal factorizations. The main point being that these might seem
> > like fairly complicated data structures, yet the serializer and
> > deserializer can each be written in ~10 lines of code, and gave me an
> > orders-of-magnitude performance improvement when saving/loading.
> >
> > For reference, I've pasted the code below: it's not self-contained, but it
> > should give you the idea.
> >
> > Best,
> > --Tim
> >
> > # This contains info needed to reconstruct the BoxTree, but does not store
> > # the BoxTree itself
> > type TileTreeSerializer{TT<:Tile}
> >     tiles::Vector{TT}
> >     ids::Vector{Int}
> >     ntiles::Int
> >     dims::Dims
> >     Ts::Type
> >     Tel::Type
> >     K::Int
> >     W::Tuple
> > end
> > TileTrees.tiletype{TT}(::Type{TileTreeSerializer{TT}}) = TT
> > TileTrees.tiletype{TT}(::TileTreeSerializer{TT}) = TT
> >
> > function JLD.readas(serdata::TileTreeSerializer)
> >     bt = boxtree(serdata.Ts, serdata.Tel, serdata.K, serdata.W,
> >                  dimspans(serdata.dims[1:end-1]))
> >     TT = tiletype(serdata)
> >     tiles = Array(TT, serdata.ntiles)
> >     for i = 1:length(serdata.tiles)
> >         id = serdata.ids[i]
> >         tile = serdata.tiles[i]
> >         tiles[id] = tile
> >         roi = boxroi(tile.spans, id)
> >         push!(bt, roi)
> >     end
> >     ttree = TileTree(tiles, bt, serdata.dims)
> > end
> >
> > function JLD.writeas(ttree::TileTree)
> >     tiles = Array(tiletype(ttree), 0)
> >     ids = Int[]
> >     for (id, tile) in ttree
> >         push!(tiles, tile)
> >         push!(ids, id)
> >     end
> >     BT = boxtreetype(ttree)
> >     ST = splittype(BT)
> >     TileTreeSerializer{tiletype(ttree)}(
> >         tiles,
> >         ids,
> >         length(ttree.tiles),
> >         ttree.dims,
> >         ST,
> >         eltype(BT),
> >         splitk(BT),
> >         (splitwidth(BT)...))
> > end
> >
> > On Sunday, January 24, 2016 02:15:50 AM Pedro Silva wrote:
> > > I've been training a lot of random forests on a really big dataset, and
> > > while saving my transformations of the data in JLD files has been a
> > > breeze, saving the Models and their respective details is not going
> > > smoothly. I'm experimenting with different sizes of trees and different
> > > number of parameters per tree, so I have 10 forests total and since they
> > > take about 1 hour to train each I'd like to save them every 7 iterations
> > > in case I have to shut down a machine.
> > > My code for the process is the following:
> > >
> > > using HDF5, JLD, DataFrames, Distributions, DecisionTree, MLBase, StatsBase
> > > ...
> > >
> > > num_of_trees = collect(10:10:100);
> > > num_of_features = collect(20:5:50);
> > > Models = Array{DecisionTree.Ensemble}(length(num_of_trees),length(num_of_features));
> > > Predictions = Array{Array{Float64,1}}(length(num_of_trees),length(num_of_features));
> > > RMSEs = Array{Float64}(length(num_of_trees),length(num_of_features));
> > > train = rand(Bernoulli(0.8), size(Y)) .== 1;
> > >
> > > for i in 1:length(num_of_trees)
> > >     for j in 1:length(num_of_features)
> > >         Models[i,j] = build_forest(Y[train],DataSTD[train,:],num_of_features[j],num_of_trees[i]);
> > >         Predictions[i,j] = apply_forest(Models[i,j], DataSTD[!train,:]);
> > >         RMSEs[i,j] = root_mean_squared_error(Y[!train], Predictions[i,j]);
> > >         println("\n", Models[i,j])
> > >         println("Features: ",num_of_features[j])
> > >         println("RMSE: ",RMSEs[i,j])
> > >         display(confusion_matrix_regression(Y[!train],Predictions[i,j],10))
> > >     end
> > >     save("Models_run1.jld", "Models", Models, "Features", num_of_features,
> > >          "Predictions", Predictions, "RMSEs", RMSEs, "Bernoulli", train);
> > > end
> > >
> > > Finishing the internal for loop takes around 7 hours, which is not a
> > > surprise, but the save function runs for hours as well. The file keeps
> > > slowly increasing in size, so I think something is happening but I'm not
> > > sure what. I'm still unable to get to a second iteration of my outer loop
> > > 3 hours after the inner loop has finished. I plan to leave it running
> > > overnight to see whether it fails or finishes. Any idea on why this is
> > > happening?
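The serializer idea suggested earlier in the thread for tree models can be sketched roughly as follows. This is only an illustration under stated assumptions: `DemoLeaf`, `DemoSplit`, `FlatTree`, `flatten`, and `rebuild` are hypothetical names invented here and are not DecisionTree.jl's types or functions, and the syntax is Julia 1.x. The nested tree is flattened in pre-order into two plain vectors, the intent being that a storage layer sees a couple of arrays rather than one object per node.

```julia
# Hypothetical tree types for illustration; these are NOT DecisionTree.jl's types.
struct DemoLeaf
    value::Float64
end

struct DemoSplit
    feature::Int
    threshold::Float64
    left::Union{DemoLeaf,DemoSplit}
    right::Union{DemoLeaf,DemoSplit}
end

# Flat stand-in: one entry per node, in pre-order.
struct FlatTree
    feature::Vector{Int}      # 0 marks a leaf
    value::Vector{Float64}    # threshold for a split, prediction for a leaf
end

function flatten!(flat::FlatTree, t::DemoLeaf)
    push!(flat.feature, 0)
    push!(flat.value, t.value)
    return flat
end

function flatten!(flat::FlatTree, t::DemoSplit)
    push!(flat.feature, t.feature)
    push!(flat.value, t.threshold)
    flatten!(flat, t.left)    # left subtree is stored directly after its parent
    flatten!(flat, t.right)
    return flat
end

flatten(t) = flatten!(FlatTree(Int[], Float64[]), t)

# Rebuild the subtree starting at `pos`; returns (subtree, next unread position).
function rebuild(flat::FlatTree, pos::Int=1)
    if flat.feature[pos] == 0
        return DemoLeaf(flat.value[pos]), pos + 1
    end
    left, after_left = rebuild(flat, pos + 1)
    right, after_right = rebuild(flat, after_left)
    return DemoSplit(flat.feature[pos], flat.value[pos], left, right), after_right
end

tree = DemoSplit(3, 0.5, DemoLeaf(1.0),
                 DemoSplit(1, 2.0, DemoLeaf(2.0), DemoLeaf(3.0)))
flat = flatten(tree)
restored, _ = rebuild(flat)
```

A real forest would need one such flat record per tree, and the recursive `rebuild` assumes trees shallow enough not to exhaust the stack.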
