Re: [julia-users] JLD save function taking hours to finish

Pedro Silva Sun, 24 Jan 2016 19:21:06 -0800

I did see that other post, but I really thought that this could be a 
different problem. The save function has been running for the past 20 hours 
without terminating. I am inexperienced with serializers, but I will see 
what I can make of the code you posted. Thank you very much.


On Sunday, January 24, 2016 at 10:29:25 AM UTC-8, Tim Holy wrote:
>
> Similar question here, asked just a couple of days ago (please do search the 
> archives first): 
> https://groups.google.com/d/msg/julia-users/VInJ4M-yNUY/Z6N8wCCfAwAJ 
>
> Someone should just add a serializer to the relevant random forest/decision 
> tree packages. These aren't hard to write, and there's an example in the 
> linked docs. 
>
> For reference, here's a more complicated example: in my own lab's code, we 
> use "tile trees" to represent sums over little pieces of images. They combine 
> QuadTrees/OctTrees (depending on spatial dimensionality) with spatio-temporal 
> factorizations. The main point is that these might seem like fairly 
> complicated data structures, yet the serializer and deserializer can each be 
> written in ~10 lines of code, and gave me an orders-of-magnitude performance 
> improvement when saving/loading. 
>
> For reference, I've pasted the code below: it's not self-contained, but it 
> should give you the idea. 
>
> Best, 
> --Tim 
>
> # This contains info needed to reconstruct the BoxTree, but does not store the 
> # BoxTree itself 
> type TileTreeSerializer{TT<:Tile} 
>     tiles::Vector{TT} 
>     ids::Vector{Int} 
>     ntiles::Int 
>     dims::Dims 
>     Ts::Type 
>     Tel::Type 
>     K::Int 
>     W::Tuple 
> end 
> TileTrees.tiletype{TT}(::Type{TileTreeSerializer{TT}}) = TT 
> TileTrees.tiletype{TT}(::TileTreeSerializer{TT}) = TT 
>
> function JLD.readas(serdata::TileTreeSerializer) 
>     bt = boxtree(serdata.Ts, serdata.Tel, serdata.K, serdata.W, 
>                  dimspans(serdata.dims[1:end-1])) 
>     TT = tiletype(serdata) 
>     tiles = Array(TT, serdata.ntiles) 
>     for i = 1:length(serdata.tiles) 
>         id = serdata.ids[i] 
>         tile = serdata.tiles[i] 
>         tiles[id] = tile 
>         roi = boxroi(tile.spans, id) 
>         push!(bt, roi) 
>     end 
>     ttree = TileTree(tiles, bt, serdata.dims) 
> end 
>
> function JLD.writeas(ttree::TileTree) 
>     tiles = Array(tiletype(ttree), 0) 
>     ids = Int[] 
>     for (id, tile) in ttree 
>         push!(tiles, tile) 
>         push!(ids, id) 
>     end 
>     BT = boxtreetype(ttree) 
>     ST = splittype(BT) 
>     TileTreeSerializer{tiletype(ttree)}( 
>         tiles, 
>         ids, 
>         length(ttree.tiles), 
>         ttree.dims, 
>         ST, 
>         eltype(BT), 
>         splitk(BT), 
>         (splitwidth(BT)...)) 
> end 
>
>
> On Sunday, January 24, 2016 02:15:50 AM Pedro Silva wrote: 
> > I've been training a lot of random forests on a really big dataset, and 
> > while saving my transformations of the data in JLD files has been a breeze, 
> > saving the Models and their respective details is not going smoothly. I'm 
> > experimenting with different sizes of trees and different numbers of 
> > parameters per tree, so I have 10 forests total, and since they take about 
> > 1 hour each to train, I'd like to save them every 7 iterations in case I 
> > have to shut down a machine. My code for the process is the following: 
> > 
> > using HDF5, JLD, DataFrames, Distributions, DecisionTree, MLBase, StatsBase 
> > 
> > ... 
> > 
> > num_of_trees = collect(10:10:100); 
> > num_of_features = collect(20:5:50); 
> > Models = Array{DecisionTree.Ensemble}(length(num_of_trees),length(num_of_features)); 
> > Predictions = Array{Array{Float64,1}}(length(num_of_trees),length(num_of_features)); 
> > RMSEs = Array{Float64}(length(num_of_trees),length(num_of_features)); 
> > train = rand(Bernoulli(0.8), size(Y)) .== 1; 
> > 
> > for i in 1:length(num_of_trees) 
> >         for j in 1:length(num_of_features) 
> >                 Models[i,j] = build_forest(Y[train],DataSTD[train,:],num_of_features[j],num_of_trees[i]); 
> >                 Predictions[i,j] = apply_forest(Models[i,j], DataSTD[!train,:]); 
> >                 RMSEs[i,j] = root_mean_squared_error(Y[!train], Predictions[i,j]); 
> >                 println("\n", Models[i,j]) 
> >                 println("Features: ",num_of_features[j]) 
> >                 println("RMSE: ",RMSEs[i,j]) 
> >                 display(confusion_matrix_regression(Y[!train],Predictions[i,j],10)) 
> >         end 
> >         save("Models_run1.jld", "Models", Models, "Features", num_of_features, 
> >              "Predictions", Predictions, "RMSEs", RMSEs, "Bernoulli", train); 
> > end 
> > 
> > Finishing the inner for loop takes around 7 hours, which is not a 
> > surprise, but the save function runs for hours as well. The file keeps 
> > slowly increasing in size, so I think something is happening, but I'm not 
> > sure what. Three hours after the inner loop finished, I still haven't 
> > reached the second iteration of my outer loop. I plan to leave it running 
> > overnight to see whether it fails or finishes. Any idea why this is 
> > happening? 
>
>
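
For anyone reading this thread later: the idea behind the quoted TileTree 
serializer (turn a pointer-heavy tree into a few flat vectors before saving, 
and rebuild it after loading) can be sketched on a toy binary decision tree. 
Everything named here (Node, Leaf, Split, FlatTree, to_flat, from_flat) is 
hypothetical and is not taken from DecisionTree.jl or from the quoted code. 
The sketch uses current Julia syntax (struct rather than type) and assumes 
feature indices start at 1, so that 0 can mark a leaf. The JLD hooks appear 
only as comments because they require the JLD package and are untested here.

```julia
# Hypothetical toy tree: every node is either a Leaf or a Split.
abstract type Node end

struct Leaf <: Node
    value::Float64          # prediction stored at the leaf
end

struct Split <: Node
    feature::Int            # index of the feature tested (assumed >= 1)
    threshold::Float64
    left::Node
    right::Node
end

# Flat stand-in for the tree: one entry per node, in preorder.
struct FlatTree
    feature::Vector{Int}    # 0 marks a leaf
    value::Vector{Float64}  # threshold for a split, prediction for a leaf
end

# Append a leaf to the flat vectors.
function flatten!(flat::FlatTree, n::Leaf)
    push!(flat.feature, 0)
    push!(flat.value, n.value)
    return flat
end

# Append a split, then its left and right subtrees (preorder).
function flatten!(flat::FlatTree, n::Split)
    push!(flat.feature, n.feature)
    push!(flat.value, n.threshold)
    flatten!(flat, n.left)
    flatten!(flat, n.right)
    return flat
end

to_flat(root::Node) = flatten!(FlatTree(Int[], Float64[]), root)

# Rebuild the subtree whose root sits at index i.
# Returns the node and the index of the first entry after that subtree.
function rebuild(flat::FlatTree, i::Int)
    if flat.feature[i] == 0
        return Leaf(flat.value[i]), i + 1
    end
    left, j = rebuild(flat, i + 1)
    right, k = rebuild(flat, j)
    return Split(flat.feature[i], flat.value[i], left, right), k
end

from_flat(flat::FlatTree) = first(rebuild(flat, 1))

# With JLD loaded, the hooks would take the same shape as in the quoted code
# (assumption, not run here):
#   JLD.writeas(root::Node) = to_flat(root)
#   JLD.readas(flat::FlatTree) = from_flat(flat)

tree = Split(3, 0.5, Leaf(1.0), Split(1, 2.0, Leaf(-1.0), Leaf(4.0)))
flat = to_flat(tree)
```

The point of the design is that the stored object holds two plain vectors 
instead of one small object per node. A forest would presumably be handled the 
same way, with one FlatTree per tree or with all trees concatenated plus an 
offsets vector.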
