I could use some advice on how to randomly partition large arrays into train and holdout sets.
I am working on a machine-learning problem where I want to provide convenience methods for dealing with large in-memory data sets (i.e. a large dense matrix `X` and a target vector `t`).
In particular, I want to write a method `splitTrainTest` that splits my initial data into a training set and a holdout set.
The naive way would be something along the lines of the following.
Let `X` be a 10x10000000 `Array{Float64,2}` where the rows are the features and the columns are the observations. (I am working in Julia 0.3.11.)
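In case someone wants to reproduce the timings, the data can be set up with dummy values like this (the actual values don't matter, only the shape does):

```julia
X = rand(10, 10000000)   # 10 features x 10^7 observations, ~800 MB of Float64
t = rand(10000000)       # matching target vector (not used in the timings below)
```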
```julia
julia> @time trainIdx = collect(RandomSub(10000000, 7000000, 1))[1]
elapsed time: 1.525156191 seconds (136000576 bytes allocated, 13.39% gc time)

julia> @time testIdx = setdiff(1:10000000, trainIdx)
elapsed time: 9.610973618 seconds (1499169872 bytes allocated, 22.34% gc time)

julia> @time train = X[:, trainIdx]
elapsed time: 0.236620469 seconds (560000256 bytes allocated)

julia> @time test = X[:, testIdx]
elapsed time: 0.151963809 seconds (240000256 bytes allocated)
```
Now this doesn't seem very memory-efficient: the `setdiff` call alone allocates about 1.5 GB, and the two indexing operations copy another 800 MB between them.
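For what it's worth, drawing a single permutation with `randperm` and slicing it gets rid of the `setdiff` cost, but the two copying indexing operations remain, so it only fixes part of the problem. A sketch (untimed):

```julia
# Variation on the naive approach: one permutation, sliced into index sets.
p = randperm(10000000)
trainIdx = p[1:7000000]
testIdx  = p[7000001:end]
train = X[:, trainIdx]   # still copies ~560 MB
test  = X[:, testIdx]    # still copies ~240 MB
```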
The best solution I could come up with is to shuffle `X` in place and then use two array views to define the train and test sets:
```julia
# In-place Fisher-Yates-style shuffle over the columns of A
function shuffleCols!(A::Matrix)
    rows = size(A, 1)
    cols = size(A, 2)
    for c = 1:cols
        i = rand(c:cols)    # pick a random column from the remaining ones
        for r = 1:rows      # swap columns c and i element by element
            A[r,c], A[r,i] = A[r,i], A[r,c]
        end
    end
    A
end
```
```julia
julia> @time shuffleCols!(X)
elapsed time: 1.202112921 seconds (80 bytes allocated)

julia> @time train = view(X, :, 1:7000000)
elapsed time: 1.7596e-5 seconds (192 bytes allocated)

julia> @time test = view(X, :, 7000001:10000000)
elapsed time: 1.3097e-5 seconds (192 bytes allocated)
```
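Putting the pieces together, this is roughly the `splitTrainTest` I have in mind, assuming `view` from ArrayViews.jl and ignoring the target vector `t` for now (it would have to be reordered with exactly the same swaps):

```julia
# Rough sketch; note that it is destructive: X is reordered in place.
# `at` is the fraction of observations assigned to the training set.
function splitTrainTest(X::Matrix, at::Float64)
    shuffleCols!(X)
    n = size(X, 2)
    k = ifloor(n * at)              # ifloor is the Julia 0.3 spelling
    view(X, :, 1:k), view(X, :, k+1:n)
end
```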
I wonder if this is a clean way to tackle the problem. Does anyone have a better idea on how to approach this? Also, I am not quite confident that my shuffle method is sufficient to produce a truly random partition, so if someone has an opinion on the shuffle function I would be grateful as well.
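For completeness, this is the kind of quick sanity check I ran to convince myself that the shuffle at least leaves the columns themselves intact; it says nothing about whether the resulting permutation is uniform, which is really what I am asking about:

```julia
srand(1234)          # fixed seed so the check is repeatable
A = rand(3, 6)
B = copy(A)
shuffleCols!(B)
# Weak but cheap consistency check: if columns are only swapped, never
# altered, the multiset of column sums must stay the same.
@assert sort(vec(sum(A, 1))) == sort(vec(sum(B, 1)))
```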