I could use some advice on how to randomly partition large arrays into train and holdout sets.
I am working on a machine-learning problem where I want to provide convenience methods for dealing with large in-memory data sets (i.e. a large dense matrix `X` and a target vector `t`).
In particular, I want to write a method `splitTrainTest` that splits my initial data into a training set and a holdout set.
The naive way would be something along the lines of the following.
Let `X` be a 10x10000000 `Array{Float64,2}` where the rows are the features and the columns are the observations. (I am working in Julia 0.3.11.)
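In case someone wants to reproduce the timings, the data can be set up with dummy values like this (the actual values don't matter, only the shape does):

```julia
X = rand(10, 10000000)   # 10 features x 10^7 observations, ~800 MB of Float64
t = rand(10000000)       # matching target vector (not used in the timings below)
```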
```julia
julia> @time trainIdx = collect(RandomSub(10000000, 7000000, 1))[1]
elapsed time: 1.525156191 seconds (136000576 bytes allocated, 13.39% gc time)

julia> @time testIdx = setdiff(1:10000000, trainIdx)
elapsed time: 9.610973618 seconds (1499169872 bytes allocated, 22.34% gc time)

julia> @time train = X[:, trainIdx]
elapsed time: 0.236620469 seconds (560000256 bytes allocated)

julia> @time test = X[:, testIdx]
elapsed time: 0.151963809 seconds (240000256 bytes allocated)
```
Now this doesn't seem very memory-efficient: the `setdiff` call alone allocates about 1.5 GB, and the two indexing operations copy another 800 MB between them.
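For what it's worth, drawing a single permutation with `randperm` and slicing it gets rid of the `setdiff` cost, but the two copying indexing operations remain, so it only fixes part of the problem. A sketch (untimed):

```julia
# Variation on the naive approach: one permutation, sliced into index sets.
p = randperm(10000000)
trainIdx = p[1:7000000]
testIdx  = p[7000001:end]
train = X[:, trainIdx]   # still copies ~560 MB
test  = X[:, testIdx]    # still copies ~240 MB
```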
The best solution I could come up with is to shuffle `X` in place and then use two array views to define the train and test sets:
```julia
# In-place Fisher-Yates-style shuffle over the columns of A
function shuffleCols!(A::Matrix)
    rows = size(A, 1)
    cols = size(A, 2)
    for c = 1:cols
        i = rand(c:cols)    # pick a random column from the remaining ones
        for r = 1:rows      # swap columns c and i element by element
            A[r,c], A[r,i] = A[r,i], A[r,c]
        end
    end
    A
end
```
```julia
julia> @time shuffleCols!(X)
elapsed time: 1.202112921 seconds (80 bytes allocated)

julia> @time train = view(X, :, 1:7000000)
elapsed time: 1.7596e-5 seconds (192 bytes allocated)

julia> @time test = view(X, :, 7000001:10000000)
elapsed time: 1.3097e-5 seconds (192 bytes allocated)
```
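Putting the pieces together, this is roughly the `splitTrainTest` I have in mind, assuming `view` from ArrayViews.jl and ignoring the target vector `t` for now (it would have to be reordered with exactly the same swaps):

```julia
# Rough sketch; note that it is destructive: X is reordered in place.
# `at` is the fraction of observations assigned to the training set.
function splitTrainTest(X::Matrix, at::Float64)
    shuffleCols!(X)
    n = size(X, 2)
    k = ifloor(n * at)              # ifloor is the Julia 0.3 spelling
    view(X, :, 1:k), view(X, :, k+1:n)
end
```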
I wonder if this is a clean way to tackle the problem. Does anyone have a better idea on how to approach this? Also, I am not quite confident that my shuffle method is sufficient to produce a truly random partition, so if someone has an opinion on the shuffle function I would be grateful as well.
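For completeness, this is the kind of quick sanity check I ran to convince myself that the shuffle at least leaves the columns themselves intact; it says nothing about whether the resulting permutation is uniform, which is really what I am asking about:

```julia
srand(1234)          # fixed seed so the check is repeatable
A = rand(3, 6)
B = copy(A)
shuffleCols!(B)
# Weak but cheap consistency check: if columns are only swapped, never
# altered, the multiset of column sums must stay the same.
@assert sort(vec(sum(A, 1))) == sort(vec(sum(B, 1)))
```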