I could use some advice on how to randomly partition large arrays into train and holdout sets

I am working on a machine-learning-related problem where I want to provide convenience methods for dealing with large in-memory data sets (i.e. a large dense matrix X and a target vector t). In particular, I want to write a method splitTrainTest that splits my initial data into a training and a holdout set.
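
For concreteness, the call I have in mind would look roughly like this (the name and signature are just a working draft):

    # Hypothetical interface (working draft): split the columns of X and the
    # entries of t into a 70% training / 30% holdout partition.
    train, tTrain, test, tTest = splitTrainTest(X, t, 0.7)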

The naive way would be something along the lines of the following.

Let X be a 10x10000000 Array{Float64,2}, where the rows are the features and the columns are the observations. (I am working in Julia 0.3.11.)

    julia> @time trainIdx = collect(RandomSub(10000000, 7000000, 1))[1]
    elapsed time: 1.525156191 seconds (136000576 bytes allocated, 13.39% gc time)

    julia> @time testIdx = setdiff(1:10000000, trainIdx)
    elapsed time: 9.610973618 seconds (1499169872 bytes allocated, 22.34% gc time)

    julia> @time train = X[:, trainIdx]
    elapsed time: 0.236620469 seconds (560000256 bytes allocated)

    julia> @time test = X[:, testIdx]
    elapsed time: 0.151963809 seconds (240000256 bytes allocated)

Now this doesn't seem very memory efficient: the setdiff alone allocates roughly 1.5 GB, and indexing with the index vectors copies both subsets on top of that.
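
I suppose I could at least avoid the setdiff by drawing a single random permutation and slicing it, something like the sketch below (this still allocates one Int vector of length n, and indexing with the slices would still copy the data):

    n, k = 10000000, 7000000
    p = randperm(n)           # one random permutation of 1:n
    trainIdx = p[1:k]         # first k entries become the training indices
    testIdx  = p[(k+1):n]     # remaining entries form the holdout indices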

The best solution I could come up with is to shuffle X in place and then use two array views to define the train and test sets:

    function shuffleCols!(A::Matrix)
        rows = size(A, 1)
        cols = size(A, 2)
        for c = 1:cols
            # swap column c with a uniformly chosen column from c:cols
            i = rand(c:cols)
            for r = 1:rows
                A[r,c], A[r,i] = A[r,i], A[r,c]
            end
        end
        A
    end

    julia> @time shuffleCols!(X)
    elapsed time: 1.202112921 seconds (80 bytes allocated)

    julia> @time train = view(X, :, 1:7000000)
    elapsed time: 1.7596e-5 seconds (192 bytes allocated)

    julia> @time test = view(X, :, 7000001:10000000)
    elapsed time: 1.3097e-5 seconds (192 bytes allocated)
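
For completeness, here is a rough sketch of how I imagine splitTrainTest would build on this (the name and signature are just my working draft, using the same view as above). One thing I had to account for is that the target vector t needs to be permuted with the same swaps so it stays aligned with the columns of X:

    # Working draft: shuffle X and t in place with the same permutation,
    # then return views onto the two partitions (no copying of the data).
    function splitTrainTest!(X::Matrix, t::Vector, trainFrac::Float64)
        rows, cols = size(X, 1), size(X, 2)
        for c = 1:cols
            i = rand(c:cols)
            for r = 1:rows
                X[r,c], X[r,i] = X[r,i], X[r,c]
            end
            t[c], t[i] = t[i], t[c]   # keep targets aligned with the columns
        end
        k = ifloor(trainFrac * cols)  # number of training observations (ifloor is 0.3-era)
        train, tTrain = view(X, :, 1:k), view(t, 1:k)
        test,  tTest  = view(X, :, (k+1):cols), view(t, (k+1):cols)
        train, tTrain, test, tTest
    end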

I wonder whether this is a clean way to tackle the issue. Does anyone have a better idea for how to approach this? Also, I am not quite confident that my shuffle method is sufficient to produce a uniformly random partition (as far as I can tell it amounts to a Fisher-Yates shuffle over the columns), so I would also be grateful for any opinions on the shuffle function.
