The built-in shuffle!
<https://github.com/JuliaLang/julia/blob/bf2e1b54b96ace753ea6cb3f24904151f37f879b/base/random.jl#L1328-L1335>
function implements the Fisher-Yates shuffle
<https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle>, which does
produce a uniformly random permutation of an array. This seems pretty close
to what your version does, but very slight differences in the algorithm can
make it subtly incorrect. For speed, you could short-circuit the
Fisher-Yates shuffle once you've shuffled enough elements. You might want
to just use the shuffle to select the indices of the columns and then do
the selection after the fact (although the F-Y algorithm does move each
value at most once, so if you're going to move things anyway, maybe it's
best to just do it).
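To make the short-circuiting idea concrete, here is a minimal sketch (the name partial_shuffle! is mine, not from Base): a Fisher-Yates loop that stops after k steps, so the first k entries end up as a uniform random sample of the vector without replacement, and each element moves at most once.

```julia
# Fisher-Yates shuffle stopped after k steps: positions 1:k hold a
# uniform random sample (without replacement) of the whole vector.
function partial_shuffle!(v::AbstractVector, k::Integer)
    n = length(v)
    for i = 1:k
        j = rand(i:n)            # pick uniformly from the not-yet-fixed tail
        v[i], v[j] = v[j], v[i]  # move the chosen element into position i
    end
    v
end

# Shuffle only as many column indices as the training set needs,
# then split with index vectors (or views) after the fact:
idx = partial_shuffle!(collect(1:10000000), 7000000)
trainIdx = idx[1:7000000]
testIdx  = idx[7000001:end]
```

Sampling indices this way avoids touching the 10x10000000 matrix at all until you actually index into it.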

On Thu, Sep 3, 2015 at 2:18 PM, Christof Stocker <[email protected]> wrote:

> I could use some advice on how to randomly partition large arrays into
> train and holdout set
>
> I am working on a machine learning related problem where I want to provide
> convenience methods to deal with large in-memory data sets (i.e. a large
> dense matrix X and a target vector t)
> In particular I want to write a method splitTrainTest that splits my
> initial data into training and holdout set.
>
> The naive way would be something along the lines of the following.
>
> Let X be a 10x10000000 Array{Float64,2} where the rows are the features
> and the cols are the observations. (I am working in Julia 0.3.11)
>
> julia> @time trainIdx = collect(RandomSub(10000000, 7000000, 1))[1]
> elapsed time: 1.525156191 seconds (136000576 bytes allocated, 13.39% gc time)
>
> julia> @time testIdx = setdiff(1:10000000, trainIdx)
> elapsed time: 9.610973618 seconds (1499169872 bytes allocated, 22.34% gc time)
>
> julia> @time train = X[:, trainIdx]
> elapsed time: 0.236620469 seconds (560000256 bytes allocated)
>
> julia> @time test = X[:, testIdx]
> elapsed time: 0.151963809 seconds (240000256 bytes allocated)
>
> Now this doesn’t seem very memory efficient.
>
> The best solution I could come up with is to shuffle X in place and then
> use two array views to define the train and test sets
>
> function shuffleCols!(A::Matrix)
>   rows = size(A, 1)
>   cols = size(A, 2)
>   for c = 1:cols
>     i = rand(c:cols)
>     for r = 1:rows
>       A[r,c], A[r,i] = A[r,i], A[r,c]
>     end
>   end
>   A
> end
>
> julia> @time shuffleCols!(X)
> elapsed time: 1.202112921 seconds (80 bytes allocated)
>
> julia> @time train = view(X, :, 1:7000000)
> elapsed time: 1.7596e-5 seconds (192 bytes allocated)
>
> julia> @time test = view(X, :, 7000001:10000000)
> elapsed time: 1.3097e-5 seconds (192 bytes allocated)
>
> I wonder if this is a clean way to tackle this issue. Does anyone have a
> better idea on how to approach this? Also, I am not quite confident that
> my shuffle method is sufficient to produce a randomly partitioned dataset,
> so if someone has an opinion on the shuffle function I would be grateful.
>
