Christof: have you looked through my DataSamplers in
https://github.com/tbreloff/OnlineAI.jl/blob/master/src/nnet/data.jl? It's
not a finished product yet, but I'm curious whether it meets your needs (or
whether there are small changes that would make it better). If you have
questions or comments, let me know.

On Fri, Sep 11, 2015 at 7:55 PM, Stefan Karpinski <[email protected]>
wrote:

> The built-in shuffle!
> <https://github.com/JuliaLang/julia/blob/bf2e1b54b96ace753ea6cb3f24904151f37f879b/base/random.jl#L1328-L1335>
> function implements the Fisher-Yates shuffle
> <https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle>, which does
> produce a uniformly random permutation of an array. This seems to be pretty
> close to what your version does, but very slight differences in the
> algorithm can cause it to be subtly incorrect. For speed, you could
> short-circuit the Fisher-Yates shuffle once you've shuffled enough
> elements. You might want to just use the shuffle to select the indices of
> the columns and then do the selection after the fact (although the F-Y
> algorithm moves each value at most once, so if you're going to move things
> anyway, maybe it's best to just do it).
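>
> For concreteness, here's a minimal sketch of the short-circuited version
> (partialshuffle! is just an illustrative name, not anything in Base):
> after running only the first k steps of Fisher-Yates, idx[1:k] is a
> uniformly random k-subset of 1:n and idx[k+1:end] is its complement, so
> you get both trainIdx and testIdx from one pass, with no setdiff:
>
> # Fisher-Yates shuffle, stopped after the first k swaps.
> function partialshuffle!(idx::Vector{Int}, k::Int)
>   n = length(idx)
>   for c = 1:k
>     # swap slot c with a random slot from c:n
>     i = rand(c:n)
>     idx[c], idx[i] = idx[i], idx[c]
>   end
>   idx
> end
>
> idx = collect(1:10000000)
> partialshuffle!(idx, 7000000)
> trainIdx = idx[1:7000000]
> testIdx = idx[7000001:end]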
>
> On Thu, Sep 3, 2015 at 2:18 PM, Christof Stocker <
> [email protected]> wrote:
>
>> I could use some advice on how to randomly partition large arrays into a
>> training set and a holdout set.
>>
>> I am working on a machine-learning problem where I want to provide
>> convenience methods for dealing with large in-memory data sets (i.e. a
>> large dense matrix X and a target vector t). In particular, I want to
>> write a method splitTrainTest that splits my initial data into a training
>> set and a holdout set.
>>
>> The naive way would be something along the lines of the following.
>>
>> Let X be a 10x10000000 Array{Float64,2} where the rows are the features
>> and the columns are the observations. (I am working in Julia 0.3.11.)
>>
>> julia> @time trainIdx = collect(RandomSub(10000000, 7000000, 1))[1]
>> elapsed time: 1.525156191 seconds (136000576 bytes allocated, 13.39% gc time)
>>
>> julia> @time testIdx = setdiff(1:10000000, trainIdx)
>> elapsed time: 9.610973618 seconds (1499169872 bytes allocated, 22.34% gc time)
>>
>> julia> @time train = X[:, trainIdx]
>> elapsed time: 0.236620469 seconds (560000256 bytes allocated)
>>
>> julia> @time test = X[:, testIdx]
>> elapsed time: 0.151963809 seconds (240000256 bytes allocated)
>>
>> Now this doesn’t seem very memory efficient.
>>
>> The best solution I could come up with is to shuffle X in-place and then
>> use two array views to define the train and test sets:
>>
>> # Shuffle the columns of A in-place by swapping each column with a
>> # randomly chosen column at or after it.
>> function shuffleCols!(A::Matrix)
>>   rows = size(A, 1)
>>   cols = size(A, 2)
>>   for c = 1:cols
>>     i = rand(c:cols)
>>     for r = 1:rows
>>       A[r,c], A[r,i] = A[r,i], A[r,c]
>>     end
>>   end
>>   A
>> end
>>
>> julia> @time shuffleCols!(X)
>> elapsed time: 1.202112921 seconds (80 bytes allocated)
>>
>> julia> @time train = view(X, :, 1:7000000)
>> elapsed time: 1.7596e-5 seconds (192 bytes allocated)
>>
>> julia> @time test = view(X, :, 7000001:10000000)
>> elapsed time: 1.3097e-5 seconds (192 bytes allocated)
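>>
>> Putting the pieces together, here is a sketch of what I have in mind for
>> splitTrainTest (the names splitTrainTest! and trainFraction are just
>> placeholders; view is ArrayViews.jl's, as above, and t is shuffled with
>> the same swaps so that targets stay aligned with their observations):
>>
>> # Shuffle X's columns and t's entries in-place with identical swaps,
>> # then return zero-copy train/test views.
>> function splitTrainTest!(X::Matrix, t::Vector, trainFraction::Float64)
>>   rows, cols = size(X)
>>   for c = 1:cols
>>     i = rand(c:cols)
>>     for r = 1:rows
>>       X[r,c], X[r,i] = X[r,i], X[r,c]
>>     end
>>     t[c], t[i] = t[i], t[c]
>>   end
>>   k = ifloor(cols * trainFraction)
>>   Xtrain = view(X, :, 1:k)
>>   ttrain = view(t, 1:k)
>>   Xtest = view(X, :, k+1:cols)
>>   ttest = view(t, k+1:cols)
>>   Xtrain, ttrain, Xtest, ttest
>> end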
>>
>> I wonder if this is a clean way to tackle the issue. Does anyone have a
>> better idea on how to approach this? Also, I am not quite confident that
>> my shuffle method is sufficient to produce a truly random partition of
>> the dataset, so if someone has an opinion on the shuffle function, I
>> would also be grateful.
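>>
>> One way I could sanity-check the shuffle is to tally the column orders
>> it produces on a toy matrix; a uniform shuffle should hit all six
>> orderings of three columns about equally often. A minimal sketch, using
>> shuffleCols! from above:
>>
>> counts = Dict{Vector{Int},Int}()
>> for trial = 1:60000
>>   A = [1 2 3]  # 1x3 matrix; the columns carry the labels
>>   shuffleCols!(A)
>>   p = vec(A)
>>   counts[p] = get(counts, p, 0) + 1
>> end
>> for (p, c) in counts
>>   println(p, " => ", c)  # each of the 6 counts should be near 10000
>> end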
>>
>
>
