If you expect that the vectors will *not* be in the Set most of the time, you should try using a bloom filter. There is a BloomFilters.jl package that should get you started. If that's not fast enough, it might also be useful to tailor the bloom filter implementation to your data. Note that you'll still need a Set of some sort.
Cheers, Kevin On Sun, Mar 2, 2014 at 1:08 AM, Ivar Nesje <[email protected]> wrote: > There is never any one "most performant way in julia". There is often a > pretty/convenient way, that fits well with the language idioms, standard > library and can be implemented in a oneliner, and while a lot of effort has > gone into making that solution performant, it is seldom the best possible. > > The most performant way depends on the number and size of the vectors, and > how different they are. If there is a high probability for the first > element to be unique, you might store a vector of vectors in a dictionary > keyed by the first element. There are also lots of other tricks that > exploits the size and shape of data to get performance. Hashing and random > lookups are often slow, and even though your algorithm book says access is > O(1), it is slower than O(n^2) algorithms for small n. > > The standard Hash function for abstractarray is defined in dict.jl > > function hash(a::AbstractArray) > h::Uint = hash(size(a))+1 > for i=1:length(a) > h = bitmix(h,int(hash(a[i]))) > end > return h > end > > You can actually just use arrays as a key in Dict or Set as they are, but > there are some caveats, that you need to take into account. The dict does > not take a copy of the key, and if you mutate the key, you will lose lookup > access to the value stored. I (and the designers of Python), think this is > a problem that we should not give programmers the possibility to have, but > Stefan disagrees and have tried to make hashing just work. You will then > have to deal with the consequences if you do not understand their thinking. > > Here is an example on how hashing mutable collections work. > > julia> a = [1:10]; > julia> b = Dict{AbstractArray, Int}(); > julia> b[a] = 5 > 5 > julia> b[1:10] > 5 > julia> b[a] > 5 > julia> a[5] = 9 > 9 > julia> b[a] > ERROR: key not found: [1,2,3,4,9,6,7,8,9,10] > in getindex at dict.jl:582 > > julia> b[1:10] > ERROR: key not found: 1:10 > in getindex at dict.jl:582 > > julia> b > [[1,2,3,4,9,6,7,8,9,10]=>5] > > If you change the line where you use a as a key to > julia> b[copy(a)] = 5 > you get more expected results. > > Ivar > > kl. 09:02:52 UTC+1 søndag 2. mars 2014 skrev Jonathan Malmaud følgende: > >> What's the most performant way in Julia to maintain a set of vectors and >> check if a given vector is in the set? I was thinking of converting the >> vectors to tuples and then using "in(::Set, ...)" but I'm worried about the >> performance penalty of continuously converting vectors to tuples. >> >
