http://en.wikipedia.org/wiki/Associative_array
-- John

On Dec 2, 2014, at 9:50 AM, Ivar Nesje wrote:

> It's not the obvious choice to me either, but it is in the docs, and has
> been since I read it the first time 1.5 years ago.
>
> On Tuesday, December 2, 2014 at 16:10:34 UTC+1, David van Leeuwen wrote:
> Thanks,
>
> On Tuesday, December 2, 2014 3:23:49 PM UTC+1, Ivar Nesje wrote:
> I think the proposed AbstractDict is the same abstraction that we
> currently call Associative.
>
> Ah---never too late to learn something new. I've seen the AbstractString
> vs Integer discussion, but didn't realize Associative fit into this
> pattern (I can't even find Associative in my copy of the documentation).
>
> I could try to replace all the references to Dict by Associative, then,
> and see what happens.
>
> ---david
>
> On Tuesday, December 2, 2014 at 14:29:39 UTC+1, David van Leeuwen wrote:
> Hi,
>
> On Sunday, November 30, 2014 11:22:39 AM UTC+1, Milan Bouchet-Valat wrote:
> On Wednesday, November 26, 2014 at 09:30 -0800, David van Leeuwen wrote:
> > Hello again,
> >
> > I worked hard on NamedArrays.jl to solve the problems indicated below:
> >
> > On Monday, November 10, 2014 1:43:57 AM UTC+1, Dahua Lin wrote:
> >     NamedArrays.jl generally goes along this way. However, it remains
> >     limited in two aspects:
> >
> >     1. Some fields in NamedArrays are not declared of specific types.
> >     In particular, the field `dicts` is of the type `Vector{Dict}`,
> >     and the use of this field is on the critical path when looping
> >     over the table, e.g. when counting. This would potentially lead
> >     to substantial impact on performance.
> >
> > A NamedArray is now parameterized by the complete set of Dicts that
> > are used for the indices. It took me a while to get the constructors
> > right; in intermediate stages of the development I ended up with
> > VarType parameters of NamedArray.
> >
> >     2. Currently, it only accepts a limited set of types for indices,
> >     e.g. Real and String.
> >     But in some cases, people may go beyond this. I don't think we
> >     have to impose this limit.
> >
> > The indexing code is completely overhauled now, and the indices()
> > methods are now explicitly parameterized by the dictionary key type,
> > so their calls should be efficient. It should now be possible to
> > index a NamedArray with any type, although some types
> > (AbstractVector, Range, Int) are interpreted specially.
> >
> > As a consequence, the type of the key for the indices cannot be
> > altered after initialization of a NamedArray (the names themselves
> > still can). Thus, if you want other types than ASCIIString (which is
> > used to give default names to indices), you need to call a
> > constructor with your names prepared instead of filling them in
> > afterwards.
> >
> > You can try the code for julia-0.3 with Pkg.checkout("NamedArrays"),
> > or read it at GitHub.
>
> This looks cool. Have you considered allowing any object other than Dict
> to be passed at construction? This was requested by Simon here (and
> comments below):
> https://github.com/JuliaStats/StatsBase.jl/issues/32#issuecomment-43443093
>
> I haven't considered that yet. I've restructured the indexing since, and
> removing all function prototype ambiguities has become a headache. I
> fixed it for julia-0.3 this morning, but now 0.4-dev gives me another
> gazillion ambiguities...
>
> The idea is that any type could be used instead of a Dict, as long as it
> can be indexed with a key and return the index. For small NamedArrays,
> doing a linear search on an array is faster than using a Dict. And when
>
> Would this not be better to solve at a lower level, i.e., by introducing
> an AbstractDict, and let Dict <: AbstractDict
>
> computing frequency tables from PooledDataArrays, we could reuse the
> existing pool instead of creating a Dict from it, which would save some
> memory.
>
> and is the pool interface the same as a Dict interface, then?
>
> Also, John suggested that the array that a NamedArray wraps could be of
> any AbstractArray type, not just Array. Sounds like a good idea (e.g. to
> wrap a sparse matrix).
>
> Oh yes, that is a good idea. It might be that some functions defined for
> Array are not defined for other AbstractArray types, where the current
> implementation assumes this. Also---this would allow for a
> NamedArray(::NamedArray), which, I would guess, leads to another level
> of ambiguities in the implementation.
>
> ---david
>
> Regards
>
> > Cheers,
> >
> > ---david
> >
> >     Dahua
> >
> >     On Monday, November 10, 2014 8:35:32 AM UTC+8, Dahua Lin wrote:
> >         I have been observing an interesting difference between
> >         people coming from stats and machine learning.
> >
> >         Stats people tend to favor the approach that allows one to
> >         directly use the category names to index the table, e.g.
> >         A["apple"]. This tendency is clearly reflected in the design
> >         of R, where one can attach a name to everything.
> >
> >         In machine learning practice, meanwhile, it is a common
> >         convention to just encode categories into integers, and
> >         simply use an ordinary array to represent a counting table.
> >         While this is a little bit inconvenient in an interactive
> >         environment, it is generally more efficient when you have to
> >         deal with these categories over a large number of samples.
> >
> >         These differences aside, I believe, however, that there
> >         exists a very generic approach to this problem -- a
> >         multi-dimensional associative map, which allows one to write
> >         A[i1, i2, ...] where the indices can be arbitrary hashable &
> >         equality-comparable instances, including integers, strings,
> >         symbols, among many other things.
> >
> >         A multi-dimensional associative map can be considered as a
> >         multi-dimensional generalization of dictionaries, which can
> >         be easily implemented via a multidimensional array and
> >         several dictionaries, one for each dimension, to map
> >         user-side indexes to integer indexes.
> >
> >         - Dahua
> >
> >         On Monday, November 10, 2014 8:12:54 AM UTC+8, David van
> >         Leeuwen wrote:
> >             Hi,
> >
> >             On Sunday, November 9, 2014 5:10:19 PM UTC+1, Milan
> >             Bouchet-Valat wrote:
> >                 Actually I didn't do it because NamedArrays.jl didn't
> >                 work well on 0.3 when I first worked on the package.
> >                 Now I see the tests are still failing. Do you know
> >                 what is needed to make them work?
> >
> >             What exactly is not working? Could you maybe file an
> >             issue? Travis tells me all is fine.
> >
> >             ---david
> >
> >                 Another point is that I think this deserves going
> >                 into StatsBase, but before that we need everybody to
> >                 agree on a design for NamedArrays.
> >
> >                 Regards
> >
> > > On Sunday, November 9, 2014 4:26:45 PM UTC+1, Milan Bouchet-Valat
> > > wrote:
> > > On Thursday, November 6, 2014 at 11:17 -0800, Conrad Stack wrote:
> > > > I was also looking for a function like this, but could not find
> > > > one in docs.julialang.org. I was doing this (v0.4.0-dev), for
> > > > anyone who is interested:
> > > >
> > > >     example = rand(1:10,100)
> > > >     uexample = sort(unique(example))
> > > >     counts = map(x->count(y->x==y,example),uexample)
> > > >
> > > > It's pretty ugly, so thanks, Johan, for pointing out the
> > > > StatsBase->countmap
> > > I've also put together a small package precisely aimed at offering
> > > an equivalent of R's table():
> > >
> > > https://github.com/nalimilan/Tables.jl
> > >
> > > But there's a more general issue about how to handle arrays with
> > > dimension names in Julia.
> > > NamedArrays.jl (which is used in my package) attempts to tackle
> > > this issue, but I don't think we've reached a consensus yet about
> > > the best solution.
> > >
> > > Regards
> > >
> > > > On Sunday, August 17, 2014 9:56:29 AM UTC-4, Johan Sigfrids wrote:
> > > >     I think countmap comes closest to giving you what you want:
> > > >
> > > >         using StatsBase
> > > >         data = sample(["a", "b", "c"], 20)
> > > >         countmap(data)
> > > >
> > > >         Dict{ASCIIString,Int64} with 3 entries:
> > > >           "c" => 3
> > > >           "b" => 10
> > > >           "a" => 7
> > > >
> > > >     On Sunday, August 17, 2014 4:45:21 PM UTC+3, Florian Oswald
> > > >     wrote:
> > > >         Hi
> > > >
> > > >         I'm looking for the best way to count how many times a
> > > >         certain value x_i appears in vector x, where x could be
> > > >         integers, floats, strings. In R I would do table(x). I
> > > >         found StatsBase.counts(x,k) but I'm a bit confused by k
> > > >         (where k goes into 1:k, i.e. the vector is scanned to
> > > >         find how many elements fall at each point of 1:k). Most
> > > >         of the time I don't know k, and in fact I would do
> > > >         table(x) just to find out what k was. Apart from that, I
> > > >         don't think I could use this with strings, as I can't
> > > >         construct a range object from strings.
> > > >
> > > >         I'm wondering whether a method
> > > >         StatsBase.counts(x::Vector) just returning the frequency
> > > >         of each element appearing would be useful.
> > > >
> > > >         The same applies to Base.hist if I understand correctly.
> > > >         I just don't want to have to specify the edges of bins.
> > > >
> > > ...
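
The multi-dimensional associative map described in the thread (one ordinary array plus one dictionary per dimension, each mapping a user-side key to an integer index) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the NamedArrays.jl implementation: the type name `AssocTable` and its field names are hypothetical, and the code assumes a current Julia (1.x), where the `Associative` abstract type discussed above has since been renamed `AbstractDict`. The tuple of dictionaries is a type parameter, so the lookups stay concretely typed, which is the concern raised about `Vector{Dict}`.

```julia
# Hypothetical sketch: an N-dimensional table indexed by arbitrary keys.
# One Dict per dimension maps a user-side key to an integer index.
struct AssocTable{T,N,D<:NTuple{N,AbstractDict}}
    data::Array{T,N}
    dicts::D   # tuple of concretely typed Dicts, one per dimension
end

# Build a zero-filled table from one vector of labels per dimension.
function AssocTable(::Type{T}, labels::AbstractVector...) where {T}
    dicts = map(l -> Dict(key => i for (i, key) in enumerate(l)), labels)
    data = zeros(T, map(length, labels)...)
    return AssocTable(data, dicts)
end

# A[k1, k2, ...] translates each key through its dimension's Dict.
function Base.getindex(A::AssocTable{T,N}, ks::Vararg{Any,N}) where {T,N}
    idx = map((d, k) -> d[k], A.dicts, ks)
    return A.data[idx...]
end

function Base.setindex!(A::AssocTable{T,N}, v, ks::Vararg{Any,N}) where {T,N}
    idx = map((d, k) -> d[k], A.dicts, ks)
    A.data[idx...] = v
    return A
end

# Usage: a two-way frequency table over strings and symbols.
fruits = ["apple", "pear", "apple", "fig"]
colors = [:red, :green, :green, :green]
A = AssocTable(Int, sort(unique(fruits)), sort(unique(colors)))
for (f, c) in zip(fruits, colors)
    A[f, c] += 1
end
```

Because the key type of each dimension is fixed when the dictionaries are built, the keys cannot change type afterwards, which mirrors the restriction David describes for NamedArray. Any type that supports `d[key]` and returns an integer could stand in for `Dict` here, in the spirit of Milan's suggestion.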
