Re: [julia-users] Re: what's the best way to do R table() in julia? (why does StatsBase.count(x,k) need k?)

John Myles White Tue, 02 Dec 2014 09:52:31 -0800

http://en.wikipedia.org/wiki/Associative_array


 -- John

On Dec 2, 2014, at 9:50 AM, Ivar Nesje <[email protected]> wrote:

> It's not the obvious choice to me either, but it is in the docs, and has been 
> since I read it the first time 1.5 years ago.
> 
> On Tuesday, December 2, 2014 at 16:10:34 UTC+1, David van Leeuwen wrote:
> Thanks, 
> 
> On Tuesday, December 2, 2014 3:23:49 PM UTC+1, Ivar Nesje wrote:
> I think the proposed AbstractDict is the same abstraction that we currently 
> call Associative.
> 
> Ah---never too late to learn something new.  I've seen the AbstractString vs 
> Integer discussion, but didn't realize Associative fit into this pattern (I 
> can't even find Associative in my copy of the documentation). 
> 
> I could try to replace all the references to Dict by Associative, then, and 
> see what happens.  
> 
> ---david 
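
A minimal sketch of what replacing Dict by the abstract type buys, assuming a current Julia release where the abstraction called Associative in this thread is named AbstractDict. The function name `name_to_index` is hypothetical and not taken from NamedArrays.jl.

```julia
# Accept any dictionary-like container, not only the concrete Dict type.
# (`name_to_index` is a made-up name for illustration.)
name_to_index(names::AbstractDict, key) = names[key]

hashed      = Dict(:a => 1, :b => 2)     # hash-based lookup
by_identity = IdDict(:a => 1, :b => 2)   # another AbstractDict subtype from Base

# The same method serves both containers.
println(name_to_index(hashed, :b))
println(name_to_index(by_identity, :b))
```

Any code written against the abstract type keeps working when a different dictionary implementation is passed in.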
> On Tuesday, December 2, 2014 at 14:29:39 UTC+1, David van Leeuwen wrote:
> Hi, 
> 
> On Sunday, November 30, 2014 11:22:39 AM UTC+1, Milan Bouchet-Valat wrote:
> On Wednesday, November 26, 2014 at 09:30 -0800, David van Leeuwen wrote: 
> > Hello again, 
> > 
> > 
> > I worked hard on NamedArrays.jl to solve the problems indicated below: 
> > 
> > On Monday, November 10, 2014 1:43:57 AM UTC+1, Dahua Lin wrote: 
> >         NamedArrays.jl generally goes along this way. However, it 
> >         remains limited in two aspects: 
> >         
> >         
> >         1. Some fields in NamedArrays are not declared of specific 
> >         types. In particular, the field `dicts` is of the type 
> >         `Vector{Dict}`, and the use of this field is on the critical 
> >         path when looping over the table, e.g. when counting. This 
> >         would potentially lead to substantial impact on performance. 
> >         
> >         
> > A NamedArray is now parameterized by the complete set of Dicts that 
> > are used for the indices.  It took me a while to get the constructors 
> > right; in intermediate stages of the development I ended up with 
> > VarType parameters of NamedArray.   
> >   
> >         
> >         2. Currently, it only accepts a limited set of types for 
> >         indices, e.g. Real and String. But in some cases, people may 
> >         go beyond this. I don't think we have to impose this limit. 
> >         
> >         
> > The indexing code is completely overhauled now, and the indices() 
> > methods are now explicitly parameterized by the dictionary key type, 
> > so calling them should be efficient.  It should now be possible to index a 
> > NamedArray with any type, although some types (AbstractVector, Range, 
> > Int) are interpreted specially.   
> > 
> > 
> > As a consequence, the type of the key for the indices cannot be 
> > altered after initialization of a NamedArray (the names themselves 
> > still can).  Thus, if you want other types than ASCIIString (which is 
> > used to give default names to indices), you need to call a constructor 
> > with your names prepared instead of filling them in afterwards. 
> > 
> > 
> > You can try the code for julia-0.3 with Pkg.checkout("NamedArrays"), 
> > or read it at Github. 
> This looks cool. Have you considered allowing any object other than Dict 
> to be passed at construction? This was requested by Simon here (and 
> comments below): 
> https://github.com/JuliaStats/StatsBase.jl/issues/32#issuecomment-43443093 
> 
> I haven't considered that yet.  I've restructured the indexing since, and 
> removing all function prototype ambiguities has become a headache.  I fixed 
> it for julia-0.3 this morning, but now 0.4-dev gives me another gazillion 
> ambiguities... 
>  
> The idea is that any type could be used instead of a Dict, as long as it 
> can be indexed with a key and return the index. For small NamedArrays, 
> doing a linear search on an array is faster than using a Dict. And when
> 
> Would this not be better to solve at a lower level, i.e., by introducing an 
> AbstractDict, and let Dict <: AbstractDict?
>  
> 
> computing frequency tables from PooledDataArrays, we could reuse the 
> existing pool instead of creating a Dict from it, it would save some 
> memory. 
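
The idea of passing something other than a Dict, as long as it maps a key to an index, could be sketched as follows. This is an illustration written in current Julia syntax, not code from NamedArrays.jl; the type `LinearLookup` and the helper `index_of` are hypothetical names.

```julia
# Hypothetical type: maps a name to its position by scanning a vector
# instead of hashing, which can be cheaper for a handful of names.
struct LinearLookup{K}
    names::Vector{K}
end

# Same calling convention as indexing a Dict{K,Int}: lookup[key] gives an integer index.
function Base.getindex(l::LinearLookup, key)
    i = findfirst(isequal(key), l.names)
    i === nothing && throw(KeyError(key))
    return i
end

Base.haskey(l::LinearLookup, key) = any(isequal(key), l.names)

rows    = LinearLookup(["apple", "banana", "cherry"])
as_dict = Dict("apple" => 1, "banana" => 2, "cherry" => 3)

# Code that only ever writes `lookup[key]` does not care which container it gets.
index_of(lookup, key) = lookup[key]

println(index_of(rows, "cherry"))
println(index_of(as_dict, "cherry"))
```

Whether the linear scan actually beats a Dict depends on the number of names and the key type, so it would need benchmarking on real tables.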
> 
> and is the pool interface the same as a Dict interface, then?
>  
> 
> Also, John suggested that the array that a NamedArray wraps could be of 
> any AbstractArray type, not just Array. Sounds like a good idea (e.g. to 
> wrap a sparse matrix). 
> 
> Oh yes, that is a good idea.  It might be that some functions defined for 
> Array are not defined for other AbstractArray types, where the current 
> implementation assumes this.  Also---this would allow for a 
> NamedArray(::NamedArray), which, I would guess, leads to another level of 
> ambiguities in the implementation. 
> 
> ---david
> 
> Regards 
> 
> > 
> > Cheers, 
> > 
> > 
> > ---david 
> >   
> >         
> >         Dahua 
> >         
> >         
> >         On Monday, November 10, 2014 8:35:32 AM UTC+8, Dahua Lin 
> >         wrote: 
> >                 I have been observing an interesting difference 
> >                 between people coming from stats and machine learning. 
> >                 
> >                 
> >                 Stats people tend to favor the approach that allows 
> >                 one to directly use the category names to index the 
> >                 table, e.g. A["apple"]. This tendency is clearly 
> >                 reflected in the design of R, where one can attach a 
> >                 name to everything. 
> >                 
> >                 
> >                 In machine learning practice, by contrast, it is a 
> >                 common convention to encode categories as integers 
> >                 and simply use an ordinary array to represent a 
> >                 counting table. Although this is a little bit 
> >                 inconvenient in an interactive environment, it 
> >                 is generally more efficient when you have to deal with 
> >                 these categories over a large number of samples. 
> >                 
> >                 
> >                 These differences aside, I believe, however, that 
> >                 there exists a very generic approach to this problem -- 
> >                 a multi-dimensional associative map, which allows one 
> >                 to write A[i1, i2, ...] where the indices can be 
> >                 arbitrary hashable & equality-comparable instances, 
> >                 including integers, strings, symbols, among many other 
> >                 things. 
> >                 
> >                 
> >                 A multi-dimensional associative map can be considered 
> >                 as a multi-dimensional generalization of dictionaries, 
> >                 which can be easily implemented via a 
> >                 multidimensional array and several dictionaries, each 
> >                 for one dimension, to map user-side indexes to integer 
> >                 indexes. 
> >                 
> >                 
> >                 - Dahua 
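
The construction described above (one ordinary array plus one dictionary per dimension) might look like this for the two-dimensional case. It is only a sketch in current Julia syntax; `AssocTable` and its field names are hypothetical and are not the NamedArrays.jl implementation.

```julia
# Hypothetical illustration type: a 2-D count table addressed by arbitrary keys.
struct AssocTable{R,C}
    data::Matrix{Int}        # the underlying ordinary array
    rowindex::Dict{R,Int}    # user-side row key    -> integer row
    colindex::Dict{C,Int}    # user-side column key -> integer column
end

# Build an all-zero table from the lists of row and column keys.
function AssocTable(rowkeys::Vector{R}, colkeys::Vector{C}) where {R,C}
    rowindex = Dict{R,Int}(k => i for (i, k) in enumerate(rowkeys))
    colindex = Dict{C,Int}(k => j for (j, k) in enumerate(colkeys))
    data = zeros(Int, length(rowkeys), length(colkeys))
    return AssocTable{R,C}(data, rowindex, colindex)
end

# A[i1, i2]: translate each key to an integer, then index the plain array.
Base.getindex(t::AssocTable, r, c) = t.data[t.rowindex[r], t.colindex[c]]

function Base.setindex!(t::AssocTable, v, r, c)
    t.data[t.rowindex[r], t.colindex[c]] = v
    return t
end

t = AssocTable(["apple", "pear"], [:red, :green])
t["apple", :red] += 1
t["apple", :red] += 1
t["pear", :green] += 1
println(t.data)
```

Because the key types are type parameters, the dictionaries are concretely typed, which is the performance concern raised earlier about a `Vector{Dict}` field.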
> >                 
> >                 
> >                 
> >                 
> >                 
> >                 
> >                 
> >                 On Monday, November 10, 2014 8:12:54 AM UTC+8, David 
> >                 van Leeuwen wrote: 
> >                         Hi, 
> >                         
> >                         On Sunday, November 9, 2014 5:10:19 PM UTC+1, 
> >                         Milan Bouchet-Valat wrote: 
> >                                 Actually I didn't do it because 
> >                                 NamedArrays.jl didn't work well on 0.3 
> >                                 when I first worked on the package. 
> >                                 Now I see the tests are still failing. 
> >                                 Do you know what is needed to make 
> >                                 them work? 
> >                                 
> >                                 
> >                         What exactly is not working? Could you maybe 
> >                         file an issue?  Travis tells me all is fine. 
> >                         
> >                         
> >                         ---david 
> >                           
> >                                 Another point is that I think this 
> >                                 deserves going into StatsBase, but 
> >                                 before that we need everybody to agree 
> >                                 on a design for NamedArrays. 
> >                                 
> >                                 Regards 
> >                                 
> >                                 
> >                                 > On Sunday, November 9, 2014 4:26:45 
> >                                 > PM UTC+1, Milan Bouchet-Valat wrote: 
> >                                 >         On Thursday, November 6, 2014 at 
> >                                 >         11:17 -0800, Conrad Stack 
> >                                 >         wrote: 
> >                                 >         > I was also looking for a 
> >                                 >         > function like this, but 
> >                                 >         > could not find one in 
> >                                 >         > docs.julialang.org.  I was 
> >                                 >         > doing this (v0.4.0-dev), 
> >                                 >         > for anyone who is 
> >                                 >         > interested: 
> >                                 >         > 
> >                                 >         > 
> >                                 >         > example = rand(1:10,100) 
> >                                 >         > uexample = sort(unique(example)) 
> >                                 >         > counts = map(x->count(y->x==y,example),uexample) 
> >                                 >         > 
> >                                 >         > 
> >                                 >         > It's pretty ugly, so 
> >                                 >         > thanks, Johan, for 
> >                                 >         > pointing out the 
> >                                 >         > StatsBase->countmap 
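
The snippet above rescans the whole vector once per distinct value. A single-pass alternative using only a Dict from Base could be sketched like this, in current Julia syntax; `tally` is a hypothetical helper name, not a StatsBase function.

```julia
# Count occurrences of each distinct value in one pass over the data.
function tally(xs)
    counts = Dict{eltype(xs),Int}()
    for x in xs
        counts[x] = get(counts, x, 0) + 1   # start from 0 for unseen values
    end
    return counts
end

example = [3, 1, 3, 2, 3, 1]
freq = tally(example)

# Print the table in sorted key order.
for k in sort(collect(keys(freq)))
    println(k, " => ", freq[k])
end
```

This is roughly the shape of result that countmap returns: a dictionary from value to count, with no need to know the range of values in advance.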
> >                                 >         I've also put together a 
> >                                 >         small package precisely 
> >                                 >         aimed at offering an 
> >                                 >         equivalent of R's table(): 
> >                                 >         
> > https://github.com/nalimilan/Tables.jl 
> >                                 >         
> >                                 >         But there's a more general 
> >                                 >         issue about how to handle 
> >                                 >         arrays with dimension names 
> >                                 >         in Julia. NamedArrays.jl 
> >                                 >         (which is used in my 
> >                                 >         package) attempts to tackle 
> >                                 >         this issue, but I don't 
> >                                 >         think we've reached a 
> >                                 >         consensus yet about the best 
> >                                 >         solution. 
> >                                 >         
> >                                 >         
> >                                 >         Regards 
> >                                 >         
> >                                 >         > 
> >                                 >         > 
> >                                 >         > 
> >                                 >         > On Sunday, August 17, 2014 
> >                                 >         > 9:56:29 AM UTC-4, Johan 
> >                                 >         > Sigfrids wrote: 
> >                                 >         >         I think countmap 
> >                                 >         >         comes closest to 
> >                                 >         >         giving you what 
> >                                 >         >         you want: 
> >                                 >         >         
> >                                 >         >         using StatsBase 
> >                                 >         >         data = 
> >                                 >         >         sample(["a", "b", 
> >                                 >         >         "c"], 20) 
> >                                 >         >         countmap(data) 
> >                                 >         >         
> >                                 >         >         
> >                                 >         >         Dict{ASCIIString,Int64} with 3 entries: 
> >                                 >         >           "c" => 3 
> >                                 >         >           "b" => 10 
> >                                 >         >           "a" => 7 
> >                                 >         >         
> >                                 >         >         On Sunday, August 
> >                                 >         >         17, 2014 4:45:21 
> >                                 >         >         PM UTC+3, Florian 
> >                                 >         >         Oswald wrote: 
> >                                 >         >                 Hi 
> >                                 >         >                 
> >                                 >         >                 I'm looking for the best way to count how many 
> >                                 >         >                 times a certain value x_i appears in vector x, 
> >                                 >         >                 where x could be integers, floats, strings. In R 
> >                                 >         >                 I would do table(x). I found StatsBase.counts(x,k) 
> >                                 >         >                 but I'm a bit confused by k (where k goes into 1:k, 
> >                                 >         >                 i.e. the vector is scanned to find how many elements 
> >                                 >         >                 locate at each point of 1:k). most of the times I 
> >                                 >         >                 don't know k, and in fact I would do table(x) just 
> >                                 >         >                 to find out what k was. Apart from that, I don't 
> >                                 >         >                 think I could use this with strings, as I can't 
> >                                 >         >                 construct a range object from strings. 
> >                                 >         >                 
> >                                 >         >                 I'm wondering whether a method 
> >                                 >         >                 StatsBase.counts(x::Vector) just returning the 
> >                                 >         >                 frequency of each element appearing would be useful. 
> >                                 >         >                 
> >                                 >         >                 The same applies to Base.hist if I understand 
> >                                 >         >                 correctly. I just don't want to have to specify 
> >                                 >         >                 the edges of bins. 
> >                                 >         >                 
> >  
> ...
