Re: [julia-users] Re: what's the best way to do R table() in julia? (why does StatsBase.count(x,k) need k?)

Milan Bouchet-Valat Mon, 10 Nov 2014 02:02:44 -0800

On Sunday, 9 November 2014 at 23:48 -0800, David van Leeuwen wrote:
> Hello, 
> 
> On Monday, November 10, 2014 1:43:57 AM UTC+1, Dahua Lin wrote:
>         NamedArrays.jl generally goes along this way. However, it
>         remains limited in two aspects:
>         
>         
>         1. Some fields in NamedArrays are not declared of specific
>         types. In particular, the field `dicts` is of the type
>         `Vector{Dict}`, and the use of this field is on the critical
>         path when looping over the table, e.g. when counting. This
>         would potentially lead to substantial impact on performance.
>         
>         
> In the beginning I have been experimenting with indexing speed, mainly
> to sort out the various forms of getindex(), and although I don't
> remember the exact result, I do remember that I found the drop in
> performance w.r.t. integer indexing surprisingly small. 
> 
> 
> I suppose the problem you indicate can be alleviated by making
> NamedArray parameterized by the type of the key in the dict as well.  
Right. Sounds reasonable.
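
To illustrate the idea, here is a minimal one-dimensional sketch. The type
`KeyedVector` and its fields are hypothetical names made up for this example,
not the actual NamedArrays.jl definitions, and the code uses current Julia
syntax rather than the 0.3 syntax. The point is only that making the key type
a type parameter gives the dictionary field a concrete type:

```julia
# Hypothetical illustration; not the actual NamedArrays.jl definition.
# The key type K is a type parameter, so the lookup dictionary has a
# concrete type that the compiler can specialize on.
struct KeyedVector{T,K}
    data::Vector{T}
    index::Dict{K,Int}   # concretely typed, unlike a field of type Vector{Dict}
end

# Build the label => position dictionary from a vector of labels.
function KeyedVector(data::Vector{T}, labels::Vector{K}) where {T,K}
    length(data) == length(labels) ||
        throw(DimensionMismatch("one label per element is required"))
    index = Dict{K,Int}()
    for (i, label) in enumerate(labels)
        index[label] = i
    end
    return KeyedVector{T,K}(data, index)
end

# Indexing by label: one dictionary lookup, then an ordinary array access.
Base.getindex(v::KeyedVector{T,K}, key::K) where {T,K} = v.data[v.index[key]]

v = KeyedVector([7, 10, 3], ["a", "b", "c"])
v["b"]   # value stored under the label "b"
```

A full version would need one key type parameter per dimension, which is part
of what makes the design discussion non-trivial.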


>         2. Currently, it only accepts a limited set of types for
>         indices, e.g. Real and String. But in some cases, people may
>         go beyond this. I don't think we have to impose this limit. 
>         
>         
> Ah---I now see what you mean.  I thought I had built in support for
> all types as index, but there obviously is no catch-all rule in
> getindex.  I suppose NamedArray needs an update there. 
I think the last time I looked into this, it was a problem even for
efficiently indexing AbstractArrays:
https://github.com/JuliaLang/julia/pull/4892#issuecomment-31087910

Slow catch-all methods are good, but if we want specialized versions it
will probably need more work. If you want to accept combinations of
Int/String/Complement{T}/anything, the number of specialized methods to
generate explodes. I think the conclusion was that we needed to wait for
staged functions. Since they are implemented now, it may be a good time
to look into this issue for both AbstractArrays and NamedArrays.
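
As a rough sketch of what this could look like (an assumption on my part, not
code from Base or NamedArrays.jl): staged functions are available in current
Julia as `@generated` functions, and a single generated `getindex` can cover
every combination of index types. `LabeledArray` and `toindex` are
hypothetical names used only for this example.

```julia
# Hypothetical sketch; `LabeledArray` and `toindex` are illustrative names,
# not part of Base or NamedArrays.jl.
struct LabeledArray{T,N,D<:NTuple{N,Dict}}
    data::Array{T,N}
    dicts::D    # one label => position dictionary per dimension
end

# Per-dimension conversion, selected by dispatch on the index type.
toindex(::Dict, i::Int) = i        # integers are treated as plain positions
toindex(d::Dict, key) = d[key]     # general catch-all: dictionary lookup

# One generated method instead of one hand-written method per combination
# of index types: the body is built once for each concrete signature.
@generated function Base.getindex(a::LabeledArray{T,N}, inds::Vararg{Any,N}) where {T,N}
    lookups = Expr[]
    for d in 1:N
        push!(lookups, :(toindex(a.dicts[$d], inds[$d])))
    end
    return :(a.data[$(lookups...)])
end

counts = [3 1; 0 5]
a = LabeledArray(counts, (Dict("apple" => 1, "pear" => 2),
                          Dict(:red => 1, :green => 2)))
a["pear", :green]   # label lookup in both dimensions
a["apple", 2]       # labels and integer positions can be mixed
```

Note the design choice that an `Int` index is always a position, so integer
labels would need a wrapper type. Supporting ranges, `Complement{T}` and the
like would mean adding further `toindex` methods.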


Regards

>         On Monday, November 10, 2014 8:35:32 AM UTC+8, Dahua Lin
>         wrote:
>                 I have been observing an interesting difference
>                 between people coming from stats and machine learning.
>                 
>                 
>                 Stats people tend to favor the approach that allows
>                 one to directly use the category names to index the
>                 table, e.g. A["apple"]. This tendency is clearly
>                 reflected in the design of R, where one can attach a
>                 name to everything.
>                 
>                 
>                 While in machine learning practice, it is a common
>                 convention to just encode categories into integers,
>                 and simply use an ordinary array to represent a
>                 counting table. Whereas it makes it a little bit
>                 inconvenient in an interactive environment, this way
>                 is generally more efficient when you have to deal with
>                 these categories over a large number of samples.
>                 
>                 
>                 These differences aside, I believe, however, that
>                 there exists a very generic approach to this problem --
>                 a multi-dimensional associative map, which allows one
>                 to write A[i1, i2, ...] where the indices can be
>                 arbitrary hashable & equality-comparable instances,
>                 including integers, strings, symbols, among many other
>                 things.
>                 
>                 
>                 A multi-dimensional associative map can be considered
>                 as a multi-dimensional generalization of dictionaries,
>                 which can be easily implemented via a
>                 multidimensional array and several dictionaries, each
>                 for one dimension, to map user-side indexes to integer
>                 indexes. 
>                 
>                 
>                 - Dahua
>                 
>                 
>                 
>                 
>                 
>                 
>                 
>                 On Monday, November 10, 2014 8:12:54 AM UTC+8, David
>                 van Leeuwen wrote:
>                         Hi, 
>                         
>                         On Sunday, November 9, 2014 5:10:19 PM UTC+1,
>                         Milan Bouchet-Valat wrote:
>                                 Actually I didn't do it because
>                                 NamedArrays.jl didn't work well on 0.3
>                                 when I first worked on the package.
>                                 Now I see the tests are still failing.
>                                 Do you know what is needed to make
>                                 them work?
>                                 
>                                 
>                         What exactly is not working? Could you maybe
>                         file an issue?  Travis tells me all is fine. 
>                         
>                         
>                         ---david
>                          
>                                 Another point is that I think this
>                                 deserves going into StatsBase, but
>                                 before that we need everybody to agree
>                                 on a design for NamedArrays.
>                                 
>                                 Regards
>                                 
>                                 
>                                 > On Sunday, November 9, 2014 4:26:45
>                                 > PM UTC+1, Milan Bouchet-Valat wrote:
>                                 >         On Thursday, 6 November 2014
>                                 >         at 11:17 -0800, Conrad Stack
>                                 >         wrote:
>                                 >         > I was also looking for a
>                                 >         > function like this, but
>                                 >         > could not find one in
>                                 >         > docs.julialang.org.  I was
>                                 >         > doing this (v0.4.0-dev),
>                                 >         > for anyone who is
>                                 >         > interested:
>                                 >         > 
>                                 >         > 
>                                 >         > example = rand(1:10,100)
>                                 >         > uexample = sort(unique(example))
>                                 >         > counts = map(x->count(y->x==y,example),uexample)
>                                 >         > 
>                                 >         > 
>                                 >         > It's pretty ugly, so
>                                 >         > thanks, Johan, for
>                                 >         > pointing out the
>                                 >         > StatsBase->countmap 
>                                 >         I've also put together a
>                                 >         small package precisely
>                                 >         aimed at offering an
>                                 >         equivalent of R's table():
>                                 >         
>                                 >         https://github.com/nalimilan/Tables.jl
>                                 >         
>                                 >         But there's a more general
>                                 >         issue about how to handle
>                                 >         arrays with dimension names
>                                 >         in Julia. NamedArrays.jl
>                                 >         (which is used in my
>                                 >         package) attempts to tackle
>                                 >         this issue, but I don't
>                                 >         think we've reached a
>                                 >         consensus yet about the best
>                                 >         solution.
>                                 >         
>                                 >         
>                                 >         Regards
>                                 >         
>                                 >         > 
>                                 >         > 
>                                 >         > 
>                                 >         > On Sunday, August 17, 2014
>                                 >         > 9:56:29 AM UTC-4, Johan
>                                 >         > Sigfrids wrote:
>                                 >         >         I think countmap
>                                 >         >         comes closest to
>                                 >         >         giving you what
>                                 >         >         you want:
>                                 >         >         
>                                 >         >         using StatsBase
>                                 >         >         data =
>                                 >         >         sample(["a", "b",
>                                 >         >         "c"], 20)
>                                 >         >         countmap(data)
>                                 >         >         
>                                 >         >         
>                                 >         >         Dict{ASCIIString,Int64} with 3 entries:
>                                 >         >           "c" => 3
>                                 >         >           "b" => 10
>                                 >         >           "a" => 7
>                                 >         >         
>                                 >         >         On Sunday, August
>                                 >         >         17, 2014 4:45:21
>                                 >         >         PM UTC+3, Florian
>                                 >         >         Oswald wrote: 
>                                 >         >                 Hi 
>                                 >         >                 
>                                 >         >                 
>                                 >         >                 I'm looking for the best way to count how many times a
>                                 >         >                 certain value x_i appears in vector x, where x could be
>                                 >         >                 integers, floats, strings. In R I would do table(x).
>                                 >         >                 I found StatsBase.counts(x,k) but I'm a bit confused by
>                                 >         >                 k (where k goes into 1:k, i.e. the vector is scanned to
>                                 >         >                 find how many elements locate at each point of 1:k).
>                                 >         >                 Most of the time I don't know k, and in fact I would do
>                                 >         >                 table(x) just to find out what k was. Apart from that,
>                                 >         >                 I don't think I could use this with strings, as I can't
>                                 >         >                 construct a range object from strings.
>                                 >         >
>                                 >         >                 I'm wondering whether a method
>                                 >         >                 StatsBase.counts(x::Vector) just returning the
>                                 >         >                 frequency of each element appearing would be useful.
>                                 >         >
>                                 >         >                 The same applies to Base.hist if I understand
>                                 >         >                 correctly. I just don't want to have to specify the
>                                 >         >                 edges of bins.
>                                 >         >                 
>                                 >         >                 
>                                 >         >                 
>                                 >         >                 
>                                 >         
>                                 >         
>                                 
>
