On Wednesday, 26 November 2014 at 09:30 -0800, David van Leeuwen wrote:
> Hello again,
> 
> 
> I worked hard on NamedArrays.jl to solve the problems indicated below:
> 
> On Monday, November 10, 2014 1:43:57 AM UTC+1, Dahua Lin wrote:
>         NamedArrays.jl generally goes along this way. However, it
>         remains limited in two aspects:
>         
>         
>         1. Some fields in NamedArrays are not declared of specific
>         types. In particular, the field `dicts` is of the type
>         `Vector{Dict}`, and the use of this field is on the critical
>         path when looping over the table, e.g. when counting. This
>         could have a substantial impact on performance.
>         
>         
> A NamedArray is now parameterized by the complete set of Dicts that
> are used for the indices.  It took me a while to get the constructors
> right; in intermediate stages of the development I ended up with
> VarType parameters of NamedArray.
>  
>         
>         2. Currently, it only accepts a limited set of types for
>         indices, e.g. Real and String. But in some cases, people may
>         go beyond this. I don't think we have to impose this limit. 
>         
>         
> The indexing code is completely overhauled now, and the indices()
> methods are now explicitly parameterized by the dictionary key type,
> so calling them should be efficient.  It should now be possible to index a
> NamedArray with any type, although some types (AbstractVector, Range,
> Int) are interpreted specially.
> 
> 
> As a consequence, the type of the key for the indices cannot be
> altered after initialization of a NamedArray (the names themselves
> still can).  Thus, if you want other types than ASCIIString (which is
> used to give default names to indices), you need to call a constructor
> with your names prepared instead of filling them in afterwards. 
> 
> 
> You can try the code for julia-0.3 with Pkg.checkout("NamedArrays"),
> or read it on GitHub.
This looks cool. Have you considered allowing any object other than Dict
to be passed at construction? This was requested by Simon here (and
comments below):
https://github.com/JuliaStats/StatsBase.jl/issues/32#issuecomment-43443093

The idea is that any type could be used instead of a Dict, as long as it
can be indexed with a key and returns the index. For small NamedArrays,
doing a linear search on an array is faster than using a Dict. And when
computing frequency tables from PooledDataArrays, we could reuse the
existing pool instead of creating a Dict from it, which would save some
memory.
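
A minimal sketch of this idea, assuming Julia 1.x syntax (not the 0.3
syntax current in this thread). The type `LinearLookup` is hypothetical
and not part of NamedArrays.jl; it only illustrates that anything which
can be indexed with a key and returns a position could stand in for a
Dict:

```julia
# Hypothetical helper, NOT part of NamedArrays.jl: find a key by linear
# search over a vector and return its integer position.
struct LinearLookup{K}
    keys::Vector{K}
end

# Indexing with a key returns the key's position, as a Dict{K,Int} would.
function Base.getindex(l::LinearLookup{K}, key::K) where {K}
    i = findfirst(isequal(key), l.keys)
    i === nothing && throw(KeyError(key))
    return i
end

labels = ["a", "b", "c"]
lin = LinearLookup(labels)
dict = Dict(label => i for (i, label) in enumerate(labels))

# Both objects answer the same question: at which position is "b"?
@assert lin["b"] == dict["b"] == 2
```

Whether the linear search actually beats the Dict depends on the number
of names and the key type, so it would need benchmarking.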


Also, John suggested that the array that a NamedArray wraps could be of
any AbstractArray type, not just Array. Sounds like a good idea (e.g. to
wrap a sparse matrix).
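
A rough sketch of what this could look like, again assuming Julia 1.x
syntax. `NamedWrapper` is a hypothetical type for illustration, not the
actual NamedArray definition; the point is only that the wrapped array
type and the per-dimension lookups are both type parameters:

```julia
using SparseArrays  # standard library in Julia 1.x

# Hypothetical wrapper, NOT the actual NamedArray definition: the wrapped
# array type A and the lookup types L are type parameters, so any
# AbstractArray (dense, sparse, ...) can be stored without conversion.
struct NamedWrapper{T,N,A<:AbstractArray{T,N},L<:Tuple} <: AbstractArray{T,N}
    array::A
    lookups::L   # one key => position lookup per dimension
end

NamedWrapper(a::AbstractArray{T,N}, lookups::Tuple) where {T,N} =
    NamedWrapper{T,N,typeof(a),typeof(lookups)}(a, lookups)

Base.size(w::NamedWrapper) = size(w.array)

# Integer indexing is forwarded to the wrapped array.
Base.getindex(w::NamedWrapper{T,N}, I::Vararg{Int,N}) where {T,N} = w.array[I...]

# Name-based indexing for the two-dimensional case: translate, then forward.
Base.getindex(w::NamedWrapper{T,2}, r::AbstractString, c::AbstractString) where {T} =
    w.array[w.lookups[1][r], w.lookups[2][c]]

s = sparse([1, 2], [1, 3], [10.0, 20.0], 2, 3)   # 2x3 sparse matrix
rows = Dict("x" => 1, "y" => 2)
cols = Dict("a" => 1, "b" => 2, "c" => 3)
w = NamedWrapper(s, (rows, cols))

@assert w["y", "c"] == 20.0
@assert w.array isa SparseMatrixCSC   # still sparse, no dense copy was made
```

Any object that maps a key to a position could be used in `lookups`
in place of the Dicts shown here.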


Regards

> 
> Cheers, 
> 
> 
> ---david
>  
>         
>         Dahua
>         
>         
>         On Monday, November 10, 2014 8:35:32 AM UTC+8, Dahua Lin
>         wrote:
>                 I have been observing an interesting difference
>                 between people coming from stats and machine learning.
>                 
>                 
>                 Stats people tend to favor the approach that allows
>                 one to directly use the category names to index the
>                 table, e.g. A["apple"]. This tendency is clearly
>                 reflected in the design of R, where one can attach a
>                 name to everything.
>                 
>                 
>                 In machine learning practice, by contrast, it is a
>                 common convention to just encode categories into
>                 integers, and simply use an ordinary array to represent
>                 a counting table. Although this is a little
>                 inconvenient in an interactive environment, it is
>                 generally more efficient when you have to deal with
>                 these categories over a large number of samples.
>                 
>                 
>                 These differences aside, I believe, however, that
>                 there exists a very generic approach to this problem --
>                 a multi-dimensional associative map, which allows one
>                 to write A[i1, i2, ...] where the indices can be
>                 arbitrary hashable & equality-comparable instances,
>                 including integers, strings, symbols, among many other
>                 things.
>                 
>                 
>                 A multi-dimensional associative map can be considered
>                 as a multi-dimensional generalization of dictionaries,
>                 which can be easily implemented via a
>                 multidimensional array and several dictionaries, each
>                 for one dimension, to map user-side indexes to integer
>                 indexes. 
>                 
>                 
>                 - Dahua
>                 
>                 
>                 
>                 
>                 
>                 
>                 
>                 On Monday, November 10, 2014 8:12:54 AM UTC+8, David
>                 van Leeuwen wrote:
>                         Hi, 
>                         
>                         On Sunday, November 9, 2014 5:10:19 PM UTC+1,
>                         Milan Bouchet-Valat wrote:
>                                 Actually I didn't do it because
>                                 NamedArrays.jl didn't work well on 0.3
>                                 when I first worked on the package.
>                                 Now I see the tests are still failing.
>                                 Do you know what is needed to make
>                                 them work?
>                                 
>                                 
>                         What exactly is not working? Could you maybe
>                         file an issue?  Travis tells me all is fine. 
>                         
>                         
>                         ---david
>                          
>                                 Another point is that I think this
>                                 deserves going into StatsBase, but
>                                 before that we need everybody to agree
>                                 on a design for NamedArrays.
>                                 
>                                 Regards
>                                 
>                                 
>                                 > On Sunday, November 9, 2014 4:26:45
>                                 > PM UTC+1, Milan Bouchet-Valat wrote:
>                                 >         On Thursday, 6 November 2014 at
>                                 >         11:17 -0800, Conrad Stack
>                                 >         wrote:
>                                 >         > I was also looking for a
>                                 >         > function like this, but
>                                 >         > could not find one in
>                                 >         > docs.julialang.org.  I was
>                                 >         > doing this (v0.4.0-dev),
>                                 >         > for anyone who is
>                                 >         > interested:
>                                 >         > 
>                                 >         > 
>                                 >         > example = rand(1:10,100)
>                                 >         > uexample = sort(unique(example))
>                                 >         > counts = map(x->count(y->x==y,example),uexample)
>                                 >         > 
>                                 >         > 
>                                 >         > It's pretty ugly, so
>                                 >         > thanks, Johan, for
>                                 >         > pointing out the
>                                 >         > StatsBase->countmap 
>                                 >         I've also put together a
>                                 >         small package precisely
>                                 >         aimed at offering an
>                                 >         equivalent of R's table():
>                                 >         
>                                 >         https://github.com/nalimilan/Tables.jl
>                                 >         
>                                 >         But there's a more general
>                                 >         issue about how to handle
>                                 >         arrays with dimension names
>                                 >         in Julia. NamedArrays.jl
>                                 >         (which is used in my
>                                 >         package) attempts to tackle
>                                 >         this issue, but I don't
>                                 >         think we've reached a
>                                 >         consensus yet about the best
>                                 >         solution.
>                                 >         
>                                 >         
>                                 >         Regards
>                                 >         
>                                 >         > 
>                                 >         > 
>                                 >         > 
>                                 >         > On Sunday, August 17, 2014
>                                 >         > 9:56:29 AM UTC-4, Johan
>                                 >         > Sigfrids wrote:
>                                 >         >         I think countmap comes closest to giving you what you want:
>                                 >         >         
>                                 >         >         using StatsBase
>                                 >         >         data = sample(["a", "b", "c"], 20)
>                                 >         >         countmap(data)
>                                 >         >         
>                                 >         >         
>                                 >         >         Dict{ASCIIString,Int64} with 3 entries:
>                                 >         >           "c" => 3
>                                 >         >           "b" => 10
>                                 >         >           "a" => 7
>                                 >         >         
>                                 >         >         On Sunday, August
>                                 >         >         17, 2014 4:45:21
>                                 >         >         PM UTC+3, Florian
>                                 >         >         Oswald wrote: 
>                                 >         >                 Hi 
>                                 >         >                 
>                                 >         >                 
>                                 >         >                 I'm looking for the best way to count how many times a
>                                 >         >                 certain value x_i appears in vector x, where x could be
>                                 >         >                 integers, floats, strings. In R I would do table(x). I
>                                 >         >                 found StatsBase.counts(x,k) but I'm a bit confused by k
>                                 >         >                 (where k goes into 1:k, i.e. the vector is scanned to
>                                 >         >                 find how many elements fall at each point of 1:k). Most
>                                 >         >                 of the time I don't know k, and in fact I would do
>                                 >         >                 table(x) just to find out what k was. Apart from that, I
>                                 >         >                 don't think I could use this with strings, as I can't
>                                 >         >                 construct a range object from strings.
>                                 >         >                 
>                                 >         >                 
>                                 >         >                 I'm wondering whether a method
>                                 >         >                 StatsBase.counts(x::Vector) that just returns the
>                                 >         >                 frequency of each element would be useful.
>                                 >         >                 
>                                 >         >                 
>                                 >         >                 The same applies to Base.hist if I understand correctly.
>                                 >         >                 I just don't want to have to specify the edges of bins.
>                                 >         >                 
>                                 >         >                 
>                                 >         >                 
>                                 >         >                 
>                                 >         
>                                 >         
>                                 
>                                 
