On Sunday, November 9, 2014 at 23:48 -0800, David van Leeuwen wrote:
> Hello,
>
> On Monday, November 10, 2014 1:43:57 AM UTC+1, Dahua Lin wrote:
> NamedArrays.jl generally goes along this way. However, it
> remains limited in two aspects:
>
>
> 1. Some fields in NamedArrays are not declared of specific
> types. In particular, the field `dicts` is of the type
> `Vector{Dict}`, and the use of this field is on the critical
> path when looping over the table, e.g. when counting. This
> would potentially lead to substantial impact on performance.
>
>
> In the beginning I have been experimenting with indexing speed, mainly
> to sort out the various forms of getindex(), and although I don't
> remember the exact result, I do remember that I found the drop in
> performance w.r.t. integer indexing surprisingly small.
>
>
> I suppose the problem you indicate can be alleviated by making
> NamedArray parameterized by the type of the key in the dict as well.
Right. Sounds reasonable.
> 2. Currently, it only accepts a limited set of types for
> indices, e.g. Real and String. But in some cases, people may
> go beyond this. I don't think we have to impose this limit.
>
>
> Ah, I now see what you mean. I thought I had built in support for
> all types as indices, but there obviously is no catch-all rule in
> getindex. I suppose NamedArray needs an update there.
I think the last time I looked into this, it was a problem even for
efficiently indexing AbstractArrays:
https://github.com/JuliaLang/julia/pull/4892#issuecomment-31087910
Slow catch-all methods are good, but if we want specialized versions it
will probably need more work. If you want to accept combinations of
Int/String/Complement{T}/anything, the number of specialized methods to
generate explodes. I think the conclusion was that we needed to wait for
staged functions. Since they are implemented now, it may be a good time
to look into this issue for both AbstractArrays and NamedArrays.
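
To make the parameterization idea above concrete, here is a minimal
sketch. It is not the NamedArrays.jl implementation: the type name
AssocArray and its fields are hypothetical, all dimensions are assumed
to share one key type K, and the syntax is that of current Julia
(struct / where), which postdates this thread. It keeps one concretely
typed Dict{K,Int} per dimension and defines a single generic getindex
that translates keys to integer indices.

```julia
# Hypothetical sketch, not the NamedArrays.jl implementation.
# K is the key type shared by all dimensions (an assumption), so every
# lookup dict is concretely typed as Dict{K,Int} instead of a bare Dict.
struct AssocArray{T,N,K}
    data::Array{T,N}
    dicts::NTuple{N,Dict{K,Int}}
end

# Build from the data and one vector of names per dimension.
function AssocArray(data::Array{T,N}, names::Vararg{Vector{K},N}) where {T,N,K}
    dicts = map(v -> Dict{K,Int}(k => i for (i, k) in enumerate(v)), names)
    return AssocArray{T,N,K}(data, dicts)
end

# Generic lookup: each key is mapped to its integer position, then the
# underlying array is indexed with ordinary integers.
function Base.getindex(a::AssocArray{T,N,K}, ks::Vararg{K,N}) where {T,N,K}
    idx = ntuple(d -> a.dicts[d][ks[d]], N)
    return a.data[idx...]
end

A = AssocArray([1 2; 3 4], ["apple", "pear"], ["red", "green"])
A["pear", "red"]   # reads data[2, 1]
```

Mixing key types across dimensions (or mixing names with plain
integers) would need more type parameters or more methods, which is
exactly where the combinatorial problem mentioned above comes in.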
Regards
> On Monday, November 10, 2014 8:35:32 AM UTC+8, Dahua Lin
> wrote:
> I have been observing an interesting difference
> between people coming from stats and machine learning.
>
>
> Stats people tend to favor the approach that allows
> one to directly use the category names to index the
> table, e.g. A["apple"]. This tendency is clearly
> reflected in the design of R, where one can attach a
> name to everything.
>
>
> While in machine learning practice, it is a common
> convention to just encode categories into integers,
> and simply use an ordinary array to represent a
> counting table. Whereas it makes it a little bit
> inconvenient in an interactive environment, this way
> is generally more efficient when you have to deal with
> these categories over a large number of samples.
>
>
> These differences aside, I believe, however, that
> there exists a very generic approach to this problem --
> a multi-dimensional associative map, which allows one
> to write A[i1, i2, ...] where the indices can be
> arbitrary hashable & equality-comparable instances,
> including integers, strings, symbols, among many other
> things.
>
>
> A multi-dimensional associative map can be considered
> as a multi-dimensional generalization of dictionaries,
> which can be easily implemented via a
> multidimensional array and several dictionaries, each
> for one dimension, to map user-side indexes to integer
> indexes.
>
>
> - Dahua
>
> On Monday, November 10, 2014 8:12:54 AM UTC+8, David
> van Leeuwen wrote:
> Hi,
>
> On Sunday, November 9, 2014 5:10:19 PM UTC+1,
> Milan Bouchet-Valat wrote:
> Actually I didn't do it because
> NamedArrays.jl didn't work well on 0.3
> when I first worked on the package.
> Now I see the tests are still failing.
> Do you know what is needed to make
> them work?
>
>
> What exactly is not working? Could you maybe
> file an issue? Travis tells me all is fine.
>
>
> ---david
>
> Another point is that I think this
> deserves going into StatsBase, but
> before that we need everybody to agree
> on a design for NamedArrays.
>
> Regards
>
>
> > On Sunday, November 9, 2014 4:26:45
> > PM UTC+1, Milan Bouchet-Valat wrote:
> > On Thursday, November 6, 2014 at 11:17 -0800, Conrad Stack wrote:
> > > I was also looking for a
> > > function like this, but
> > > could not find one in
> > > docs.julialang.org. I was
> > > doing this (v0.4.0-dev),
> > > for anyone who is
> > > interested:
> > >
> > >
> > > example = rand(1:10,100)
> > > uexample = sort(unique(example))
> > > counts = map(x->count(y->x==y,example),uexample)
> > >
> > >
> > > It's pretty ugly, so
> > > thanks, Johan, for
> > > pointing out the
> > > StatsBase->countmap
> > I've also put together a
> > small package precisely
> > aimed at offering an
> > equivalent of R's table():
> >
> > https://github.com/nalimilan/Tables.jl
> >
> > But there's a more general
> > issue about how to handle
> > arrays with dimension names
> > in Julia. NamedArrays.jl
> > (which is used in my
> > package) attempts to tackle
> > this issue, but I don't
> > think we've reached a
> > consensus yet about the best
> > solution.
> >
> >
> > Regards
> >
> > >
> > >
> > >
> > > On Sunday, August 17, 2014
> > > 9:56:29 AM UTC-4, Johan
> > > Sigfrids wrote:
> > > I think countmap
> > > comes closest to
> > > giving you what
> > > you want:
> > >
> > > using StatsBase
> > > data = sample(["a", "b", "c"], 20)
> > > countmap(data)
> > >
> > > Dict{ASCIIString,Int64} with 3 entries:
> > > "c" => 3
> > > "b" => 10
> > > "a" => 7
> > >
> > > On Sunday, August
> > > 17, 2014 4:45:21
> > > PM UTC+3, Florian
> > > Oswald wrote:
> > > Hi
> > >
> > > I'm looking for the best way to count how many times a certain
> > > value x_i appears in vector x, where x could be integers, floats,
> > > strings. In R I would do table(x). I found StatsBase.counts(x,k)
> > > but I'm a bit confused by k (where k goes into 1:k, i.e. the
> > > vector is scanned to find how many elements fall at each point
> > > of 1:k). Most of the time I don't know k, and in fact I would do
> > > table(x) just to find out what k was. Apart from that, I don't
> > > think I could use this with strings, as I can't construct a range
> > > object from strings.
> > >
> > > I'm wondering whether a method StatsBase.counts(x::Vector) just
> > > returning the frequency of each element appearing would be useful.
> > >
> > > The same applies to Base.hist if I understand correctly. I just
> > > don't want to have to specify the edges of bins.
> > >