Hello,
On Monday, November 10, 2014 1:43:57 AM UTC+1, Dahua Lin wrote:
>
> NamedArrays.jl generally goes along this way. However, it remains limited
> in two aspects:
>
> 1. Some fields in NamedArrays are not declared of specific types. In
> particular, the field `dicts` is of the type `Vector{Dict}`, and the use of
> this field is on the critical path when looping over the table, e.g. when
> counting. This would potentially lead to substantial impact on performance.
>

In the beginning I experimented with indexing speed, mainly to sort out
the various forms of getindex(), and although I don't remember the exact
result, I do remember that the drop in performance relative to integer
indexing was surprisingly small.

I suppose the problem you indicate can be alleviated by making NamedArray
parameterized by the type of the key in the dict as well.
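
A minimal sketch of that idea, assuming current Julia 1.x syntax. The
names `KeyedVector` and `lookup` are hypothetical and only illustrate the
principle in one dimension; this is not the NamedArrays.jl implementation.

```julia
# Hypothetical 1-d illustration; not the NamedArrays.jl implementation.
# The key type K is a type parameter, so the dict field has the concrete
# type Dict{K,Int} instead of the abstract Dict.
struct KeyedVector{T,K}
    data::Vector{T}
    dict::Dict{K,Int}   # maps a user-side key to an integer position
end

# Build the key -> position dictionary from a list of names.
function KeyedVector(data::Vector{T}, names::Vector{K}) where {T,K}
    length(data) == length(names) || throw(ArgumentError("length mismatch"))
    return KeyedVector{T,K}(data, Dict{K,Int}(k => i for (i, k) in enumerate(names)))
end

# Look a key up and return the value stored at its position.
lookup(v::KeyedVector{T,K}, key::K) where {T,K} = v.data[v.dict[key]]

v = KeyedVector([7, 10, 3], ["a", "b", "c"])
lookup(v, "b")
```

Because the field type is concrete, the lookup on the critical path can be
specialized by the compiler; how much this gains in practice would have to
be measured.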
> 2. Currently, it only accepts a limited set of types for indices, e.g. Real
> and String. But in some cases, people may go beyond this. I don't think we
> have to impose this limit.
>

Ah, I now see what you mean. I thought I had built in support for all
types as index, but there obviously is no catch-all rule in getindex. I
suppose NamedArray needs an update there.
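
A minimal sketch of what such a catch-all rule could look like, assuming
current Julia 1.x syntax. `AssocArray` is a hypothetical type made up for
this illustration (an ordinary array plus one dictionary per dimension); it
is not the NamedArrays.jl code.

```julia
# Hypothetical sketch; not the NamedArrays.jl implementation.
# One dictionary per dimension maps arbitrary hashable keys to integer
# positions in an ordinary array.
struct AssocArray{T,N}
    data::Array{T,N}
    dicts::NTuple{N,Dict{Any,Int}}   # Any keeps the sketch short; it is not type-stable
end

function AssocArray(data::Array{T,N}, names::Vararg{AbstractVector,N}) where {T,N}
    for d in 1:N
        size(data, d) == length(names[d]) ||
            throw(ArgumentError("dimension $d: length mismatch"))
    end
    dicts = ntuple(d -> Dict{Any,Int}(k => i for (i, k) in enumerate(names[d])), N)
    return AssocArray{T,N}(data, dicts)
end

# Catch-all rule: a key of any type is accepted and resolved through the
# dictionary of its dimension; an unknown key raises a KeyError.
function Base.getindex(a::AssocArray{T,N}, ks::Vararg{Any,N}) where {T,N}
    return a.data[ntuple(d -> a.dicts[d][ks[d]], N)...]
end

A = AssocArray([1 2; 3 4], [:x, :y], ["apple", 2.5])
A[:y, "apple"]
```

Note that in this sketch integers are treated as keys like any other value,
not as positions, so plain integer indexing would need its own method.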
---david
> Dahua
>
>
> On Monday, November 10, 2014 8:35:32 AM UTC+8, Dahua Lin wrote:
>>
>> I have been observing an interesting difference between people coming
>> from stats and machine learning.
>>
>> Stats people tend to favor the approach that allows one to directly use
>> the category names to index the table, e.g. A["apple"]. This tendency is
>> clearly reflected in the design of R, where one can attach a name to
>> everything.
>>
>> In machine learning practice, by contrast, it is a common convention to
>> just encode categories as integers and use an ordinary array to
>> represent a counting table. While this is a little inconvenient
>> in an interactive environment, it is generally more efficient when
>> you have to deal with these categories over a large number of samples.
>>
>> These differences aside, I believe, however, that there exists a very
>> generic approach to this problem -- a multi-dimensional associative map,
>> which allows one to write A[i1, i2, ...] where the indices can be arbitrary
>> hashable & equality-comparable instances, including integers, strings,
>> symbols, among many other things.
>>
>> A multi-dimensional associative map can be considered as a
>> multi-dimensional generalization of dictionaries, which can be easily
>> implemented via a multidimensional array and several dictionaries, each
>> for one dimension, to map user-side indexes to integer indexes.
>>
>> - Dahua
>>
>>
>>
>>
>> On Monday, November 10, 2014 8:12:54 AM UTC+8, David van Leeuwen wrote:
>>>
>>> Hi,
>>>
>>> On Sunday, November 9, 2014 5:10:19 PM UTC+1, Milan Bouchet-Valat wrote:
>>>
>>>> Actually I didn't do it because NamedArrays.jl didn't work well on 0.3
>>>> when I first worked on the package. Now I see the tests are still failing.
>>>> Do you know what is needed to make them work?
>>>>
>>> What exactly is not working? Could you maybe file an issue? Travis
>>> tells me all is fine.
>>>
>>> ---david
>>>
>>>
>>>> Another point is that I think this deserves going into StatsBase, but
>>>> before that we need everybody to agree on a design for NamedArrays.
>>>>
>>>> Regards
>>>>
>>>>
>>>> On Sunday, November 9, 2014 4:26:45 PM UTC+1, Milan Bouchet-Valat
>>>> wrote:
>>>>
>>>> On Thursday, November 6, 2014 at 11:17 -0800, Conrad Stack wrote:
>>>>
>>>> I was also looking for a function like this, but could not find one in
>>>> docs.julialang.org. I was doing this (v0.4.0-dev), for anyone who is
>>>> interested:
>>>>
>>>>
>>>> example = rand(1:10,100)
>>>> uexample = sort(unique(example))
>>>> counts = map(x->count(y->x==y,example),uexample)
>>>>
>>>>
>>>> It's pretty ugly, so thanks, Johan, for pointing out
>>>> StatsBase.countmap.
>>>>
>>>> I've also put together a small package precisely aimed at offering an
>>>> equivalent of R's table():
>>>> https://github.com/nalimilan/Tables.jl
>>>>
>>>> But there's a more general issue about how to handle arrays with
>>>> dimension names in Julia. NamedArrays.jl (which is used in my package)
>>>> attempts to tackle this issue, but I don't think we've reached a consensus
>>>> yet about the best solution.
>>>>
>>>>
>>>> Regards
>>>>
>>>>
>>>>
>>>>
>>>> On Sunday, August 17, 2014 9:56:29 AM UTC-4, Johan Sigfrids wrote:
>>>>
>>>> I think countmap comes closest to giving you what you want:
>>>>
>>>> using StatsBase
>>>> data = sample(["a", "b", "c"], 20)
>>>> countmap(data)
>>>>
>>>>
>>>> Dict{ASCIIString,Int64} with 3 entries:
>>>> "c" => 3
>>>> "b" => 10
>>>> "a" => 7
>>>>
>>>>
>>>> On Sunday, August 17, 2014 4:45:21 PM UTC+3, Florian Oswald wrote:
>>>>
>>>> Hi
>>>>
>>>>
>>>> I'm looking for the best way to count how many times a certain value
>>>> x_i appears in vector x, where x could be integers, floats, strings. In R
>>>> I would do table(x). I found StatsBase.counts(x,k) but I'm a bit confused
>>>> by k (where k goes into 1:k, i.e. the vector is scanned to find how many
>>>> elements fall on each point of 1:k). Most of the time I don't know k,
>>>> and in fact I would do table(x) just to find out what k is. Apart from
>>>> that, I don't think I could use this with strings, as I can't construct a
>>>> range object from strings.
>>>>
>>>>
>>>> I'm wondering whether a method StatsBase.counts(x::Vector) just
>>>> returning the frequency of each element appearing would be useful.
>>>>
>>>>
>>>> The same applies to Base.hist if I understand correctly. I just don't
>>>> want to have to specify the edges of bins.
>>>>