Re: [julia-users] Re: what's the best way to do R table() in julia? (why does StatsBase.count(x,k) need k?)

David van Leeuwen Sun, 09 Nov 2014 23:48:46 -0800

Hello, 

On Monday, November 10, 2014 1:43:57 AM UTC+1, Dahua Lin wrote:
>
> NamedArrays.jl generally goes along this way. However, it remains limited 
> in two aspects:
>
> 1. Some fields in NamedArrays are not declared of specific types. In 
> particular, the field `dicts` is of the type `Vector{Dict}`, and the use of 
> this field is on the critical path when looping over the table, e.g. when 
> counting. This would potentially lead to substantial impact on performance.
>
In the beginning I experimented with indexing speed, mainly to 
sort out the various forms of getindex(), and although I don't remember 
the exact result, I do remember that I found the drop in performance w.r.t. 
integer indexing surprisingly small.


I suppose the problem you indicate can be alleviated by making NamedArray 
parameterized by the type of the key in the dict as well.  
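
A minimal sketch of that idea. The types `LooseNamed` and `TypedNamed` and the helper `lookup` are hypothetical and are not the actual NamedArrays.jl definitions; for brevity the sketch also assumes one key type `K` shared by all dimensions.

```julia
# Hypothetical types for illustration only; not the NamedArrays.jl definitions.

# Loosely typed: the element type `Dict` is not concrete, so the compiler
# cannot infer the key or value type of a lookup.
struct LooseNamed{T,N}
    array::Array{T,N}
    dicts::Vector{Dict}
end

# Parameterized by the key type K: every dictionary has a concrete type.
struct TypedNamed{T,N,K}
    array::Array{T,N}
    dicts::NTuple{N,Dict{K,Int}}
end

loose = LooseNamed([1 2; 3 4],
                   Dict[Dict("a" => 1, "b" => 2), Dict("x" => 1, "y" => 2)])
typed = TypedNamed([1 2; 3 4],
                   (Dict("a" => 1, "b" => 2), Dict("x" => 1, "y" => 2)))

# Translate names to integer positions, then index the plain array.
lookup(A, i, j) = A.array[A.dicts[1][i], A.dicts[2][j]]

lookup(loose, "b", "x")   # 3
lookup(typed, "b", "x")   # 3, with a lookup the compiler can fully infer
```

Both variants return the same values; the difference is only in what the compiler can infer inside a hot loop, so any speed-up would have to be measured.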

> 2. Currently, it only accepts a limited set of types for indices, e.g. Real 
> and String. But in some cases, people may go beyond this. I don't think we 
> have to impose this limit. 
>
Ah---I now see what you mean.  I thought I had built in support for all 
types as index, but there is obviously no catch-all rule in getindex.  I 
suppose NamedArray needs an update there. 
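
A catch-all rule could look roughly like the sketch below. `NamedVec` is a hypothetical one-dimensional wrapper, not the NamedArrays.jl type, and the `Dict{Any,Int}` name table is an assumption made to keep the example short.

```julia
# Hypothetical wrapper for illustration; not the NamedArrays.jl API.
struct NamedVec{T}
    data::Vector{T}
    names::Dict{Any,Int}   # maps any hashable key to a position
end

# Integer indices go straight to the underlying array.
Base.getindex(v::NamedVec, i::Integer) = v.data[i]

# Catch-all: every other key type is looked up in the dictionary.
Base.getindex(v::NamedVec, key) = v.data[v.names[key]]

v = NamedVec([10, 20, 30], Dict{Any,Int}(:a => 1, "b" => 2, (1, 2) => 3))

v[1]        # 10, positional
v[:a]       # 10, Symbol key
v["b"]      # 20, String key
v[(1, 2)]   # 30, Tuple key
```

One design question this leaves open is what to do when the names themselves are integers, since the positional method then takes precedence over the name lookup.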

---david
 

> Dahua
>
>
> On Monday, November 10, 2014 8:35:32 AM UTC+8, Dahua Lin wrote:
>>
>> I have been observing an interesting difference between people coming 
>> from stats and machine learning.
>>
>> Stats people tend to favor the approach that allows one to directly use 
>> the category names to index the table, e.g. A["apple"]. This tendency is 
>> clearly reflected in the design of R, where one can attach a name to 
>> everything.
>>
>> In machine learning practice, by contrast, it is a common convention to 
>> just encode categories into integers and simply use an ordinary array to 
>> represent a counting table. Although this is a little inconvenient in an 
>> interactive environment, it is generally more efficient when you have to 
>> deal with these categories over a large number of samples.
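
The integer-encoding convention described here can be sketched with Base Julia alone. The sample data and variable names are made up for illustration.

```julia
# Made-up sample data for illustration.
samples = ["b", "a", "c", "a", "b", "a"]

# Encode each category as an integer code in 1:length(levels).
levels = sort(unique(samples))                        # ["a", "b", "c"]
code = Dict(l => i for (i, l) in enumerate(levels))
encoded = [code[s] for s in samples]                  # [2, 1, 3, 1, 2, 1]

# The counting table is then an ordinary integer array indexed by code.
table = zeros(Int, length(levels))
for c in encoded
    table[c] += 1
end

table   # [3, 2, 1], i.e. counts for "a", "b", "c"
```

The dictionary is only consulted once per sample during encoding; every later pass over `encoded` uses plain integer indexing.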
>>
>> These differences aside, I believe, however, that there exists a very 
>> generic approach to this problem -- a multi-dimensional associative map, 
>> which allows one to write A[i1, i2, ...] where the indices can be arbitrary 
>> hashable & equality-comparable instances, including integers, strings, 
>> symbols, among many other things.
>>
>> A multi-dimensional associative map can be considered as a 
>> multi-dimensional generalization of dictionaries, which can be easily 
>> implemented via a multidimensional array and several dictionaries, each 
>> for one dimension, to map user-side indexes to integer indexes. 
>>
>> - Dahua
>>
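
A rough sketch of such a multi-dimensional associative map, built from one array plus one dictionary per dimension. The names `AssocArray`, `assoc_zeros` and `positions` are hypothetical, and using `Dict{Any,Int}` for every dimension is a simplifying assumption rather than a recommendation.

```julia
# Illustrative sketch; `AssocArray` is a hypothetical name, not a package type.
struct AssocArray{T,N}
    data::Array{T,N}
    index::NTuple{N,Dict{Any,Int}}   # one key => position map per dimension
end

# Build a zero-filled map from one collection of labels per dimension.
function assoc_zeros(::Type{T}, labels...) where {T}
    index = map(l -> Dict{Any,Int}(k => i for (i, k) in enumerate(l)), labels)
    return AssocArray(zeros(T, map(length, labels)...), index)
end

# Translate each user-side key to an integer position.
positions(A::AssocArray, ks) = map((d, k) -> d[k], A.index, ks)

Base.getindex(A::AssocArray{T,N}, ks::Vararg{Any,N}) where {T,N} =
    A.data[positions(A, ks)...]
Base.setindex!(A::AssocArray{T,N}, v, ks::Vararg{Any,N}) where {T,N} =
    (A.data[positions(A, ks)...] = v)

A = assoc_zeros(Int, ["apple", "pear"], [:red, :green, :yellow])
A["apple", :red] += 1
A["pear", :yellow] += 2

A["pear", :yellow]   # 2
```

Strings index the first dimension and symbols the second; any hashable, equality-comparable key would work the same way.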
>>
>>
>>
>> On Monday, November 10, 2014 8:12:54 AM UTC+8, David van Leeuwen wrote:
>>>
>>> Hi, 
>>>
>>> On Sunday, November 9, 2014 5:10:19 PM UTC+1, Milan Bouchet-Valat wrote:
>>>
>>>> Actually I didn't do it because NamedArrays.jl didn't work well on 0.3 
>>>> when I first worked on the package. Now I see the tests are still failing. 
>>>> Do you know what is needed to make them work?
>>>>
>>> What exactly is not working? Could you maybe file an issue?  Travis 
>>> tells me all is fine. 
>>>
>>> ---david
>>>  
>>>
>>>> Another point is that I think this deserves going into StatsBase, but 
>>>> before that we need everybody to agree on a design for NamedArrays.
>>>>
>>>> Regards
>>>>
>>>>
>>>>  On Sunday, November 9, 2014 4:26:45 PM UTC+1, Milan Bouchet-Valat 
>>>> wrote: 
>>>>
>>>>  On Thursday, November 6, 2014 at 11:17 -0800, Conrad Stack wrote: 
>>>>
>>>> I was also looking for a function like this, but could not find one in 
>>>> docs.julialang.org.  I was doing this (v0.4.0-dev), for anyone who is 
>>>> interested:
>>>>
>>>>
>>>> example = rand(1:10,100)
>>>> uexample = sort(unique(example))
>>>> counts = map(x->count(y->x==y,example),uexample)
>>>>
>>>>
>>>> It's pretty ugly, so thanks, Johan, for pointing out 
>>>> StatsBase.countmap.
>>>>
>>>> I've also put together a small package precisely aimed at offering an 
>>>> equivalent of R's table():
>>>> https://github.com/nalimilan/Tables.jl
>>>>
>>>> But there's a more general issue about how to handle arrays with 
>>>> dimension names in Julia. NamedArrays.jl (which is used in my package) 
>>>> attempts to tackle this issue, but I don't think we've reached a consensus 
>>>> yet about the best solution.
>>>>
>>>>
>>>> Regards
>>>>
>>>>  
>>>>
>>>>
>>>> On Sunday, August 17, 2014 9:56:29 AM UTC-4, Johan Sigfrids wrote:
>>>>
>>>> I think countmap comes closest to giving you what you want:
>>>>
>>>> using StatsBase
>>>> data = sample(["a", "b", "c"], 20)
>>>> countmap(data)
>>>>
>>>>
>>>> Dict{ASCIIString,Int64} with 3 entries:
>>>>   "c" => 3
>>>>   "b" => 10
>>>>   "a" => 7
>>>>
>>>>
>>>> On Sunday, August 17, 2014 4:45:21 PM UTC+3, Florian Oswald wrote: 
>>>>
>>>> Hi 
>>>>
>>>>
>>>> I'm looking for the best way to count how many times a certain value 
>>>> x_i appears in a vector x, where x could contain integers, floats, or 
>>>> strings. In R I would do table(x). I found StatsBase.counts(x,k) but I'm 
>>>> a bit confused by k (where k goes into 1:k, i.e. the vector is scanned to 
>>>> find how many elements fall on each point of 1:k). Most of the time I 
>>>> don't know k, and in fact I would do table(x) just to find out what k 
>>>> was. Apart from that, I don't think I could use this with strings, as I 
>>>> can't construct a range object from strings. 
>>>>
>>>>
>>>> I'm wondering whether a method StatsBase.counts(x::Vector) just 
>>>> returning the frequency of each element appearing would be useful. 
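
A method of that kind could be sketched in Base Julia as below. `tabulate` is a hypothetical helper name and is not part of StatsBase; it only illustrates counting without knowing k in advance.

```julia
# Hypothetical helper; not part of StatsBase.
# Counts each distinct element in one pass and returns value => count pairs
# sorted by value, loosely like R's table(x).
function tabulate(x)
    freq = Dict{eltype(x),Int}()
    for v in x
        freq[v] = get(freq, v, 0) + 1
    end
    return sort!(collect(freq); by = first)
end

tabulate(["b", "a", "c", "a", "b", "a"])   # ["a" => 3, "b" => 2, "c" => 1]
tabulate([2.5, 1.0, 2.5])                  # [1.0 => 1, 2.5 => 2]
```

Because the keys are discovered while scanning, the same function works for integers, floats and strings, as long as the elements can be hashed and sorted.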
>>>>
>>>>
>>>> The same applies to Base.hist if I understand correctly. I just don't 
>>>> want to have to specify the edges of bins. 
>>>>
