Combining Brian's ideas with the Rosetta Code solution, I came up with
this adverb, which I quite like:

Idotr=: |.@[ (#@[ - I.) ]
binnedData=: {{
  bidx=. i.@>:@# x                    NB. indices of bins
  NB. apply u to the data in each bin, after dropping the first value
  x (Idotr (u@}./.)&(bidx&,) ]) y
}}
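
For anyone following along, here is a small sketch of what Idotr changes
relative to plain I. on boundary values. The names lims and vals and their
contents are made up for illustration; they are not from the thread.

```j
Idotr=: |.@[ (#@[ - I.) ]      NB. same definition as above

lims=: 10 20 30                NB. hypothetical limits
vals=: 5 10 15 20 30 35        NB. hypothetical data, including boundary values

echo lims I. vals              NB. 0 0 1 1 2 3  (bins closed on the right)
echo lims Idotr vals           NB. 0 1 1 2 3 3  (bins closed on the left)
```

A value equal to a limit (10, 20 or 30 here) lands in the bin that starts
at that limit when Idotr is used, and in the bin that ends there with I.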

< binnedData          NB. box binned data
# binnedData          NB. tally binned data
(+/ % #) binnedData   NB. average binned data
So histogram could be:
histogram=: # binnedData
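
A quick worked example of the adverb on toy data, assuming J 9.02 or later
for the {{ }} direct definition. The names lims and vals and their values
are hypothetical, chosen so that each bin is easy to check by hand.

```j
Idotr=: |.@[ (#@[ - I.) ]
binnedData=: {{
  bidx=. i.@>:@# x                    NB. one index per bin, prepended as keys
  x (Idotr (u@}./.)&(bidx&,) ]) y     NB. u sees each bin minus its prepended index
}}
histogram=: # binnedData

lims=: 10 20 30                       NB. hypothetical limits: 4 bins
vals=: 5 10 15 20 30 35               NB. hypothetical data

echo lims histogram vals              NB. 1 2 1 2
echo lims (+/ % #) binnedData vals    NB. 5 12.5 20 32.5
```

Prepending bidx to both the keys and the data is what guarantees that every
bin shows up, in order, even when no data falls into it.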

However, adding this flexibility costs us the performance bonus of the
special code for #/. (see the timings below).
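
For comparison, a counts-only verb can keep the counting in the plain #/.
form, which is the phrase the special code targets. This is only a sketch:
histKey is a hypothetical name and is not necessarily how histogram2 is
written, and the toy limits and data are made up.

```j
Idotr=: |.@[ (#@[ - I.) ]

NB. histKey (hypothetical): prepend one key per bin, count with #/.~ ,
NB. then subtract the prepended key from each count
histKey=: <:@(#/.~)@(i.@>:@#@[ , Idotr)

echo 10 20 30 histKey 5 10 15 20 30 35    NB. 1 2 1 2
```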

   limits=: 14 18 249 312 389 392 513 591 634 720
   data=: 1e6 ?@$ 1000
   50 timespacex ' limits2 # binnedData data'
0.0135251 2.517e7
   50 timespacex ' limits2 histogram2 data'
0.00884346 1.67788e7



On Sun, Apr 11, 2021 at 2:21 PM Ric Sherlock <[email protected]> wrote:

> Good question Brian and interesting discussion.
>
> I agree that the key to the problem is to agree exactly what the desired
> behaviour of histogram should be.
> Mathematical/statistical convention seems to be that intervals should be
> closed on the left and open on the right. This makes life a bit harder for
> us because I. calculates intervals that are natively open on the left and
> closed on the right. Luckily Brian has solved this problem for us with his
> Idotr verb.
>
> The next issue to resolve is how to specify the intervals. That is, what
> format of bins/limits should the desired verb histogram expect and how
> should it interpret them. Should the upper and/or lower limits be provided
> or should __ and _ be implied, as Gilles suggests?
>
> I think my preference would be to imply __ and _. A histogram is usually
> generated to understand what the data looks like. For that reason I'd
> prefer for all data to be included and not to have to spend too much
> (any!) time working out sensible bin values.
>
> Which leads to ... another nice thing to add would be a verb (calcBins ?)
> to calculate the limits for a desired number of bins. If we decided that
> the upper and lower limits should be explicitly specified, then calcBins
> could obviously determine the min and max of the data to ensure that
> everything is included; however, outliers may cause some issues.
>
> BinCounts=: Limits histogram Data
> BinCounts=: 10 (calcBins histogram ]) Data
>
> I recently added a J solution for the "Bin given limits" task on Rosetta
> Code: https://rosettacode.org/wiki/Bin_given_limits
> Brian's histogram2 gives the same bin counts as my solution there, is more
> performant for just calculating the counts and will work better for
> non-integer data. It would currently be my preferred solution for the
> stats/base library.
>
> On Sun, Apr 11, 2021 at 9:33 AM Brian Schott <[email protected]>
> wrote:
>
>> As you thought, Steve Jost kindly confirmed.
>> I had never heard of the upper bin being treated differently.
>>
>> On Sat, Apr 10, 2021 at 1:59 PM Raul Miller <[email protected]>
>> wrote:
>>
>> > I give the English sentence precedence over the label for a specific
>> > row in a table.
>> >
>> > For the example treated by that table, it does not change the numbers.
>> >
>> > That said, I suppose it would be worth talking with Steve Jost about
>> > this issue. He likely has references worth reading that led him to
>> > write that sentence, and he might even like to hear someone having
>> > noticed the conflict in his treatment of that topic.
>> >
>> > Thanks,
>> >
>> > --
>> > Raul
>> >
>> > --
>> (B=)
>> ----------------------------------------------------------------------
>> For information about J forums see http://www.jsoftware.com/forums.htm
>>
>
