Re: [math] random boolean arrays

Phil Steitz Mon, 13 Jul 2015 16:14:03 -0700

On 7/13/15 1:19 PM, Otmar Ertl wrote:
> On Mon, Jul 13, 2015 at 9:59 PM, Phil Steitz <[email protected]> wrote:
>> On 7/13/15 12:55 PM, Phil Steitz wrote:
>>> On 7/13/15 12:16 PM, Otmar Ertl wrote:
>>>> On Mon, Jul 13, 2015 at 3:51 PM, Gilles <[email protected]>
>>>> wrote:
>>>>> On Mon, 13 Jul 2015 06:30:44 -0700, Phil Steitz wrote:
>>>>>> On 7/12/15 9:45 PM, Otmar Ertl wrote:
>>>>>>> On Sun, Jul 12, 2015 at 8:16 PM, Gilles <[email protected]>
>>>>>>> wrote:
>>>>>>>> On Sun, 12 Jul 2015 19:38:45 +0200, Thomas Neidhart wrote:
>>>>>>>>> On 07/12/2015 04:58 PM, Phil Steitz wrote:
>>>>>>>>>> On 7/12/15 2:50 AM, Thomas Neidhart wrote:
>>>>>>>>>>> On 07/11/2015 09:43 PM, Phil Steitz wrote:
>>>>>>>>>>>> On 7/11/15 12:29 PM, Thomas Neidhart wrote:
>>>>>>>>>>>>> On 07/11/2015 09:08 PM, Phil Steitz wrote:
>>>>>>>>>>>>>> The code implemented in MATH-1242 to improve performance of KS
>>>>>>>>>>>>>> monteCarloP in-lines efficient generation of random boolean
>>>>>>>>>>>>>> arrays.
>>>>>>>>>>>>>> Unfortunately, I think the implementation is not quite random
>>>>>>>>>>>>>> (see
>>>>>>>>>>>>>> comments on the ticket). To verify it, we need to be able to test
>>>>>>>>>>>>>> the random boolean array generation directly. To do that, we have
>>>>>>>>>>>>>> to either expose the method (at least as protected) in the KS
>>>>>>>>>>>>>> class
>>>>>>>>>>>>>> or add it somewhere else. I propose the latter but am not sure
>>>>>>>>>>>>>> where to put it. For speed, we need to avoid array copies, so the
>>>>>>>>>>>>>> API will have to be something like randomize(boolean[], nTrue).
>>>>>>>>>>>>>> It
>>>>>>>>>>>>>> could go in the swelling MathArrays class, or
>>>> RandomDataGenerator.
>>>>>>>>>>>>>> The latter probably makes more sense, but the API does not fit
>>>> too
>>>>>>>>>>>>>> well. Any ideas?
>>>>>>>>>>>>> If it is just for testing purposes, you can also access the method
>>>>>>>>>>>>> in
>>>>>>>>>>>>> question via reflection, see an example here:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>> http://stackoverflow.com/questions/34571/how-to-test-a-class-that-has-private-methods-fields-or-inner-classes
>>>>>>>>>>>> Do you think it *should* be a private method of the K-S class?
>>>>>>>>>>> Right now, I do not see much uses outside the class, but if we
>>>> decide
>>>>>>>>>>> to
>>>>>>>>>>> make it public then I would prefer a special util class in the
>>>> random
>>>>>>>>>>> package to avoid cluttering the MathArrays class.
>>>>>>>>>> OK, for now we can make it private and use the reflection method
>>>>>>>>>> above to test it.
>>>>>>>>> ok, but I guess it is also fine to make it package private as sebb
>>>>>>>>> suggested. We did something similar recently for some of the improved
>>>>>>>>> sampling methods provided by Otmar.
>>>>>>>>>
>>>>>>>>>>> Regarding the RandomDataGenerator: I think this class should be
>>>>>>>>>>> deprecated and replaced by a Sampler interface as proposed by
>>>> Gilles.
>>>>>>>>>> Please consider keeping this class. Consider this a user request.
>>>>>>>>>> I have quite a few applications that use this class for two reasons:
>>>>>>>>> ok, the reason why I thought the class should be deprecated is because
>>>>>>>>> it was not kept up-to-date with all the new discrete and continuous
>>>>>>>>> distributions that we added in the last 2-3 years. If you think it is
>>>>>>>>> useful, then we can keep it of course.
>>>>>>>>>
>>>>>>>>>> 1. One object instance tied to one PRNG that generates data from
>>>>>>>>>> multiple different distributions. This is convenient. Sure, I
>>>>>>>>>> could refactor all of these apps to instantiate new objects for each
>>>>>>>>>> type of generated data and hopefully still be able to peg them to
>>>>>>>>>> one PRNG; but that is needless work that also complicates the code.
>>>>>>>>>>
>>>>>>>>>> 2. There are quite a few methods in this class that have nothing to
>>>>>>>>>> do with sampling (nextPermutation, nextHexString, nextSecureXxx,
>>>>>>>>>> etc) but which conveniently share the RandomGenerator. I guess the
>>>>>>>>>> utility methods get moved out somewhere else. Again, I end up
>>>>>>>>>> having to refactor all of my code that uses them and when I want
>>>>>>>>>> simulations to be based on a single PRNG, I have to find a way to
>>>>>>>>>> pass the RandomGenerator around to them.
>>>>>>>>>>
>>>>>>>>>> I don't yet see the need to refactor the sampling support in the
>>>>>>>>>> distributions package; but as my own apps are not impacted by this,
>>>>>>>>>> if everyone else sees the user impact of the refactoring as
>>>>>>>>>> outweighed by the benefit, I won't stand in the way. Please lets
>>>>>>>>>> just keep the RandomDataGenerator convenience class in the random
>>>>>>>>>> package in any case. I will take care of whatever adjustments are
>>>>>>>>>> needed to adapt to whatever we settle on for sampling in the
>>>>>>>>>> distributions package.
>>>>>>>>> Well, it is not really necessary to do everything together and
>>>> refactor
>>>>>>>>> the distributions.
>>>>>>>>>
>>>>>>>>> Probably it is better to start the other way round, and describe what
>>>> I
>>>>>>>>> want to add, and see how other things fit in:
>>>>>>>>>
>>>>>>>>> * I want a generic Sampler interface, i.e. something like this:
>>>>>>>>> ** nextSample()
>>>>>>>>> ** nextSamples(int size)
>>>>>>>>> ** nextSamples(double[] samples)
>>>>>>>> +1
>>>>>>>>
>>>> Do we really need all those 3 methods? If the functionality provided by the
>>>> two latter methods is essential for convenience reasons, we could also
>>>> offer utility functions that are able to fill an array with random numbers
>>>> from a given sampler, e.g. MathArrays.fill(array, sampler) or
>>>> array=MathArrays.generate(sampler, size).
>>> In some cases, there may be distribution-specific algorithms for
>>> generating sequences of values that are more efficient than just
>>> calling sample() repeatedly.  So while providing a default impl in
>>> an abstract sampler that just calls sample() repeatedly would make
>>> sense; it might also make sense to allow the distribution to
>>> override with something more efficient.   An example (not exposed
>>> now) is sampling from a Gaussian distribution.  You can see that
>>> under the covers, the BitStreamGenerator generates these in pairs.
> The simultaneous generation of multiple random values, as done by some
> algorithms (I only know the Box-Muller method), does not justify the
> extra interface methods. With negligible overhead the extra values can
> be cached and reused for subsequent calls. I prefer to keep interfaces
> as simple as possible.


+1 if its actually an interface we are talking about.  I was
thinking samplers would be abstract classes and in that case default
implementations could be provided for the array methods. 

Phil
>
> Otmar
>
>>>>>>>>> there could be a DiscreteSampler and ContinuousSampler interface to
>>>>>>>>> handle the cases for int / double values.
>>>>>>>> Perhaps the name should be IntegerSampler and DoubleSampler, to
>>>>>>>> accomodate
>>>>>>>> future needs (LongSampler, BooleanSampler (?)).
>>>>>>> I would prefer consistent names for distributions and corresponding
>>>>>>> samplers. The support of distributions must be of the same data type
>>>>>>> as the return values of the corresponding sampler. Therefore, I would
>>>>>>> call the samplers for RealDistribution and IntegerDistribution
>>>>>>> RealSampler and IntegerSampler, respectively.
>>>>> Ideally yes. But there is always the lingering question of what "Real"
>>>>> means: the mathematical abstraction or the numerical representation?
>>>>> The same "real" distributions could be implemented with "float"s.
>>>>> If we ever need implementing a discrete distribution that is able to
>>>>> use the "long" range, the "Integer" in "IntegerSampler" will clearly
>>>>> refer to the numerical type, not the mathematical abstraction of an
>>>>> "integer" number...
>>>>>
>>>>> Gilles
>>>>>
>>>>>
>>>> The underlying data type of RealDistribution implementations is implicitly
>>>> defined to be "double". It would not make sense that methods such as
>>>> getSupportLowerBound() return a different type than the associated sample
>>>> method. Therefore, in the context of RealDistribution "real" means "double"
>>>> at the moment. However, I am open for renaming RealDistribution as
>>>> DoubleDistribution with a sampler named DoubleSampler. If a float-based
>>>> implementation is really ever needed, interfaces named FloatDistribution
>>>> and FloatSampler could be introduced.
>>> Let's not change the names just to change them.  There is discussion
>>> in the archives that led us to call what is traditionally referred
>>> to as continuous, "Real" - mostly because we support some not
>>> absolutely continuous distribution functions among the non-discrete
>>> distributions.  We call discrete "Integer" because we force all
>>> discrete domains to map into the integers.  In the 10+ years we have
>>> offered both, we have never had a request to either support floats
>>> as representations of reals or any discrete distribution that
>>> requires the full Long value set.
>> More precisely the full Long value set as actual parameters to
>> distribution functions.  We already support distributions with
>> infinite value sets.
>>
>> Phil
>>> Phil
>>>>>> +1
>>>>>>
>>>>>> Phil
>>>>>>>>> The distributions could either be changed to return such a sampler as
>>>>>>>>> Gilles proposed (with the advantage that no random instance is tied to
>>>>>>>>> the distribution itself), or implement the interface directly (with
>>>> the
>>>>>>>>> advantage that we would not need to refactor too much).
>>>>>>>> Unless I'm missing something, the refactoring would be fairly the same:
>>>>>>>> The latter case needs implementing 3 methods (2 new ones, one with a
>>>>>>>> name
>>>>>>>> change).
>>>>>>>> The former needs implementing the factory method proposed in MATH-1158,
>>>>>>>> plus the same methods as above (wrapped in the object returned by the
>>>>>>>> factory method).
>>>>>>>>
>>>>>>>>
>>>>>>>> Gilles
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>>> One can then create a sampler for any distribution or from other
>>>>>>>>>>> sources, e.g. when needing a fast and efficient sampler without
>>>>>>>>>>> replacement (see MATH-1239).
>>>>>>>>>> +1 for sequential sampling. I don't follow exactly why that
>>>>>>>>>> requires refactoring the distributions; but if it helps in a way I
>>>>>>>>>> don't yet understand, that will help convince me that refactoring
>>>>>>>>>> sampling in the distributions package is worth the user pain.
>>>>>>>>> as I said above, I wanted to combine two things in one step, maybe it
>>>>>>>>> is
>>>>>>>>> better to go step by step.
>>>>>>>>>
>>>>>>>>> Thomas
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>>> For additional commands, e-mail: [email protected]
>>>>>>>>
>>>>>>> Otmar
>>>>>>>
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>> For additional commands, e-mail: [email protected]
>>>>>>>
>>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>> For additional commands, e-mail: [email protected]
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
>>>>>
>>>> Otmar
>>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [math] random boolean arrays

Reply via email to