On 7/13/15 1:19 PM, Otmar Ertl wrote: > On Mon, Jul 13, 2015 at 9:59 PM, Phil Steitz <[email protected]> wrote: >> On 7/13/15 12:55 PM, Phil Steitz wrote: >>> On 7/13/15 12:16 PM, Otmar Ertl wrote: >>>> On Mon, Jul 13, 2015 at 3:51 PM, Gilles <[email protected]> >>>> wrote: >>>>> On Mon, 13 Jul 2015 06:30:44 -0700, Phil Steitz wrote: >>>>>> On 7/12/15 9:45 PM, Otmar Ertl wrote: >>>>>>> On Sun, Jul 12, 2015 at 8:16 PM, Gilles <[email protected]> >>>>>>> wrote: >>>>>>>> On Sun, 12 Jul 2015 19:38:45 +0200, Thomas Neidhart wrote: >>>>>>>>> On 07/12/2015 04:58 PM, Phil Steitz wrote: >>>>>>>>>> On 7/12/15 2:50 AM, Thomas Neidhart wrote: >>>>>>>>>>> On 07/11/2015 09:43 PM, Phil Steitz wrote: >>>>>>>>>>>> On 7/11/15 12:29 PM, Thomas Neidhart wrote: >>>>>>>>>>>>> On 07/11/2015 09:08 PM, Phil Steitz wrote: >>>>>>>>>>>>>> The code implemented in MATH-1242 to improve performance of KS >>>>>>>>>>>>>> monteCarloP in-lines efficient generation of random boolean >>>>>>>>>>>>>> arrays. >>>>>>>>>>>>>> Unfortunately, I think the implementation is not quite random >>>>>>>>>>>>>> (see >>>>>>>>>>>>>> comments on the ticket). To verify it, we need to be able to test >>>>>>>>>>>>>> the random boolean array generation directly. To do that, we have >>>>>>>>>>>>>> to either expose the method (at least as protected) in the KS >>>>>>>>>>>>>> class >>>>>>>>>>>>>> or add it somewhere else. I propose the latter but am not sure >>>>>>>>>>>>>> where to put it. For speed, we need to avoid array copies, so the >>>>>>>>>>>>>> API will have to be something like randomize(boolean[], nTrue). >>>>>>>>>>>>>> It >>>>>>>>>>>>>> could go in the swelling MathArrays class, or >>>> RandomDataGenerator. >>>>>>>>>>>>>> The latter probably makes more sense, but the API does not fit >>>> too >>>>>>>>>>>>>> well. Any ideas? >>>>>>>>>>>>> If it is just for testing purposes, you can also access the method >>>>>>>>>>>>> in >>>>>>>>>>>>> question via reflection, see an example here: >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>> http://stackoverflow.com/questions/34571/how-to-test-a-class-that-has-private-methods-fields-or-inner-classes >>>>>>>>>>>> Do you think it *should* be a private method of the K-S class? >>>>>>>>>>> Right now, I do not see much uses outside the class, but if we >>>> decide >>>>>>>>>>> to >>>>>>>>>>> make it public then I would prefer a special util class in the >>>> random >>>>>>>>>>> package to avoid cluttering the MathArrays class. >>>>>>>>>> OK, for now we can make it private and use the reflection method >>>>>>>>>> above to test it. >>>>>>>>> ok, but I guess it is also fine to make it package private as sebb >>>>>>>>> suggested. We did something similar recently for some of the improved >>>>>>>>> sampling methods provided by Otmar. >>>>>>>>> >>>>>>>>>>> Regarding the RandomDataGenerator: I think this class should be >>>>>>>>>>> deprecated and replaced by a Sampler interface as proposed by >>>> Gilles. >>>>>>>>>> Please consider keeping this class. Consider this a user request. >>>>>>>>>> I have quite a few applications that use this class for two reasons: >>>>>>>>> ok, the reason why I thought the class should be deprecated is because >>>>>>>>> it was not kept up-to-date with all the new discrete and continuous >>>>>>>>> distributions that we added in the last 2-3 years. If you think it is >>>>>>>>> useful, then we can keep it of course. >>>>>>>>> >>>>>>>>>> 1. One object instance tied to one PRNG that generates data from >>>>>>>>>> multiple different distributions. This is convenient. Sure, I >>>>>>>>>> could refactor all of these apps to instantiate new objects for each >>>>>>>>>> type of generated data and hopefully still be able to peg them to >>>>>>>>>> one PRNG; but that is needless work that also complicates the code. >>>>>>>>>> >>>>>>>>>> 2. There are quite a few methods in this class that have nothing to >>>>>>>>>> do with sampling (nextPermutation, nextHexString, nextSecureXxx, >>>>>>>>>> etc) but which conveniently share the RandomGenerator. I guess the >>>>>>>>>> utility methods get moved out somewhere else. Again, I end up >>>>>>>>>> having to refactor all of my code that uses them and when I want >>>>>>>>>> simulations to be based on a single PRNG, I have to find a way to >>>>>>>>>> pass the RandomGenerator around to them. >>>>>>>>>> >>>>>>>>>> I don't yet see the need to refactor the sampling support in the >>>>>>>>>> distributions package; but as my own apps are not impacted by this, >>>>>>>>>> if everyone else sees the user impact of the refactoring as >>>>>>>>>> outweighed by the benefit, I won't stand in the way. Please lets >>>>>>>>>> just keep the RandomDataGenerator convenience class in the random >>>>>>>>>> package in any case. I will take care of whatever adjustments are >>>>>>>>>> needed to adapt to whatever we settle on for sampling in the >>>>>>>>>> distributions package. >>>>>>>>> Well, it is not really necessary to do everything together and >>>> refactor >>>>>>>>> the distributions. >>>>>>>>> >>>>>>>>> Probably it is better to start the other way round, and describe what >>>> I >>>>>>>>> want to add, and see how other things fit in: >>>>>>>>> >>>>>>>>> * I want a generic Sampler interface, i.e. something like this: >>>>>>>>> ** nextSample() >>>>>>>>> ** nextSamples(int size) >>>>>>>>> ** nextSamples(double[] samples) >>>>>>>> +1 >>>>>>>> >>>> Do we really need all those 3 methods? If the functionality provided by the >>>> two latter methods is essential for convenience reasons, we could also >>>> offer utility functions that are able to fill an array with random numbers >>>> from a given sampler, e.g. MathArrays.fill(array, sampler) or >>>> array=MathArrays.generate(sampler, size). >>> In some cases, there may be distribution-specific algorithms for >>> generating sequences of values that are more efficient than just >>> calling sample() repeatedly. So while providing a default impl in >>> an abstract sampler that just calls sample() repeatedly would make >>> sense; it might also make sense to allow the distribution to >>> override with something more efficient. An example (not exposed >>> now) is sampling from a Gaussian distribution. You can see that >>> under the covers, the BitStreamGenerator generates these in pairs. > The simultaneous generation of multiple random values, as done by some > algorithms (I only know the Box-Muller method), does not justify the > extra interface methods. With negligible overhead the extra values can > be cached and reused for subsequent calls. I prefer to keep interfaces > as simple as possible.
+1 if its actually an interface we are talking about. I was thinking samplers would be abstract classes and in that case default implementations could be provided for the array methods. Phil > > Otmar > >>>>>>>>> there could be a DiscreteSampler and ContinuousSampler interface to >>>>>>>>> handle the cases for int / double values. >>>>>>>> Perhaps the name should be IntegerSampler and DoubleSampler, to >>>>>>>> accomodate >>>>>>>> future needs (LongSampler, BooleanSampler (?)). >>>>>>> I would prefer consistent names for distributions and corresponding >>>>>>> samplers. The support of distributions must be of the same data type >>>>>>> as the return values of the corresponding sampler. Therefore, I would >>>>>>> call the samplers for RealDistribution and IntegerDistribution >>>>>>> RealSampler and IntegerSampler, respectively. >>>>> Ideally yes. But there is always the lingering question of what "Real" >>>>> means: the mathematical abstraction or the numerical representation? >>>>> The same "real" distributions could be implemented with "float"s. >>>>> If we ever need implementing a discrete distribution that is able to >>>>> use the "long" range, the "Integer" in "IntegerSampler" will clearly >>>>> refer to the numerical type, not the mathematical abstraction of an >>>>> "integer" number... >>>>> >>>>> Gilles >>>>> >>>>> >>>> The underlying data type of RealDistribution implementations is implicitly >>>> defined to be "double". It would not make sense that methods such as >>>> getSupportLowerBound() return a different type than the associated sample >>>> method. Therefore, in the context of RealDistribution "real" means "double" >>>> at the moment. However, I am open for renaming RealDistribution as >>>> DoubleDistribution with a sampler named DoubleSampler. If a float-based >>>> implementation is really ever needed, interfaces named FloatDistribution >>>> and FloatSampler could be introduced. >>> Let's not change the names just to change them. There is discussion >>> in the archives that led us to call what is traditionally referred >>> to as continuous, "Real" - mostly because we support some not >>> absolutely continuous distribution functions among the non-discrete >>> distributions. We call discrete "Integer" because we force all >>> discrete domains to map into the integers. In the 10+ years we have >>> offered both, we have never had a request to either support floats >>> as representations of reals or any discrete distribution that >>> requires the full Long value set. >> More precisely the full Long value set as actual parameters to >> distribution functions. We already support distributions with >> infinite value sets. >> >> Phil >>> Phil >>>>>> +1 >>>>>> >>>>>> Phil >>>>>>>>> The distributions could either be changed to return such a sampler as >>>>>>>>> Gilles proposed (with the advantage that no random instance is tied to >>>>>>>>> the distribution itself), or implement the interface directly (with >>>> the >>>>>>>>> advantage that we would not need to refactor too much). >>>>>>>> Unless I'm missing something, the refactoring would be fairly the same: >>>>>>>> The latter case needs implementing 3 methods (2 new ones, one with a >>>>>>>> name >>>>>>>> change). >>>>>>>> The former needs implementing the factory method proposed in MATH-1158, >>>>>>>> plus the same methods as above (wrapped in the object returned by the >>>>>>>> factory method). >>>>>>>> >>>>>>>> >>>>>>>> Gilles >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>>>> One can then create a sampler for any distribution or from other >>>>>>>>>>> sources, e.g. when needing a fast and efficient sampler without >>>>>>>>>>> replacement (see MATH-1239). >>>>>>>>>> +1 for sequential sampling. I don't follow exactly why that >>>>>>>>>> requires refactoring the distributions; but if it helps in a way I >>>>>>>>>> don't yet understand, that will help convince me that refactoring >>>>>>>>>> sampling in the distributions package is worth the user pain. >>>>>>>>> as I said above, I wanted to combine two things in one step, maybe it >>>>>>>>> is >>>>>>>>> better to go step by step. >>>>>>>>> >>>>>>>>> Thomas >>>>>>>> >>>>>>>> --------------------------------------------------------------------- >>>>>>>> To unsubscribe, e-mail: [email protected] >>>>>>>> For additional commands, e-mail: [email protected] >>>>>>>> >>>>>>> Otmar >>>>>>> >>>>>>> >>>>>>> --------------------------------------------------------------------- >>>>>>> To unsubscribe, e-mail: [email protected] >>>>>>> For additional commands, e-mail: [email protected] >>>>>>> >>>>>>> >>>>>> --------------------------------------------------------------------- >>>>>> To unsubscribe, e-mail: [email protected] >>>>>> For additional commands, e-mail: [email protected] >>>>> --------------------------------------------------------------------- >>>>> To unsubscribe, e-mail: [email protected] >>>>> For additional commands, e-mail: [email protected] >>>>> >>>> Otmar >>>> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [email protected] >> For additional commands, e-mail: [email protected] >> > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > > --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
