Thanks for the feedback and the contribution!
See comments interspersed below.
[EMAIL PROTECTED] wrote:
Hi
I've been looking around for open source mathematical statistics software in Java in the last couple of weeks and have tested the commons-math package.
After looking into it I have a few questions and comments.
1. The MathUtils.factorial(final int n) method throws an IllegalArgumentException for n=0 which is clearly incorrect since 0!=1 by definition.
Some may disagree with that definition (really amounts to a convention on how to handle vacuuous products); but I think that I agree with you. Unless others object, I will consider this a bug and make the change. Thanks for pointing this oddity out. Would you mind opening a bugzilla ticket to track this? <http://issues.apache.org/bugzilla/enter_bug.cgi?product=Commons&component=Math>
2. I think the distinction between discrete and continuous probability distributions manifested in the interfaces DiscreteDistribution and ContinuousDistribution is rather questionable.
Here I don't think I agree. Continuous and discrete distributions are fundamentally different. See below.
There is no requirement that a discrete distribution only takes on integer values so the methods of interface DiscreteDistribution doesn't cover all discrete distributions.
Technically, you are correct - a discrete random variable can take on any (countable) set of values. The key difference, however, is that for a discrete random variable X, if x is one of the values that it takes with positive probability, p(X=x) is non-zero. Therefore, it makes sense to have probability(x) defined (as it is) in the DiscreteDistribution interface, but absent from the ContinuousDistribution interface. More care also needs to be taken in the discrete case to define and implement p(a < X < b) type quantities including or excluding endpoints appropriately (since it makes a difference whether or not endpoints are included).
In fairness, the DiscreteDistribution interface only supports integer-valued distributions. The only ones that I have ever used are integer-valued, and any odd countable set can be mapped into the integers, so I don't personally view this as a serious limitation.
On
the other hand, all of the methods of ContinuousDistribution holds equally well for a discrete probability distribution.
With slightly different semantics, as noted above.
In my opinion, a more
appropriate approach would be to rename ContinuousDistribution to ProbabilityDistribution and drop the DiscreteDistribution interface. Of course, it could be practical to have convenience methods that takes integer arguments for the probability densities for certain distributions but then you can define a new interface IntegerValuedDistribution like
public interface IntegerValuedDistribution extends ProbabilityDistribution { double probability(int x); double cumulativeProbability(int x) throws MathException; }
I see your point here and it is legitimate. I am -0 to making the change, however, since the interface contract (for the base interface) would in practical terms be different for the discrete case, so I think it is better to keep them separate.
3. Wouldn't it be nice to have convenience methods for calculating the moments for each distribution? Something along the lines of public double getMoment(int order) throws MomentDoesNotExistException;
for calculating the moments EX^order (if they exist)? At least there could be methods for obtaining the mean and variance of a distribution.
Yes!! We thought about adding these to the interfaces but were concerned that in some cases they would be hard to implement (or impossible -- if they don't exist ;-) There is nothing preventing us from adding them to the implementations, however, or adding another interface or utilities to compute them. Does anyone see a reason that we need to add them now, before releasing 1.0?
4. Since the chi-squared and exponential distributions are just special cases
of the gamma distribution, there is no need to have separate implementation
classes for these. In my opinion, one should avoid having multiple implementations of the same distribution unless there is some strong reason for it.
Well, the chi-square implementation wraps a gamma instance. The exponential implementation computes the exponential density explicitly. In any case, the distributions are different (though related) and sufficiently useful in their own rights that exposing them separately will be easier for users.
5. There are quite a lot of elementary distributions missing.
I wrote an implementation of the Poisson distribution while testing the package and have attached the files for it.
Yes!! That is why the framework was designed to support adding distributions. Could you open a bugzilla ticket and attach the files there? I will review and get these in ASAP.
Thanks again for the feedback and contribution.
Phil
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
