Github user srowen commented on the issue:
https://github.com/apache/spark/pull/13440
Yeah I take your point that the trait Impurity already defines two methods,
only one of which is implemented for each of the subclasses. It's already a
funky design that probably should have been generalized differently. I think a
rewrite for Spark 3 would be worthwhile, personally. I'm also not quite sure of
the difference between the Impurity and ImpurityCalculator class; it seems like
Impurity should fold into ImpurityCalculator.
Is the single method we really want to define something like
`computeInformationGain(ImpurityCalculator, ImpurityCalculator)`? even the new
method you've added is not directly computing info gain, nor were the existing
ones in Impurity. But that's the thing we need and abstraction for over several
implementations, it seems.
Well, I think either this gets a bigger redesign in 3.0, or we try to get
it into 2.5 and accept some API changes. I think I lean towards a bolder
breaking change to fix it up in 3.0, unless there's a pressing need for this
metric.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]