Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/22236#discussion_r212833878
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala ---
@@ -61,6 +61,18 @@ class AssociationRules private[fpm] (
*/
@Since("1.5.0")
def run[Item: ClassTag](freqItemsets: RDD[FreqItemset[Item]]): RDD[Rule[Item]] = {
+ run(freqItemsets, Map.empty[Item, Long])
+ }
+
+ /**
+ * Computes the association rules with confidence above `minConfidence`.
+ * @param freqItemsets frequent itemset model obtained from [[FPGrowth]]
+ * @return a `Set[Rule[Item]]` containing the association rules. The rules will be able to
+ * compute also the lift metric.
+ */
+ @Since("2.4.0")
+ def run[Item: ClassTag](freqItemsets: RDD[FreqItemset[Item]],
+ itemSupport: Map[Item, Long]): RDD[Rule[Item]] = {
--- End diff ---
So if I understand this correctly, and I may not, FPGrowthModel just holds
frequent itemsets. It's only in association rules that the lift computation is
needed. In the course of computing association rules, you can compute item
support here. Why does it need to be saved with the model? I can see it might
be an optimization, but it also introduces complexity (and compatibility issues?)
here. It may be pretty fast to compute right here anyway. You already end up
with `(..., (consequent, count))` in candidates, from which you can get the
total consequent counts directly.
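
For illustration only, here is a minimal sketch of what computing the consequent
counts inside `run` could look like, instead of persisting an `itemSupport` map
with the model. Lift of a rule X => Y is its confidence divided by the support of
Y, so the extra piece of information each rule needs is freq(consequent) (plus the
total transaction count for normalization, which this sketch leaves aside). The
object and helper names below are hypothetical; the only assumption about the
existing API is that 1-item frequent itemsets carry their own counts, which holds
by downward closure.

```scala
import scala.reflect.ClassTag

import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset
import org.apache.spark.rdd.RDD

// Hypothetical sketch: derive each item's count from the 1-item frequent itemsets
// already present in the FP-growth output, rather than passing itemSupport in.
object LiftSketch {
  def consequentSupport[Item: ClassTag](
      freqItemsets: RDD[FreqItemset[Item]]): RDD[(Item, Long)] = {
    freqItemsets
      .filter(_.items.length == 1)                         // every frequent item appears here
      .map(itemset => (itemset.items.head, itemset.freq))  // (item, freq(item))
  }

  // Wiring idea (shapes assumed from AssociationRules.run): key candidate rules by
  // their single-item consequent and join to attach freq(consequent) for lift.
  //
  // candidates: RDD[(Seq[Item], (Seq[Item], Long))]  // (antecedent, (consequent, freq(union)))
  // candidates
  //   .map { case (antecedent, (consequent, freqUnion)) =>
  //     (consequent.head, (antecedent, consequent, freqUnion))
  //   }
  //   .join(consequentSupport(freqItemsets))
}
```

Something along these lines would keep the extra work inside AssociationRules, at
the cost of one more join per run, rather than adding state (and save/load
compatibility concerns) to FPGrowthModel.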
---