Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/22236#discussion_r212940988
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala ---
@@ -61,6 +61,18 @@ class AssociationRules private[fpm] (
*/
@Since("1.5.0")
def run[Item: ClassTag](freqItemsets: RDD[FreqItemset[Item]]): RDD[Rule[Item]] = {
+ run(freqItemsets, Map.empty[Item, Long])
+ }
+
+ /**
+ * Computes the association rules with confidence above `minConfidence`.
+ * @param freqItemsets frequent itemset model obtained from [[FPGrowth]]
+ * @return a `Set[Rule[Item]]` containing the association rules. The rules
+ * will also be able to compute the lift metric.
+ */
+ @Since("2.4.0")
+ def run[Item: ClassTag](freqItemsets: RDD[FreqItemset[Item]],
+ itemSupport: Map[Item, Long]): RDD[Rule[Item]] = {
--- End diff ---
Actually, we can compute it by filtering `freqItemsets` and keeping only the
itemsets of length one. The reason I haven't done that is to avoid a
performance regression: since we have already computed these counts earlier,
it seems an unneeded waste to recompute them here.
I agree that I could use this approach when loading the model, though. In
that case, too, I avoided it for performance reasons, since we would need to
read the `freqItemsets` dataset (which is surely much larger) twice.
If you think this is needed, I can change it, but for performance reasons I
prefer the current approach, so that existing users who are not interested in
the `lift` metric are unaffected. I don't think any compatibility issue can
arise: if the value is not present, null is returned for the lift metric.
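For context (my wording, not the PR's): lift measures how much more often the antecedent and consequent co-occur than would be expected if they were independent, which is why the per-item support counts discussed above are needed at all. With `N` total transactions:

```latex
\mathrm{lift}(X \Rightarrow Y)
  = \frac{\mathrm{confidence}(X \Rightarrow Y)}{P(Y)}
  = \frac{\mathrm{freq}(X \cup Y)\,/\,\mathrm{freq}(X)}{\mathrm{freq}(Y)\,/\,N}
```

The denominator `freq(Y) / N` is exactly the single-item support that the `itemSupport` map supplies; when it is absent, the lift cannot be computed, hence the null return mentioned above.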
---