[
https://issues.apache.org/jira/browse/MADLIB-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826667#comment-15826667
]
Frank McQuillan commented on MADLIB-1056:
-----------------------------------------
[~njayaram] made the following comments to me on this topic:
The item filtering for LHS/RHS can be made quite complex. The simplest
approach is a comma-separated list of exact item strings to include in the
LHS/RHS of a rule.
Complexity can be increased by supporting regex, and maybe also a "~" operator
to say we only want rules that do NOT have the items specified. The complexity
of the story will depend on these requirements, although the regex requirement
shouldn't change it a lot (I am not sure about the NOT operator; I didn't
explore it much since it was not part of the original requirement).
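To make the simplest variant concrete, here is a rough Python sketch of
filtering generated rules by a comma-separated list of exact item strings,
with an optional leading "~" for negation. All names (parse_item_filter,
side_matches, filter_rules) are hypothetical and not part of MADlib. The
matching semantics are also an assumption: a side passes if it contains at
least one listed item, or none of them when the list is prefixed with "~".
The filter acts on rules only, so transactions stay untouched and support and
confidence are not skewed.

```python
def parse_item_filter(spec):
    """Parse a comma-separated filter string into (negate, items).

    A leading "~" is read here as "exclude rules containing these items";
    that reading is an assumption, not a settled design.
    """
    if spec is None or spec.strip() == "":
        return False, frozenset()
    spec = spec.strip()
    negate = spec.startswith("~")
    if negate:
        spec = spec[1:]
    items = frozenset(s.strip() for s in spec.split(",") if s.strip())
    return negate, items


def side_matches(side, spec):
    """Return True if one side (LHS or RHS) of a rule passes the filter."""
    negate, items = parse_item_filter(spec)
    if not items:
        return True  # no filter given: accept everything
    hit = not items.isdisjoint(side)
    return not hit if negate else hit


def filter_rules(rules, lhs_spec=None, rhs_spec=None):
    """Keep only (lhs, rhs) rules whose sides pass their filters."""
    return [
        (lhs, rhs)
        for lhs, rhs in rules
        if side_matches(lhs, lhs_spec) and side_matches(rhs, rhs_spec)
    ]
```

For example, filter_rules(rules, lhs_spec="beer") would keep rules with "beer"
somewhere in the LHS, and lhs_spec="~beer" would keep only those without it.
Regex support would replace the exact-string test in side_matches.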
I couldn't identify obvious improvements to the SQL code that is already there.
The existing SQL code does do the apriori based frequent itemset generation,
contrary to my initial thoughts on it. An obvious suggestion would be to
rewrite the frequent itemset generation in C++ (the rule generation part is
already in C++). But I really cannot say whether that would truly outperform
the existing implementation, and even if it does, I am not sure it is worth the
effort (mainly because I don't know how much improvement we can actually gain
by using less SQL code).
There seems to be some room for improvement in the C++ code for rule
generation. I think the current code blindly spits out all possible rules
(permutations) given a frequent itemset, and then the rules are pruned out by
their confidence. But we could certainly do a more careful pruning there. A
faster way to do this is to construct new rules from a frequent itemset using
apriori again. This would require more effort.
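As a rough illustration of what "using apriori again" could look like, the
Python sketch below grows rule consequents (RHS) level by level for a single
frequent itemset, in the style of the textbook Apriori rule-generation step.
It is not the existing MADlib C++ code. The function name, the `support`
dictionary format (frozenset -> support count or fraction for every frequent
itemset), and the tuple output are all assumptions made for the example. The
pruning relies on the fact that if lhs -> rhs fails the confidence threshold,
then any rule from the same itemset with a larger RHS also fails, because its
LHS is smaller and therefore has support at least as large.

```python
from itertools import combinations


def rules_from_itemset(itemset, support, min_conf):
    """Generate rules from one frequent itemset, growing consequents level-wise.

    `support` maps frozenset -> support of every frequent itemset (hypothetical
    input format). Only consequents that passed min_conf are extended.
    """
    itemset = frozenset(itemset)
    rules = []
    consequents = [frozenset([i]) for i in itemset]
    while consequents:
        passed = []
        for rhs in consequents:
            lhs = itemset - rhs
            if not lhs:
                continue  # a rule needs a non-empty LHS
            conf = support[itemset] / support[lhs]
            if conf >= min_conf:
                rules.append((lhs, rhs, conf))
                passed.append(rhs)
        # Apriori-style join: merge passing consequents into ones a size larger,
        # keeping a candidate only if all of its sub-consequents passed.
        passed_set = set(passed)
        size = len(passed[0]) + 1 if passed else 0
        candidates = set()
        for a, b in combinations(passed, 2):
            c = a | b
            if len(c) == size and all(
                frozenset(s) in passed_set for s in combinations(c, size - 1)
            ):
                candidates.add(c)
        consequents = list(candidates)
    return rules
```

Compared with enumerating every LHS/RHS split and filtering by confidence
afterwards, this never evaluates a consequent whose sub-consequents already
failed, which is where the saving would come from on large itemsets.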
We must change the interface to support this feature.
> Add filtering options to Apriori to improve performance
> -------------------------------------------------------
>
> Key: MADLIB-1056
> URL: https://issues.apache.org/jira/browse/MADLIB-1056
> Project: Apache MADlib
> Issue Type: Improvement
> Components: Module: Association Rules
> Reporter: Frank McQuillan
> Fix For: v2.0
>
>
> Consider adding something like a WHERE clause for LHS and RHS in order to
> reduce execution time, while still keeping the filtered transactions
> available for support and confidence computation. (That is, you can't
> filter them out ahead of time because that would skew support and
> confidence values.)