[ 
https://issues.apache.org/jira/browse/MADLIB-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826667#comment-15826667
 ] 

Frank McQuillan commented on MADLIB-1056:
-----------------------------------------

[~njayaram] made the following comments to me on this topic:

Item filtering for the LHS/RHS can be made quite complex. The simplest 
approach is a comma-separated list of exact item strings to include in the 
LHS/RHS of a rule.

Complexity can be increased by supporting regex, and maybe also a "~" operator 
to say we only want rules that do NOT contain the specified items. The 
complexity of the story will depend on these requirements, although the regex 
requirement shouldn't change it much (I'm not sure about the NOT operator; I 
didn't explore it much since it was not part of the original requirement).
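
To make the filtering options concrete, here is a minimal Python sketch of how 
such a filter could be applied to one side of a rule. Everything in it is an 
assumption for illustration, not the MADlib interface: the names parse_filter 
and side_matches are hypothetical, a leading "~" is taken to mean "must not 
appear", every listed item is assumed to be required (rather than any one of 
them), and regex patterns are assumed to contain no commas.

```python
import re


def parse_filter(spec):
    """Split a comma-separated filter spec into (include, exclude) lists.

    A leading "~" marks an item that must NOT appear (hypothetical syntax).
    """
    include, exclude = [], []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        if token.startswith("~"):
            exclude.append(token[1:].strip())
        else:
            include.append(token)
    return include, exclude


def side_matches(items, spec, use_regex=False):
    """Return True if one side of a rule (a tuple of item strings) passes."""
    include, exclude = parse_filter(spec)

    def hit(pattern):
        if use_regex:
            # Each pattern must match a whole item string.
            return any(re.fullmatch(pattern, item) for item in items)
        return pattern in items

    # Assumption: all included items are required, no excluded item is allowed.
    return all(hit(p) for p in include) and not any(hit(p) for p in exclude)


# Rules are (lhs, rhs) pairs of item tuples.
rules = [
    (("beer",), ("diapers",)),
    (("beer", "chips"), ("salsa",)),
    (("milk",), ("bread",)),
]
kept = [r for r in rules
        if side_matches(r[0], "beer") and side_matches(r[1], "~salsa")]
```

Here kept contains only the beer => diapers rule. Note that the filter is 
applied to rules, not to the input transactions.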

I couldn't identify obvious improvements to the existing SQL code. Contrary to 
my initial thoughts, it does do Apriori-based frequent itemset generation. An 
obvious suggestion would be to rewrite frequent itemset generation in C++ (the 
rule generation part is already in C++). But I really cannot say whether that 
would outperform the existing implementation, and even if it does, I am not 
sure it is worth the effort, mainly because I don't know how much improvement 
we would actually gain by using less SQL code.

There seems to be some room for improvement in the C++ code for rule 
generation. I think the current code blindly emits all possible rules 
(permutations) from a frequent itemset and then prunes them by confidence. We 
could certainly prune more carefully there. A faster way is to construct new 
rules from a frequent itemset by applying Apriori again, which would require 
more effort.
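
As a rough illustration of the "apriori again" idea, the Python sketch below 
grows rule consequents level by level and only extends consequents whose rules 
met the confidence threshold. It relies on the fact that, within one frequent 
itemset, moving items from the LHS to the RHS can only lower confidence. This 
is not the existing C++ code; the function name generate_rules and the support 
dictionary layout are assumptions made for the example.

```python
from itertools import combinations


def generate_rules(itemset, support, min_confidence):
    """Generate rules lhs -> rhs from one frequent itemset.

    support maps frozenset -> support (or count) for the itemset and all of
    its non-empty subsets, assumed available from the itemset step.
    """
    itemset = frozenset(itemset)
    rules = []
    # Level 1: every single item is a candidate consequent.
    consequents = [frozenset([item]) for item in itemset]
    while consequents:
        k = len(consequents[0])
        surviving = []
        for rhs in consequents:
            lhs = itemset - rhs
            if not lhs:
                continue  # a rule needs a non-empty left-hand side
            conf = support[itemset] / support[lhs]
            if conf >= min_confidence:
                rules.append((lhs, rhs, conf))
                surviving.append(rhs)
        # Build size k+1 consequents only from surviving size-k ones, and
        # require every size-k subset of a candidate to have survived.
        surviving_set = set(surviving)
        next_level = set()
        for a, b in combinations(surviving, 2):
            cand = a | b
            if len(cand) == k + 1 and all(
                frozenset(sub) in surviving_set
                for sub in combinations(cand, k)
            ):
                next_level.add(cand)
        consequents = list(next_level)
    return rules


# Made-up transaction counts for {a, b, c} and its subsets.
counts = {
    frozenset("abc"): 2,
    frozenset("ab"): 2, frozenset("ac"): 3, frozenset("bc"): 2,
    frozenset("a"): 4, frozenset("b"): 2, frozenset("c"): 3,
}
found = generate_rules("abc", counts, min_confidence=0.7)
```

With these counts, the rule ac => b fails the threshold, so no consequent 
containing b is ever generated, and only three of the six possible rules are 
evaluated beyond the first level's candidates.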

We must change the interface to support this feature.

> Add filtering options to Apriori to improve performance
> -------------------------------------------------------
>
>                 Key: MADLIB-1056
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1056
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Association Rules
>            Reporter: Frank McQuillan
>             Fix For: v2.0
>
>
> Consider adding something like a WHERE clause for LHS and RHS in order to 
> reduce execution time. The filtered transactions must still be counted when 
> computing support and confidence. (That is, you can't filter them out ahead 
> of time because that would skew the support and confidence values.)
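
A small worked example of the point in the description above, as a Python 
sketch on made-up data (the helper names support and confidence are 
hypothetical): dropping transactions before counting changes the measures, so 
a filter has to act on the rules rather than on the input.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, lhs, rhs):
    """support(lhs and rhs together) divided by support(lhs)."""
    both = set(lhs) | set(rhs)
    return support(transactions, both) / support(transactions, lhs)


transactions = [
    {"beer", "chips"},
    {"beer", "chips"},
    {"beer"},
    {"milk"},
    {"milk", "bread"},
]

# Measures for beer => chips over ALL transactions.
full_support = support(transactions, {"beer", "chips"})     # 2 of 5
full_conf = confidence(transactions, {"beer"}, {"chips"})   # about 0.67

# Pre-filtering on the LHS item inflates support (2 of 3 instead of 2 of 5).
beer_only = [t for t in transactions if "beer" in t]
skewed_support = support(beer_only, {"beer", "chips"})

# Pre-filtering on the RHS item inflates confidence (1.0 instead of about 0.67).
chips_only = [t for t in transactions if "chips" in t]
skewed_conf = confidence(chips_only, {"beer"}, {"chips"})
```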



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
