[ https://issues.apache.org/jira/browse/HIVEMALL-280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16983834#comment-16983834 ]
Makoto Yui edited comment on HIVEMALL-280 at 11/27/19 7:08 PM: --------------------------------------------------------------- {noformat} supp(x=>y) = |xUy| / tx_total conf(x=>y) = supp(x=>y)/supp(x) = |xUy| / |x| lift(x=>y) = conf(x=>y)/supp(y) = (|xUy| / |x|) / (|y| / tx_total) "A, B, A_count, B_count, AB_count, tx_total" is required to compute support/confidense/lift for each A-B pair. SELECT A, B, A_B_count / tx_total as support, A_B_count / A_count AS confidence, (A_B_count / A_count) / (B_count / tx_total) AS lift FROM transaction ORDER BY 1, 2; Another approach is taking the following table as the input. Transaction table often takes the following form [1] [1] https://www.philippe-fournier-viger.com/spmf/Apriori.php txid items t1 {1, 3, 4} t2 {2, 3, 5} t3 {1, 2, 3, 5} t4 {2, 5} t5 {1, 2, 3, 5} WITH tmp1 as ( select arules(txid, items, '-min_supp 2 -topk 100 -order_by lift') as rules from transaction ), tmp2 as ( select explode(rules) rule from tmp1 ) select rule.item, rule.other, rule.supp, rule.conf, rule.lift from tmp2; explode items internally in UDAF process. a b t1 {1, 3}, t1 {1, 4} t1 {3, 4} t2 {2, 3} t2 {2, 5} t2 {3, 5} .... 1. compute tx_total by aggregation long 2. compute item_count (a_count,b_count) by aggregation map<item, long> 3. compute ab_count by aggregation map<pair<item,item>, long> 4. emit (need to limit returning list size by supp/conf/lift value) ordered_list<struct<item,item,supp,conf,lift>> {noformat} was (Author: myui): supp(x=>y) = |xUy| / tx_total conf(x=>y) = supp(x=>y)/supp(x) = |xUy| / |x| lift(x=>y) = conf(x=>y)/supp(y) = (|xUy| / |x|) / (|y| / tx_total) "A, B, A_count, B_count, AB_count, tx_total" is required to compute support/confidense/lift for each A-B pair. SELECT A, B, A_B_count / tx_total as support, A_B_count / A_count AS confidence, (A_B_count / A_count) / (B_count / tx_total) AS lift FROM transaction ORDER BY 1, 2; Another approach is taking the following table as the input. Transaction table often takes the following form [1] [1] https://www.philippe-fournier-viger.com/spmf/Apriori.php txid items t1 {1, 3, 4} t2 {2, 3, 5} t3 {1, 2, 3, 5} t4 {2, 5} t5 {1, 2, 3, 5} WITH tmp1 as ( select arules(txid, items, '-min_supp 2 -topk 100 -order_by lift') as rules from transaction ), tmp2 as ( select explode(rules) rule from tmp1 ) select rule.item, rule.other, rule.supp, rule.conf, rule.lift from tmp2; explode items internally in UDAF process. a b t1 {1, 3}, t1 {1, 4} t1 {3, 4} t2 {2, 3} t2 {2, 5} t2 {3, 5} .... 1. compute tx_total by aggregation long 2. compute item_count (a_count,b_count) by aggregation map<item, long> 3. compute ab_count by aggregation map<pair<item,item>, long> 4. emit (need to limit returning list size by supp/conf/lift value) ordered_list<struct<item,item,supp,conf,lift>> > Support lift/confidence/support UDF > ----------------------------------- > > Key: HIVEMALL-280 > URL: https://issues.apache.org/jira/browse/HIVEMALL-280 > Project: Hivemall > Issue Type: New Feature > Reporter: Makoto Yui > Assignee: Makoto Yui > Priority: Minor > > Support lift/confidence/support UDAF > [https://en.wikipedia.org/wiki/Lift_(data_mining]) > [https://towardsdatascience.com/a-gentle-introduction-on-market-basket-analysis-association-rules-fa4b986a40ce] > [https://medium.com/@samratjain/explained-market-basket-analysis-using-sql-a7434f30e649] > {code:java} > select > item, other_item, > lift(...) as lift, > confidence (....) as confidence > from > transaction > group by > 1, 2{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)