[ 
https://issues.apache.org/jira/browse/MAHOUT-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006239#comment-13006239
 ] 

Robin Anil commented on MAHOUT-625:
-----------------------------------

Right, I was testing Vipul's dataset (MAHOUT-617) and was seeing the same issue. 
Did the header table still have the node even after alpha pruning?

bq. I also noticed that the fpgrowth implementation can be optimized by not 
calculating patterns ending with given attributes multiple times. Depending on 
how many features patterns are generated for, the speedup can be huge: the more 
features included, the greater the speedup. For the mentioned test data, if all 
features were selected (i.e. we want to generate patterns for all items in 
transactions), pattern generation time dropped from 1h 15min to 8sec

This might be useful on a single node. For PFPGrowth, this used to cause issues 
with exact pattern counts. There is a lot of code here (:thumbs up:) for me to 
verify. Some issues:

1) The dataset needs to have a signed agreement before it can be included in 
the Mahout codebase (see the website). Can you add another test that reproduces 
the test case? See MAHOUT-617.
2) Again, for the comparison code, use a different dataset.
3) Can you split the optimization out of this into another patch? I want to 
test more before checking it in.
4) The bug fix of setting support = 0 may not save the extra memory such nodes 
take. It's good for now, until a permanent solution is found.
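
For illustration, the kind of check a replacement test could make (points 1 and 
2) might look like the sketch below. It is written in Python for brevity and is 
not Mahout code: the function names, the tiny hand-made transaction list, and 
the "reported" supports are all made up for this example. The idea is only to 
count the true support of a pattern by brute force and flag any pattern whose 
reported support is higher.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical, hand-made transactions (NOT the retail dataset).
TRANSACTIONS: List[List[int]] = [
    [36, 39, 41],
    [36, 39],
    [39, 41],
    [36, 39, 41, 48],
    [38, 39],
]


def brute_force_support(transactions: Iterable[Iterable[int]],
                        pattern: Iterable[int]) -> int:
    """Count the transactions that contain every item of `pattern`."""
    wanted: Set[int] = set(pattern)
    return sum(1 for t in transactions if wanted.issubset(t))


def overcounted(transactions: Iterable[Iterable[int]],
                reported: Dict[Tuple[int, ...], int]
                ) -> Dict[Tuple[int, ...], Tuple[int, int]]:
    """Map each pattern whose reported support exceeds the brute-force
    count to a (reported, actual) pair."""
    txs = [set(t) for t in transactions]  # materialize once for reuse
    result = {}
    for pattern, support in reported.items():
        actual = brute_force_support(txs, pattern)
        if support > actual:
            result[pattern] = (support, actual)
    return result


if __name__ == "__main__":
    # Pretend output of a pattern miner; the first support is one too high.
    reported = {(36, 39, 41): 3, (39,): 5}
    print(overcounted(TRANSACTIONS, reported))
```

In a real test the `reported` map would come from the FPGrowth output rather 
than being typed in by hand, and the transactions from a dataset that is free 
to include in the codebase.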



> Some of generated patterns have support higher than in reality
> --------------------------------------------------------------
>
>                 Key: MAHOUT-625
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-625
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.4
>            Reporter: Jaroslaw Odzga
>            Priority: Critical
>         Attachments: MAHOUT-625-patch.txt, mahout-test.zip
>
>
> It turns out that some of the generated patterns have incorrect support. The 
> returned support is slightly higher than the true one.
> I attached the test, which proves that FPGrowth has a bug. Test is using data 
> (retail) found here: http://fimi.ua.ac.be/data/
> The pattern (36, 39, 41) occurs in the transactions 572 times (this is also 
> calculated in test), but the FPGrowth returns pattern (36, 39, 41) with 
> support 573.
> Please note that the mentioned pattern is not the only one with incorrect 
> support - the test only points out one example to have something to focus on. 
> There are plenty more patterns with support higher than the real one. The 
> biggest difference I noticed was a support 8 higher than the real one for one 
> of the patterns.
> Please find attached failing unit test - it's actually a maven project, which 
> contains test data and is ready to run.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
