Performance issue in FPGrowth
-----------------------------

                 Key: MAHOUT-890
                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
             Project: Mahout
          Issue Type: Bug
          Components: Frequent Itemset/Association Rule Mining
    Affects Versions: 0.6
            Reporter: tom pierce


I've encountered a dataset which indicates there is probably a performance bug 
lurking in the FPGrowth implementation.  This set may be a bit of an unusual 
target for FPG - there's a relatively modest number itemsets, and many items 
with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a 
similar dataset as core/src/test/resources/FPGsynth.dat.

FPGsynth.dat can take minutes or a few hours to process, depending on how it is 
grouped out to machines.  If run in sequential mode, or with "-g 50" it will 
take considerable time.  Most reducers/"anchor items" are processed quickly, 
but a small number take a handful of minutes, and one or two take a long time.  
If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 

Digging into this, I've found that the tree pruning code sometimes creates 
surprising trees.  One oddity I've observed is 0-count nodes, sometimes with 
non-zero children.  The other is that sometimes subtrees seem to get repeated.  
I'm attaching a sample input file (smallexample.dat, use the whitespace regex 
with this one, too) and a patch which adds some logging in pruneFPTree and 
growthBottomUp which will print out some interesting trees when run with the 
smallexample.dat input.



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to