Performance issue in FPGrowth
-----------------------------
Key: MAHOUT-890
URL: https://issues.apache.org/jira/browse/MAHOUT-890
Project: Mahout
Issue Type: Bug
Components: Frequent Itemset/Association Rule Mining
Affects Versions: 0.6
Reporter: tom pierce
I've encountered a dataset which indicates there is probably a performance bug
lurking in the FPGrowth implementation. This set may be a bit of an unusual
target for FPG - there's a relatively modest number itemsets, and many items
with a Zipfy distribution. I am attaching a patch (addSynth.patch) to add a
similar dataset as core/src/test/resources/FPGsynth.dat.
FPGsynth.dat can take minutes or a few hours to process, depending on how it is
grouped out to machines. If run in sequential mode, or with "-g 50" it will
take considerable time. Most reducers/"anchor items" are processed quickly,
but a small number take a handful of minutes, and one or two take a long time.
If you experiment with this data, I suggest using '-s 50 -regex "[ ]+"'.
Digging into this, I've found that the tree pruning code sometimes creates
surprising trees. One oddity I've observed is 0-count nodes, sometimes with
non-zero children. The other is that sometimes subtrees seem to get repeated.
I'm attaching a sample input file (smallexample.dat, use the whitespace regex
with this one, too) and a patch which adds some logging in pruneFPTree and
growthBottomUp which will print out some interesting trees when run with the
smallexample.dat input.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira