Thanks Tom. The patch looks good. I will take an in-depth look after the new year and commit it.
Happy 2012.

On Fri, Dec 30, 2011 at 3:40 PM, tom pierce (Updated) (JIRA) <[email protected]> wrote:

> [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> tom pierce updated MAHOUT-890:
> ------------------------------
>
>     Attachment: MAHOUT-890-2.patch
>
> This patch (MAHOUT-890-2) adds the new implementation (under fpgrowth2)
> alongside the old with a minimal number of boxed primitives in the parallel
> version. This patch depends on MAHOUT-920, MAHOUT-921 and MAHOUT-927.
>
> Tests are included that check the new implementation against known-good
> output and the existing implementation.
>
> The sequential implementation does not have a post-filter to create the
> complete per-item itemsets (though all the same itemsets are found - the
> new implementation returns de-duped sets). Otherwise the results should be
> interchangeable.
>
> > Performance issue in FPGrowth
> > -----------------------------
> >
> >                 Key: MAHOUT-890
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
> >             Project: Mahout
> >          Issue Type: Bug
> >          Components: Frequent Itemset/Association Rule Mining
> >    Affects Versions: 0.6
> >            Reporter: tom pierce
> >         Attachments: MAHOUT-890-2.patch, MAHOUT-890.patch,
> >                      addSynth.patch, logtrees.patch, simpleFPG.patch,
> >                      smallexample.dat
> >
> >
> > I've encountered a dataset which indicates there is probably a
> > performance bug lurking in the FPGrowth implementation. This set may be a
> > bit of an unusual target for FPG - there's a relatively modest number
> > itemsets, and many items with a Zipfy distribution. I am attaching a patch
> > (addSynth.patch) to add a similar dataset as
> > core/src/test/resources/FPGsynth.dat.
> >
> > FPGsynth.dat can take minutes or a few hours to process, depending on
> > how it is grouped out to machines. If run in sequential mode, or with "-g
> > 50" it will take considerable time. Most reducers/"anchor items" are
> > processed quickly, but a small number take a handful of minutes, and one or
> > two take a long time. If you experiment with this data, I suggest using
> > '-s 50 -regex "[ ]+"'.
> >
> > Digging into this, I've found that the tree pruning code sometimes
> > creates surprising trees. One oddity I've observed is 0-count nodes,
> > sometimes with non-zero children. The other is that sometimes subtrees
> > seem to get repeated. I'm attaching a sample input file (smallexample.dat,
> > use the whitespace regex with this one, too) and a patch which adds some
> > logging in pruneFPTree and growthBottomUp which will print out some
> > interesting trees when run with the smallexample.dat input.
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA
> administrators:
> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
> For more information on JIRA, see: http://www.atlassian.com/software/jira
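For anyone reproducing the "0-count nodes with non-zero children" oddity described in the quoted report, here is a rough sketch of a sanity check that could be run over a dumped tree. It relies on the standard FP-tree property that every transaction counted at a child also passes through its parent, so a node's count should never be smaller than the sum of its children's counts. The Node class and the findViolations method are hypothetical and written only for illustration; they are not Mahout's FPTree API and are not taken from any of the attached patches.

```java
import java.util.ArrayList;
import java.util.List;

public class FpTreeInvariantCheck {

    /** Hypothetical minimal FP-tree node; NOT Mahout's internal representation. */
    static final class Node {
        final String item;
        final long count;
        final List<Node> children = new ArrayList<>();

        Node(String item, long count) {
            this.item = item;
            this.count = count;
        }

        /** Attaches a child and returns this node so trees can be built fluently. */
        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /**
     * Returns one description per node whose children's counts sum to more
     * than the node's own count. The root is treated as a count-less sentinel
     * and is not checked itself.
     */
    static List<String> findViolations(Node root) {
        List<String> violations = new ArrayList<>();
        for (Node child : root.children) {
            check(child, violations);
        }
        return violations;
    }

    private static void check(Node node, List<String> violations) {
        long childSum = 0;
        for (Node child : node.children) {
            childSum += child.count;
            check(child, violations);
        }
        if (childSum > node.count) {
            violations.add(node.item + ": count=" + node.count
                + ", children sum=" + childSum);
        }
    }

    public static void main(String[] args) {
        // "a" is consistent (5 >= 3 + 2); "d" is a 0-count node with a non-zero child.
        Node root = new Node("root", 0)
            .add(new Node("a", 5).add(new Node("b", 3)).add(new Node("c", 2)))
            .add(new Node("d", 0).add(new Node("e", 4)));
        findViolations(root).forEach(System.out::println);
    }
}
```

A check like this only flags the count inconsistency. It does not detect the repeated subtrees mentioned in the report, which would need a separate comparison of sibling subtrees.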
