Re: [jira] [Updated] (MAHOUT-890) Performance issue in FPGrowth

Robin Anil Fri, 30 Dec 2011 14:07:36 -0800

Thanks Tom. The patch looks good; I will take an in-depth look after the new
year and commit it.


Happy 2012.

On Fri, Dec 30, 2011 at 3:40 PM, tom pierce (Updated) (JIRA) <
[email protected]> wrote:

>
>     [
> https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>
> tom pierce updated MAHOUT-890:
> ------------------------------
>
>     Attachment: MAHOUT-890-2.patch
>
> This patch (MAHOUT-890-2) adds the new implementation (under fpgrowth2)
> alongside the old with a minimal number of boxed primitives in the parallel
> version.  This patch depends on MAHOUT-920, MAHOUT-921 and MAHOUT-927.
>
> Tests are included that check the new implementation against known-good
> output and the existing implementation.
>
> The sequential implementation does not have a post-filter to create the
> complete per-item itemsets (though all the same itemsets are found - the
> new implementation returns de-duped sets).  Otherwise the results should be
> interchangeable.
>
> > Performance issue in FPGrowth
> > -----------------------------
> >
> >                 Key: MAHOUT-890
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
> >             Project: Mahout
> >          Issue Type: Bug
> >          Components: Frequent Itemset/Association Rule Mining
> >    Affects Versions: 0.6
> >            Reporter: tom pierce
> >         Attachments: MAHOUT-890-2.patch, MAHOUT-890.patch,
> addSynth.patch, logtrees.patch, simpleFPG.patch, smallexample.dat
> >
> >
> > I've encountered a dataset which indicates there is probably a
> performance bug lurking in the FPGrowth implementation.  This set may be a
> bit of an unusual target for FPG - there's a relatively modest number of
> itemsets, and many items with a Zipfy distribution.  I am attaching a patch
> (addSynth.patch) to add a similar dataset as
> core/src/test/resources/FPGsynth.dat.
> > FPGsynth.dat can take minutes or a few hours to process, depending on
> how it is grouped out to machines.  If run in sequential mode, or with "-g
> 50" it will take considerable time.  Most reducers/"anchor items" are
> processed quickly, but a small number take a handful of minutes, and one or
> two take a long time.  If you experiment with this data, I suggest using
>  '-s 50 -regex "[ ]+"'.
> > Digging into this, I've found that the tree pruning code sometimes
> creates surprising trees.  One oddity I've observed is 0-count nodes,
> sometimes with non-zero children.  The other is that sometimes subtrees
> seem to get repeated.  I'm attaching a sample input file (smallexample.dat,
> use the whitespace regex with this one, too) and a patch which adds some
> logging in pruneFPTree and growthBottomUp which will print out some
> interesting trees when run with the smallexample.dat input.
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA
> administrators:
> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>
>
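
The 0-count nodes with non-zero children described in the report can be
caught with a simple invariant check. In a standard FP-tree, a node's count
should be at least the sum of its children's counts, since every transaction
passing through a child also passes through its parent. The following is
only a minimal sketch under assumptions: Node and countViolations are
hypothetical names for illustration, not Mahout's FPTree data structure or
API, and a real tree's root may need to be skipped if it carries no count.

```java
import java.util.ArrayList;
import java.util.List;

public class FPTreeCheck {
  /** Hypothetical minimal FP-tree node (an assumption, not Mahout's FPTree). */
  static final class Node {
    final int item;
    final long count;
    final List<Node> children = new ArrayList<>();

    Node(int item, long count) {
      this.item = item;
      this.count = count;
    }

    /** Attaches a child and returns this node so calls can be chained. */
    Node add(Node child) {
      children.add(child);
      return this;
    }
  }

  /**
   * Counts the nodes in the subtree whose own count is smaller than the
   * sum of their children's counts. A well-formed tree yields 0.
   */
  static int countViolations(Node node) {
    long childSum = 0;
    int violations = 0;
    for (Node child : node.children) {
      childSum += child.count;
      violations += countViolations(child);
    }
    if (node.count < childSum) {
      violations++;
    }
    return violations;
  }

  public static void main(String[] args) {
    // Well-formed: parent count 3 covers children 2 + 1.
    Node ok = new Node(1, 3).add(new Node(2, 2)).add(new Node(3, 1));
    // Malformed: a 0-count node with a non-zero child.
    Node bad = new Node(1, 0).add(new Node(2, 2));
    System.out.println(countViolations(ok));
    System.out.println(countViolations(bad));
  }
}
```

A check like this could be run after each pruning step while debugging, to
find the point at which a malformed tree first appears.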
