[jira] [Commented] (MAHOUT-890) Performance issue in FPGrowth

tom pierce (Commented) (JIRA) Sun, 20 Nov 2011 08:10:16 -0800

    [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13153822#comment-13153822 ]


tom pierce commented on MAHOUT-890:
-----------------------------------

There's no fix patch attached (yet); it has been difficult for me to trace 
through this code.  It doesn't quite line up with my reading of the papers it's 
based on, and it is not always clear what assumptions and guarantees 
callers/callees are making.

I thought it would be helpful to show example cases that trigger the problem 
I'm having; maybe someone else who knows this code better could quickly see 
what is wrong.  I'll probably submit a patch in a day or two with a naive 
"straight from the paper" implementation.   

I love the suggestion to have a tests-long target or something similar 
(long-test-disable property?), especially if you can scope it to different 
packages.  


                
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>         Attachments: addSynth.patch, logtrees.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance 
> bug lurking in the FPGrowth implementation.  This set may be a bit of an 
> unusual target for FPG - there's a relatively modest number of itemsets, and 
> many items with a Zipfy distribution.  I am attaching a patch 
> (addSynth.patch) to add a similar dataset as 
> core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it 
> is grouped out to machines.  If run in sequential mode, or with "-g 50" it 
> will take considerable time.  Most reducers/"anchor items" are processed 
> quickly, but a small number take a handful of minutes, and one or two take a 
> long time.  If you experiment with this data, I suggest using
> '-s 50 -regex "[ ]+"'.
> Digging into this, I've found that the tree pruning code sometimes creates 
> surprising trees.  One oddity I've observed is 0-count nodes, sometimes with 
> non-zero children.  The other is that sometimes subtrees seem to get 
> repeated.  I'm attaching a sample input file (smallexample.dat, use the 
> whitespace regex with this one, too) and a patch which adds some logging in 
> pruneFPTree and growthBottomUp which will print out some interesting trees 
> when run with the smallexample.dat input.
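
One way to spot the first oddity described above (0-count nodes with non-zero children) is to check each node's count against the sum of its children's counts. In a standard FP-tree a node's count should never be smaller than that sum. The sketch below is only an illustration under that assumption: FpTreeCheck, Node, and violations are hypothetical names, and the Node class is a stand-in, not Mahout's own tree representation.

```java
import java.util.ArrayList;
import java.util.List;

public class FpTreeCheck {

    /** Hypothetical stand-in for a tree node; NOT Mahout's FPTree layout. */
    static final class Node {
        final String item;
        final long count;
        final List<Node> children = new ArrayList<>();

        Node(String item, long count) {
            this.item = item;
            this.count = count;
        }

        /** Adds a child and returns this node so trees can be built inline. */
        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /**
     * Returns one message per non-root node whose count is smaller than the
     * sum of its children's counts. The root is skipped because it is only
     * a sentinel in this sketch.
     */
    static List<String> violations(Node root) {
        List<String> out = new ArrayList<>();
        for (Node child : root.children) {
            walk(child, root.item, out);
        }
        return out;
    }

    private static void walk(Node node, String path, List<String> out) {
        String here = path + "/" + node.item;
        long childSum = 0;
        for (Node c : node.children) {
            childSum += c.count;
        }
        if (node.count < childSum) {
            out.add(here + " count=" + node.count + " childSum=" + childSum);
        }
        for (Node c : node.children) {
            walk(c, here, out);
        }
    }
}
```

Calling violations on a tree dumped after pruning would list the paths of any suspicious nodes. A 0-count node with a non-zero child shows up as a single entry naming that node.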

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
