[ 
https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162249#comment-13162249
 ] 

Robin Anil commented on MAHOUT-890:
-----------------------------------

As I see from the implementation, the mining is done exactly once. This is 
quite drastic a deviation from earlier implementation which was mining top K 
patterns for each level of the tree. I understand that the reason for creating 
a new implementation was due to the surprising trees that you found in the old 
version and also because you wanted to make the code more testable. There could 
be things that can be replaced in a layer. If the problem was in the 
core-fpgrowth, we can simply replace the new fptree and fpgrowth. part. 

Alternately, we can check the code in under fpgrowth2 and continue the rewrite 
and move the functionality over bit by bit.

Looking at the code, I see a lot of List<Integer> and Integer[] in the new 
code. In mahout we went away from the boxed versions due to performance 
reasons. There is an IntArray in Mahout Math which is used by the earlier code, 
I would suggest using that.


                
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>         Attachments: MAHOUT-890.patch, addSynth.patch, logtrees.patch, 
> simpleFPG.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance 
> bug lurking in the FPGrowth implementation.  This set may be a bit of an 
> unusual target for FPG - there's a relatively modest number itemsets, and 
> many items with a Zipfy distribution.  I am attaching a patch 
> (addSynth.patch) to add a similar dataset as 
> core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it 
> is grouped out to machines.  If run in sequential mode, or with "-g 50" it 
> will take considerable time.  Most reducers/"anchor items" are processed 
> quickly, but a small number take a handful of minutes, and one or two take a 
> long time.  If you experiment with this data, I suggest using  '-s 50 -regex 
> "[ ]+"'. 
> Digging into this, I've found that the tree pruning code sometimes creates 
> surprising trees.  One oddity I've observed is 0-count nodes, sometimes with 
> non-zero children.  The other is that sometimes subtrees seem to get 
> repeated.  I'm attaching a sample input file (smallexample.dat, use the 
> whitespace regex with this one, too) and a patch which adds some logging in 
> pruneFPTree and growthBottomUp which will print out some interesting trees 
> when run with the smallexample.dat input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to