[ 
https://issues.apache.org/jira/browse/MAHOUT-629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yarco Hayduk updated MAHOUT-629:
--------------------------------

    Comment: was deleted

(was: Hi Jaroslaw,

I'm currently beginning to review your patch file. I'm not 100% through with it 
yet, but it seems that your patch does some redundant work.
CountDescendingPairComparator ensures the first element of the flist is the 
most frequent singleton item. Thus, this code is not needed:
------
long mostFrequentAttributeToReturn = -1;
        for (Integer att : requiredFeatures) {
                if (attributeFrequency[att] > mostFrequentAttributeToReturn){
                        mostFrequentAttributeToReturn = attributeFrequency[att];
                }
        }
------
Further, you can get "long[] attributeFrequency" from the top-most tree, hence 
you don't need to pass this attribute around. 

Will keep you posted if I find anything else)

> FP Growth performance improvement
> ---------------------------------
>
>                 Key: MAHOUT-629
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-629
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.4, 0.5
>            Reporter: Jaroslaw Odzga
>            Assignee: Robin Anil
>             Fix For: 0.6
>
>         Attachments: performance_improvement.txt
>
>
> Instead of calculating patterns ending with given attribute multiple times 
> they can be calculated just once. Depending on for how many features patterns 
> are generated, speedup can be huge. More feature included - greater speedup. 
> For test data set (88162 real life 'basket' transactions), if all features 
> were selected (i.e. we want to generate patterns for all items in 
> transactions), patterns generation time dropped from 1h 15min to 8sec. For 
> parallel fpgrowth, where the number of requested features is limited the 
> speedup is not that dramatic, but still noticeable. Basically work done is 
> always smaller than before the patch (as patterns for each item are 
> calculated at most once).
> The improvement is a variation of base algorithm in situation where we want 
> to generate patterns for only subset of items (let's call this subset A). 
> Given that items are ordered by descending frequency it is enough to 
> calculate only patterns ending on any item with frequency smaller or equal to 
> the most frequent item in the subset A. The heaps for each item are 
> initialized upfront and merged after processing every item.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to