[
https://issues.apache.org/jira/browse/MAHOUT-629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen updated MAHOUT-629:
-----------------------------
Affects Version/s: 0.5
Fix Version/s: 0.6
Assignee: Robin Anil
> FP Growth performance improvement
> ---------------------------------
>
> Key: MAHOUT-629
> URL: https://issues.apache.org/jira/browse/MAHOUT-629
> Project: Mahout
> Issue Type: Improvement
> Components: Frequent Itemset/Association Rule Mining
> Affects Versions: 0.4, 0.5
> Reporter: Jaroslaw Odzga
> Assignee: Robin Anil
> Fix For: 0.6
>
> Attachments: performance_improvement.txt
>
>
> Instead of calculating patterns ending with given attribute multiple times
> they can be calculated just once. Depending on for how many features patterns
> are generated, speedup can be huge. More feature included - greater speedup.
> For test data set (88162 real life 'basket' transactions), if all features
> were selected (i.e. we want to generate patterns for all items in
> transactions), patterns generation time dropped from 1h 15min to 8sec. For
> parallel fpgrowth, where the number of requested features is limited the
> speedup is not that dramatic, but still noticeable. Basically work done is
> always smaller than before the patch (as patterns for each item are
> calculated at most once).
> The improvement is a variation of base algorithm in situation where we want
> to generate patterns for only subset of items (let's call this subset A).
> Given that items are ordered by descending frequency it is enough to
> calculate only patterns ending on any item with frequency smaller or equal to
> the most frequent item in the subset A. The heaps for each item are
> initialized upfront and merged after processing every item.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira