[
https://issues.apache.org/jira/browse/MAHOUT-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006196#comment-13006196
]
Jaroslaw Odzga commented on MAHOUT-625:
---------------------------------------
Hi,
After a bit of analysis I found the problem. It turned out that the
alpha-pruning method had a bug - it was leaving the node attached to it's
parent after pruning. This was causing issues in situation, when tree was
pruned multiple times and unattached node landed in header table (this is rare
for small data, but pretty common for fairly big number of transactions). In
such case, the support of unattached node was counted even though it shouldn't.
I fixed the problem by simply setting the support of such node to 0. Since it
can not have any children it doesn't add to computation time and it is simpler
solution that trying to unattach it from it's parent (parent has an arbitrary
ordered list of children).
I added a comprehensive test, which uses transactions from the retail data from
http://fimi.ua.ac.be/data/ (88162 transactions) and compares the results with
the results generated by implementation from
http://www.borgelt.net/fpgrowth.html. I added tests for both the serial and
parallel (map-reduce) fpgrowth.
I also noticed that fpgrowth implementation can be optimized by not calculating
patterns ending with given attributes multiple times. Depending on for how many
features patterns are generated, speedup can be huge. More feature included -
greater speedup. For mentioned test data, if all features were selected (i.e.
we want to generate patterns for all items in transactions), patterns
generation time dropped from 1h 15min to 8sec. For parallel fpgrowth, where the
number of requested features is limited the speedup is not that dramatic, but
still very high.
I attached patch (for version 0.5-SNAPSHOT) with all changes (bug fix,
comprehensive tests (serial and parallel) and optimization). I hope it'll find
it's way to the trunk :)
> Some of generated patterns have support higher than in reality
> --------------------------------------------------------------
>
> Key: MAHOUT-625
> URL: https://issues.apache.org/jira/browse/MAHOUT-625
> Project: Mahout
> Issue Type: Bug
> Components: Frequent Itemset/Association Rule Mining
> Affects Versions: 0.4
> Reporter: Jaroslaw Odzga
> Priority: Critical
> Attachments: MAHOUT-625-patch.txt, mahout-test.zip
>
>
> It turnes out that some of generated patterns have incorrect support. The
> returned support is slightly higher than the true one.
> I attached the test, which proves that FPGrowth has a bug. Test is using data
> (retail) found here: http://fimi.ua.ac.be/data/
> The pattern (36, 39, 41) occurs in the transactions 572 times (this is also
> calculated in test), but the FPGrowth returns pattern (36, 39, 41) with
> support 573.
> Please note that mentioned pattern is not the only one with incorrect support
> - the test only point out one example to hace something to focus on. There is
> plenty more patterns with support higher than the real one. The biggest
> difference I noticed was support 8 higher than the real one for one of
> patterns.
> Please find attached failing unit test - it's actually a maven project, which
> contains test data and is ready to run.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira