[
https://issues.apache.org/jira/browse/MAHOUT-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006258#comment-13006258
]
Jaroslaw Odzga commented on MAHOUT-625:
---------------------------------------
I attached an isolated patch for the bug fix; it's a one-liner. The rest is
optimization.
Answering your questions:
1) For certain input data, a node that had been removed still ended up in the
header table. This is why some of the generated patterns have inflated support.
2) The author of the dataset writes:
The data are provided ’as is’. Basically, any use of the data is allowed as
long as the proper
acknowledgment is provided and a copy of the work is provided to Tom Brijs (see
details below).
I think the author mainly has scientific papers in mind when he mentions a
"copy of the work". Would it be enough to email the author and simply ask
whether the dataset can be used in Mahout?
3) I don't see how we could save memory, since the data is in a preallocated
array. Removing a node only means detaching it from its parent. That could be
done, but I don't think the benefit is worth the additional effort (currently
the parent keeps an unordered list of children).
4) As for the performance improvement, as I said, it is dramatic when the
number of requested features is high (as in the single-node scenario, or with
very big groups in the parallel scenario), and it is still noticeable even with
a small number of features. Basically, the work done is always less than before
the patch, as patterns for each item are calculated at most once. Obviously, in
the parallel scenario with small groups, the performance boost will not be as
large.
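To illustrate the "calculated at most once" idea from point 4, here is a minimal caching sketch. It is not the attached patch and does not use Mahout's classes: `PatternCache`, `minePatterns` and `patternsFor` are hypothetical names, and the mining step is a placeholder that only counts how often it runs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PatternCache {
    // Results already computed, keyed by item id.
    private final Map<Integer, List<String>> cache = new HashMap<>();
    // Number of times the expensive step actually ran.
    private int computations = 0;

    // Placeholder for the expensive per-item mining step (hypothetical).
    private List<String> minePatterns(int item) {
        computations++;
        List<String> patterns = new ArrayList<>();
        patterns.add("pattern-for-" + item);
        return patterns;
    }

    // Returns the cached result if present, otherwise computes and stores it.
    public List<String> patternsFor(int item) {
        return cache.computeIfAbsent(item, this::minePatterns);
    }

    public int computations() {
        return computations;
    }

    public static void main(String[] args) {
        PatternCache c = new PatternCache();
        c.patternsFor(39);
        c.patternsFor(39); // served from the cache
        c.patternsFor(41);
        System.out.println(c.computations()); // prints 2
    }
}
```

The more often the same item is requested, the more repeated work the cache avoids, which is consistent with the gain being largest when many features are requested.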
If you notice any issues with it, please let me know.
> Some of generated patterns have support higher than in reality
> --------------------------------------------------------------
>
> Key: MAHOUT-625
> URL: https://issues.apache.org/jira/browse/MAHOUT-625
> Project: Mahout
> Issue Type: Bug
> Components: Frequent Itemset/Association Rule Mining
> Affects Versions: 0.4
> Reporter: Jaroslaw Odzga
> Priority: Critical
> Attachments: MAHOUT-625-patch.txt, bugfix-patch.txt, mahout-test.zip
>
>
> It turns out that some of the generated patterns have incorrect support. The
> returned support is slightly higher than the true one.
> I attached a test which demonstrates that FPGrowth has a bug. The test uses
> the retail data found here: http://fimi.ua.ac.be/data/
> The pattern (36, 39, 41) occurs in the transactions 572 times (this is also
> calculated in the test), but FPGrowth returns the pattern (36, 39, 41) with
> support 573.
> Please note that the mentioned pattern is not the only one with incorrect
> support; the test only points out one example to have something to focus on.
> There are plenty more patterns with support higher than the real one. The
> biggest difference I noticed was a support 8 higher than the real one for one
> of the patterns.
> Please find attached failing unit test - it's actually a maven project, which
> contains test data and is ready to run.
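The reference count mentioned in the description (how often a pattern really occurs in the transactions) can be obtained by a direct scan. The sketch below is not the attached test: `SupportCheck`, `support` and `setOf` are hypothetical names, and the transactions are toy data rather than the retail dataset.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SupportCheck {

    // Small helper to build a set of item ids.
    public static Set<Integer> setOf(Integer... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // Counts the transactions that contain every item of the pattern.
    public static int support(List<Set<Integer>> transactions, Set<Integer> pattern) {
        int count = 0;
        for (Set<Integer> transaction : transactions) {
            if (transaction.containsAll(pattern)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Toy transactions, not the retail data.
        List<Set<Integer>> transactions = Arrays.asList(
            setOf(36, 39, 41, 48),
            setOf(36, 39),
            setOf(39, 41, 36),
            setOf(41, 48));
        System.out.println(support(transactions, setOf(36, 39, 41))); // prints 2
    }
}
```

Comparing this count with the support reported by a miner for the same pattern exposes the kind of discrepancy described above (572 versus 573).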