[
https://issues.apache.org/jira/browse/MAHOUT-617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vipul Pandey updated MAHOUT-617:
--------------------------------
Description:
FPGrowth reports the support of each itemset individually, rather than counting
every transaction that contains it. For example, if item X appears
"individually" 12 times and appears with item Y 10 times (a total of 22 times),
AND item Y appears "individually" 4 times (a total of 14 times), then this is
what the output will be (say, for min-support 2):
12 X
10 XY
4 Y
Instead of
22 X
10 XY
14 Y
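The expected numbers above follow the usual definition of support: the number
of transactions that contain the itemset. A minimal brute-force sketch that
reproduces them is shown here. It is not Mahout's FPGrowth implementation; the
`itemset_supports` helper and the transaction list are made up for
illustration.

```python
from collections import Counter
from itertools import combinations

# Transactions from the example: {X} 12 times, {X, Y} 10 times, {Y} 4 times.
transactions = [("X",)] * 12 + [("X", "Y")] * 10 + [("Y",)] * 4


def itemset_supports(transactions):
    """Count, for every non-empty itemset, the transactions that contain it."""
    support = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        # Every non-empty subset of a transaction is contained in it.
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                support[subset] += 1
    return support


supports = itemset_supports(transactions)
for itemset, count in sorted(supports.items()):
    print(count, " ".join(itemset))
```

Enumerating all subsets is exponential in the transaction length, so this is
only useful as a cross-check on small inputs like the one described here.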
Also, because of this, if the minimum support is 5 then the output will look
like:
12 X
10 X Y
thus ignoring Y entirely.
If the minimum support is 11 then the output will look like:
12 X
again ignoring Y.
If the minimum support is 13 then there will be NO output, even though X's
support was 22 and Y's was 14 all along.
Even if we want to show just the maximal itemsets (although I would like to see
ALL the frequent itemsets, maximal or not), this output is wrong: with a
support of 13 we should still have seen X (22) and Y (14).
Now say you add XYZ 11 times.
For support 1 you'd see:
12 X
10 X Y
11 X Y Z
4 Y
And for support 11 you'd see:
12 X
11 X Y Z
although I'd expect the output (for both s=1 and s=11) to be:
33 X
25 Y
21 XY
11 Z
11 XZ
11 YZ
11 XYZ
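As a cross-check of the expected list, here is a small sketch that counts
containing transactions and then applies the minimum-support threshold. The
`frequent_itemsets` function and the `(items, repeat_count)` input shape are
assumptions made for this example; it is a brute-force enumeration, not
FP-Growth and not Mahout's input format.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(baskets, min_support):
    """Return {itemset: support} for itemsets with support >= min_support.

    `baskets` is a list of (items, repeat_count) pairs (hypothetical shape).
    """
    support = Counter()
    for items, repeat_count in baskets:
        unique_items = sorted(set(items))
        # Credit every non-empty subset with each repetition of the basket.
        for size in range(1, len(unique_items) + 1):
            for subset in combinations(unique_items, size):
                support[subset] += repeat_count
    return {s: n for s, n in support.items() if n >= min_support}


# X alone 12 times, XY 10 times, Y alone 4 times, XYZ 11 times.
baskets = [(("X",), 12), (("X", "Y"), 10), (("Y",), 4), (("X", "Y", "Z"), 11)]

for min_support in (1, 11):
    result = frequent_itemsets(baskets, min_support)
    print("min support", min_support)
    for itemset, count in sorted(result.items(), key=lambda kv: (-kv[1], kv[0])):
        print(count, " ".join(itemset))
```

Every itemset in this data has support of at least 11, which is why both
thresholds should give the same seven itemsets.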
The sample inputs are attached.
was:
PFPGrowth is giving wrong results with my data. Attached are:
- The input data
- The output (sequence file) generated by FPGrowth (PFPGrowth gives the same
results)
- Output as text
$ cat part-r-00000 | grep 1678807047
12 1678807047
38 1678807047 3159925415
which says that the support (12) for the item (1678807047) is less than the
support (38) of a pair containing that item.
Another example:
$ cat part-r-00000 | grep 1441690161
12 1441690161 3910019844
18 1604285941 1441690161 3910019844
75 1441690161
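Both examples break the rule that an itemset's support can never be lower than
the support of any superset. A rough checker for this is sketched below,
assuming each text-output line has the form `<support> <item> <item> ...` as in
the grep results above; `parse_line` and `find_violations` are hypothetical
helper names, not part of Mahout.

```python
def parse_line(line):
    """Split '<support> <item> <item> ...' into (frozenset(items), support)."""
    fields = line.split()
    return frozenset(fields[1:]), int(fields[0])


def find_violations(lines):
    """Return (subset, support, superset, support) tuples where a proper
    subset is reported with lower support than one of its supersets."""
    patterns = [parse_line(line) for line in lines if line.strip()]
    violations = []
    for subset, subset_support in patterns:
        for superset, superset_support in patterns:
            if subset < superset and subset_support < superset_support:
                violations.append(
                    (sorted(subset), subset_support,
                     sorted(superset), superset_support)
                )
    return violations


# Lines copied from the two grep results in this report.
reported = [
    "12 1678807047",
    "38 1678807047 3159925415",
    "12 1441690161 3910019844",
    "18 1604285941 1441690161 3910019844",
    "75 1441690161",
]

for subset, s1, superset, s2 in find_violations(reported):
    print(" ".join(subset), "has support", s1,
          "but superset", " ".join(superset), "has support", s2)
```

The pairwise comparison is quadratic in the number of output lines, which is
fine for spot checks on grep output but not for a full pattern file.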
Runtime parameters:
-i baskets/part-r-00000 -o patterns -k 50 -method sequential -g 10 -regex
'[\t]' -s 10
NOTE: Unable to attach files to JIRA. Here's the bundle of files (Input,
SequenceOutput & TextOutput): https://files.me.com/vpandey/glsovt
> FPGrowth/PFPGrowth giving out wrong results.
> ---------------------------------------------
>
> Key: MAHOUT-617
> URL: https://issues.apache.org/jira/browse/MAHOUT-617
> Project: Mahout
> Issue Type: Bug
> Components: Frequent Itemset/Association Rule Mining
> Affects Versions: 0.4
> Environment: Mac OS X, Linux
> Reporter: Vipul Pandey
> Assignee: Robin Anil
> Labels: AssociationMining, FPGrowth, FrequentItemsets
> Attachments: XYZ
>
>
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira