Hello,

I would like to apply for Mahout GSoC 2010. My proposal is to implement
Association Mining algorithm utilizing existing PFPGrowth implementation (
http://cwiki.apache.org/MAHOUT/parallelfrequentpatternmining.html).

As for the Assoiciation Mining I would like to implement a very general
algorithm based on old and established method called GUHA [1]. Short and
informative description of GUHA method can be found here [2]. Very
exhaustive theoretical foundation can be found here [3]. Since the GUHA
method has been developing from 60's and is very rich now and given the
limited time during GSoC program it would be wise to reduce scope of initial
implementation.

There are two main topics that I would like to focus on during GSoC:

1) Enhancing and generalization of PFPGworth:
Frequent pattern mining is usually part of each association maining task. In
Mahout there is existing implementation of PFPGrowth algorithm. I would like
to utilize this implementaion but it would be necessary to enhance or
generalize it first. The goal here would be to keep current functionality
and performance of PFPGrowth but allow more generalized input/output data
and conditions. We will need to get frequencies of very rare patterns thus
if it will show up that mining only top K is limiting factor then we would
need to allow relaxation of this approach. Also we will need to take account
on negative features (those missing in individual transaction). Negative
features can be either directly supported by implementation of FP algorithm
or it can be coded at the feature level (decision about what is better
approach will be part of the GSoC program). It should be noted that for the
GSoC program we will narrow scope of association mining antecedents and
succedents to conjunctions of data features only.

2) API for custom association mining functions based on 4ft table:
Association mining in GUHA sense means testing hypothesis on four-fold table
(4ft) [see 2. item 5]. There has been proposed a lot of association
functions for GUHA, some of them are based on statistical tests (for example
Fischer test, Chi-square test), some are based on frequent analysis while
not explicitly refering to statistical tests but in both cases all
frequencies from four-fold table are needed. Some tests/functions do not
require all frequencies, however; keeping this association mining
implementation general means that we are targeting for all frequencies from
4ft. The goal here would be to provide implementation of few GUHA functions
and design API for custom function based on 4ft table (if the custom
function can be expressed using 4ft table frequencies then it should be very
easy to implement it for association mining job).

General use case scenario:
Before the association mining task is executed one would have to decide
which features can be on the left hand side of the rule (antecedents) and
which on the right hand side of the rule (succedents). It may be practical
to limit number of features on both sides (rules with many features may not
be very useful). Then a specific test or function for the 4ft table will be
specified with additional custom treasholds.

Note: The terminology used in GUHA theory is not be unified. Various
researches used different terminology. This may be a source of confusion.

My background:
I have been working as a full-time Java developer for 9 years. Currently, I
am working for JBoss (thus being paid for working on open source). A few
years ago I also started Ph.D. studies in Ostrava, Czech Republic and
association mining is the topic I am focusing on right now. Implementing
association mining in context of Mahout makes sense because it is one of the
areas of data mining which is not yet covered in Mahout at all. MapReduce
implementation can be one possible way how to tackle the challenge of mining
association rules from large data.

Regards,
Lukas Vlcek

[1]
http://en.wikipedia.org/wiki/Association_rule_learning#GUHA_procedure_ASSOC
[2] http://www.cs.cas.cz/coufal/guha/index.htm
[3] 
http://www.uivt.cas.cz/~hajek/guhabook/index.html<http://www.uivt.cas.cz/%7Ehajek/guhabook/index.html>

Reply via email to