Hello, I would like to apply for Mahout GSoC 2010. My proposal is to implement Association Mining algorithm utilizing existing PFPGrowth implementation ( http://cwiki.apache.org/MAHOUT/parallelfrequentpatternmining.html).
As for the Assoiciation Mining I would like to implement a very general algorithm based on old and established method called GUHA [1]. Short and informative description of GUHA method can be found here [2]. Very exhaustive theoretical foundation can be found here [3]. Since the GUHA method has been developing from 60's and is very rich now and given the limited time during GSoC program it would be wise to reduce scope of initial implementation. There are two main topics that I would like to focus on during GSoC: 1) Enhancing and generalization of PFPGworth: Frequent pattern mining is usually part of each association maining task. In Mahout there is existing implementation of PFPGrowth algorithm. I would like to utilize this implementaion but it would be necessary to enhance or generalize it first. The goal here would be to keep current functionality and performance of PFPGrowth but allow more generalized input/output data and conditions. We will need to get frequencies of very rare patterns thus if it will show up that mining only top K is limiting factor then we would need to allow relaxation of this approach. Also we will need to take account on negative features (those missing in individual transaction). Negative features can be either directly supported by implementation of FP algorithm or it can be coded at the feature level (decision about what is better approach will be part of the GSoC program). It should be noted that for the GSoC program we will narrow scope of association mining antecedents and succedents to conjunctions of data features only. 2) API for custom association mining functions based on 4ft table: Association mining in GUHA sense means testing hypothesis on four-fold table (4ft) [see 2. item 5]. There has been proposed a lot of association functions for GUHA, some of them are based on statistical tests (for example Fischer test, Chi-square test), some are based on frequent analysis while not explicitly refering to statistical tests but in both cases all frequencies from four-fold table are needed. Some tests/functions do not require all frequencies, however; keeping this association mining implementation general means that we are targeting for all frequencies from 4ft. The goal here would be to provide implementation of few GUHA functions and design API for custom function based on 4ft table (if the custom function can be expressed using 4ft table frequencies then it should be very easy to implement it for association mining job). General use case scenario: Before the association mining task is executed one would have to decide which features can be on the left hand side of the rule (antecedents) and which on the right hand side of the rule (succedents). It may be practical to limit number of features on both sides (rules with many features may not be very useful). Then a specific test or function for the 4ft table will be specified with additional custom treasholds. Note: The terminology used in GUHA theory is not be unified. Various researches used different terminology. This may be a source of confusion. My background: I have been working as a full-time Java developer for 9 years. Currently, I am working for JBoss (thus being paid for working on open source). A few years ago I also started Ph.D. studies in Ostrava, Czech Republic and association mining is the topic I am focusing on right now. Implementing association mining in context of Mahout makes sense because it is one of the areas of data mining which is not yet covered in Mahout at all. MapReduce implementation can be one possible way how to tackle the challenge of mining association rules from large data. Regards, Lukas Vlcek [1] http://en.wikipedia.org/wiki/Association_rule_learning#GUHA_procedure_ASSOC [2] http://www.cs.cas.cz/coufal/guha/index.htm [3] http://www.uivt.cas.cz/~hajek/guhabook/index.html<http://www.uivt.cas.cz/%7Ehajek/guhabook/index.html>