Hi Lukas, Sorry for being late to getting back to you on this. Association rule mining is a great addition to FPGrowth. I am not sure I understand GUHA method well but then again I understood Ted's LLR after some deep reading. Could you put up an interesting example to help us understand this method. Maybe starting from a transaction of shopping cart item ? A great demo is big plus for a GSOC project.
Robin On Mon, Mar 29, 2010 at 1:46 AM, Lukáš Vlček <lukas.vl...@gmail.com> wrote: > Hello, > > I would like to apply for Mahout GSoC 2010. My proposal is to implement > Association Mining algorithm utilizing existing PFPGrowth implementation ( > http://cwiki.apache.org/MAHOUT/parallelfrequentpatternmining.html). > > As for the Assoiciation Mining I would like to implement a very general > algorithm based on old and established method called GUHA [1]. Short and > informative description of GUHA method can be found here [2]. Very > exhaustive theoretical foundation can be found here [3]. Since the GUHA > method has been developing from 60's and is very rich now and given the > limited time during GSoC program it would be wise to reduce scope of > initial > implementation. > > There are two main topics that I would like to focus on during GSoC: > > 1) Enhancing and generalization of PFPGworth: > Frequent pattern mining is usually part of each association maining task. > In > Mahout there is existing implementation of PFPGrowth algorithm. I would > like > to utilize this implementaion but it would be necessary to enhance or > generalize it first. The goal here would be to keep current functionality > and performance of PFPGrowth but allow more generalized input/output data > and conditions. We will need to get frequencies of very rare patterns thus > if it will show up that mining only top K is limiting factor then we would > need to allow relaxation of this approach. Also we will need to take > account > on negative features (those missing in individual transaction). Negative > features can be either directly supported by implementation of FP algorithm > or it can be coded at the feature level (decision about what is better > approach will be part of the GSoC program). It should be noted that for the > GSoC program we will narrow scope of association mining antecedents and > succedents to conjunctions of data features only. > > 2) API for custom association mining functions based on 4ft table: > Association mining in GUHA sense means testing hypothesis on four-fold > table > (4ft) [see 2. item 5]. There has been proposed a lot of association > functions for GUHA, some of them are based on statistical tests (for > example > Fischer test, Chi-square test), some are based on frequent analysis while > not explicitly refering to statistical tests but in both cases all > frequencies from four-fold table are needed. Some tests/functions do not > require all frequencies, however; keeping this association mining > implementation general means that we are targeting for all frequencies from > 4ft. The goal here would be to provide implementation of few GUHA functions > and design API for custom function based on 4ft table (if the custom > function can be expressed using 4ft table frequencies then it should be > very > easy to implement it for association mining job). > > General use case scenario: > Before the association mining task is executed one would have to decide > which features can be on the left hand side of the rule (antecedents) and > which on the right hand side of the rule (succedents). It may be practical > to limit number of features on both sides (rules with many features may not > be very useful). Then a specific test or function for the 4ft table will be > specified with additional custom treasholds. > > Note: The terminology used in GUHA theory is not be unified. Various > researches used different terminology. This may be a source of confusion. > > My background: > I have been working as a full-time Java developer for 9 years. Currently, I > am working for JBoss (thus being paid for working on open source). A few > years ago I also started Ph.D. studies in Ostrava, Czech Republic and > association mining is the topic I am focusing on right now. Implementing > association mining in context of Mahout makes sense because it is one of > the > areas of data mining which is not yet covered in Mahout at all. MapReduce > implementation can be one possible way how to tackle the challenge of > mining > association rules from large data. > > Regards, > Lukas Vlcek > > [1] > http://en.wikipedia.org/wiki/Association_rule_learning#GUHA_procedure_ASSOC > [2] http://www.cs.cas.cz/coufal/guha/index.htm > [3] http://www.uivt.cas.cz/~hajek/guhabook/index.html< > http://www.uivt.cas.cz/%7Ehajek/guhabook/index.html> >