Re: Mahout GSoC 2010 proposal: Association Mining

Robin Anil Mon, 05 Apr 2010 15:00:24 -0700

Hi Lukas,
               Sorry for being late to getting back to you on this.
Association rule mining is a great addition to FPGrowth. I am not sure I
understand GUHA method well but then again I understood Ted's LLR after some
deep reading. Could you put up an interesting example to help us understand
this method. Maybe starting from a transaction of shopping cart item ? A
great demo is big plus for a GSOC project.


Robin

On Mon, Mar 29, 2010 at 1:46 AM, Lukáš Vlček <lukas.vl...@gmail.com> wrote:

> Hello,
>
> I would like to apply for Mahout GSoC 2010. My proposal is to implement
> Association Mining algorithm utilizing existing PFPGrowth implementation (
> http://cwiki.apache.org/MAHOUT/parallelfrequentpatternmining.html).
>
> As for the Assoiciation Mining I would like to implement a very general
> algorithm based on old and established method called GUHA [1]. Short and
> informative description of GUHA method can be found here [2]. Very
> exhaustive theoretical foundation can be found here [3]. Since the GUHA
> method has been developing from 60's and is very rich now and given the
> limited time during GSoC program it would be wise to reduce scope of
> initial
> implementation.
>
> There are two main topics that I would like to focus on during GSoC:
>
> 1) Enhancing and generalization of PFPGworth:
> Frequent pattern mining is usually part of each association maining task.
> In
> Mahout there is existing implementation of PFPGrowth algorithm. I would
> like
> to utilize this implementaion but it would be necessary to enhance or
> generalize it first. The goal here would be to keep current functionality
> and performance of PFPGrowth but allow more generalized input/output data
> and conditions. We will need to get frequencies of very rare patterns thus
> if it will show up that mining only top K is limiting factor then we would
> need to allow relaxation of this approach. Also we will need to take
> account
> on negative features (those missing in individual transaction). Negative
> features can be either directly supported by implementation of FP algorithm
> or it can be coded at the feature level (decision about what is better
> approach will be part of the GSoC program). It should be noted that for the
> GSoC program we will narrow scope of association mining antecedents and
> succedents to conjunctions of data features only.
>
> 2) API for custom association mining functions based on 4ft table:
> Association mining in GUHA sense means testing hypothesis on four-fold
> table
> (4ft) [see 2. item 5]. There has been proposed a lot of association
> functions for GUHA, some of them are based on statistical tests (for
> example
> Fischer test, Chi-square test), some are based on frequent analysis while
> not explicitly refering to statistical tests but in both cases all
> frequencies from four-fold table are needed. Some tests/functions do not
> require all frequencies, however; keeping this association mining
> implementation general means that we are targeting for all frequencies from
> 4ft. The goal here would be to provide implementation of few GUHA functions
> and design API for custom function based on 4ft table (if the custom
> function can be expressed using 4ft table frequencies then it should be
> very
> easy to implement it for association mining job).
>
> General use case scenario:
> Before the association mining task is executed one would have to decide
> which features can be on the left hand side of the rule (antecedents) and
> which on the right hand side of the rule (succedents). It may be practical
> to limit number of features on both sides (rules with many features may not
> be very useful). Then a specific test or function for the 4ft table will be
> specified with additional custom treasholds.
>
> Note: The terminology used in GUHA theory is not be unified. Various
> researches used different terminology. This may be a source of confusion.
>
> My background:
> I have been working as a full-time Java developer for 9 years. Currently, I
> am working for JBoss (thus being paid for working on open source). A few
> years ago I also started Ph.D. studies in Ostrava, Czech Republic and
> association mining is the topic I am focusing on right now. Implementing
> association mining in context of Mahout makes sense because it is one of
> the
> areas of data mining which is not yet covered in Mahout at all. MapReduce
> implementation can be one possible way how to tackle the challenge of
> mining
> association rules from large data.
>
> Regards,
> Lukas Vlcek
>
> [1]
> http://en.wikipedia.org/wiki/Association_rule_learning#GUHA_procedure_ASSOC
> [2] http://www.cs.cas.cz/coufal/guha/index.htm
> [3] http://www.uivt.cas.cz/~hajek/guhabook/index.html<
> http://www.uivt.cas.cz/%7Ehajek/guhabook/index.html>
>

Re: Mahout GSoC 2010 proposal: Association Mining

Reply via email to