Re: FP Growth Understanding

Robin Anil Mon, 15 Feb 2010 09:44:34 -0800

It already does; check the Mahout wiki.

On Mon, Feb 15, 2010 at 10:08 PM, Neal Richter <[email protected]> wrote:


> Note that there is a sort-of standard input and output spec for itemset
> mining that was defined for the FIMI'03 and FIMI'04 workshops.
>
> http://fimi.cs.helsinki.fi/
> http://fimi.cs.helsinki.fi/fimi04/rules.html
>
> Having a switch to adhere to that simple standard could be useful as well.
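
A minimal sketch of what such a switch might emit. It assumes the FIMI output convention is one itemset per line, with items separated by spaces and followed by the support count in parentheses; check the rules page above before relying on it. The function name `to_fimi_lines` is hypothetical and is not part of Mahout.

```python
from typing import Iterable, List, Tuple

# A pattern is an (itemset, support) pair, as described in this thread.
Pattern = Tuple[List[int], int]


def to_fimi_lines(patterns: Iterable[Pattern]) -> List[str]:
    """Format patterns as 'i1 i2 ... (support)' lines (assumed FIMI style)."""
    lines = []
    for items, support in patterns:
        lines.append(" ".join(str(i) for i in items) + f" ({support})")
    return lines


# Two patterns taken from the output quoted later in this thread.
patterns = [([68], 90692), ([17, 68], 90683)]
for line in to_fimi_lines(patterns):
    print(line)
```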
>
> Code submitted to that workshop also implemented open, closed and
> maximal itemsets.
>
> - Neal
>
> On Mon, Feb 15, 2010 at 9:25 AM, Robin Anil <[email protected]> wrote:
> > Cool. Thanks for sharing this. I will file a JIRA issue for this.
> >
> > Robin
> >
> >
> >
> > On Mon, Feb 15, 2010 at 9:52 PM, Neal Richter <[email protected]>
> wrote:
> >
> >> I have no problem with the repetition!
> >>
> >> I'll have to poke at this a bit more, but I like the switches idea.
> >> I often use Christian Borgelt's itemset implementations for playing
> >> with data.  He's implemented a nice set of switches, see below.
> >> Setting a minimum support threshold and a minimum itemset size are
> >> both convenient and tend to make the algorithm run a bit faster.
> >>
> >> http://www.borgelt.net/software.html
> >>
> >> ne...@nrichter-laptop:~$ fpgrowth_fim
> >> usage: fpgrowth_fim [options] infile outfile
> >> find frequent item sets with the fpgrowth algorithm
> >> version 1.13 (2008.05.02)        (c) 2004-2008   Christian Borgelt
> >> -m#      minimal number of items per item set (default: 1)
> >> -n#      maximal number of items per item set (default: no limit)
> >> -s#      minimal support of an item set (default: 10%)
> >>         (positive: percentage, negative: absolute number)
> >> -d#      minimal binary logarithm of support quotient (default: none)
> >> -p#      output format for the item set support (default: "%.1f")
> >> -a       print absolute support (number of transactions)
> >> -g       write output in scanable form (quote certain characters)
> >> -q#      sort items w.r.t. their frequency (default: -2)
> >>         (1: ascending, -1: descending, 0: do not sort,
> >>          2: ascending, -2: descending w.r.t. transaction size sum)
> >> -u       use alternative tree projection method
> >> -z       do not prune tree projections to bonsai
> >> -j       use quicksort to sort the transactions (default: heapsort)
> >> -i#      ignore records starting with a character in the given string
> >> -b/f/r#  blank characters, field and record separators
> >>         (default: " \t\r", " \t", "\n")
> >> infile   file to read transactions from
> >> outfile  file to write frequent item se
> >>
> >> On Mon, Feb 15, 2010 at 9:14 AM, Robin Anil <[email protected]>
> wrote:
> >> > Hi Neal,
> >> > I know there is repetition. I tried to stay true to the original
> >> > algorithm, which finds closed patterns and uses the longest one.
> >> >
> >> > Say 68 and 12 occur together 1000 times, and 68 12 17 also occurs
> >> > 1000 times. Then there is no information that the former pattern
> >> > gives you, so you can remove it. Therefore you say that 68 12 17 is
> >> > a closed pattern, and all the patterns it encloses are removed.
> >> >
> >> > Had 68 alone occurred 2000 times, it would no longer be enclosed by
> >> > that closed pattern.
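
The closed-pattern rule above can be sketched as follows. This is an illustration of the definition only, not Mahout's implementation: an itemset is dropped when some proper superset has exactly the same support. The function name `closed_patterns` is hypothetical, and the supports are the round numbers from the example in this message.

```python
from typing import Dict, FrozenSet, List, Tuple


def closed_patterns(
    supports: Dict[FrozenSet[int], int]
) -> List[Tuple[FrozenSet[int], int]]:
    """Keep only itemsets with no proper superset of equal support."""
    closed = []
    for items, support in supports.items():
        # '<' on frozensets tests for a proper subset.
        subsumed = any(
            items < other and support == other_support
            for other, other_support in supports.items()
        )
        if not subsumed:
            closed.append((items, support))
    return closed


supports = {
    frozenset({68}): 2000,          # 68 alone occurs 2000 times
    frozenset({12, 68}): 1000,      # same support as its superset below
    frozenset({12, 17, 68}): 1000,
}
result = closed_patterns(supports)
for items, support in result:
    print(sorted(items), support)
```

Here [12, 68] is removed because [12, 17, 68] has the same support, while [68] survives because its support differs from that of every superset.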
> >> >
> >> > Things could be made configurable by having a flag to remove closed
> >> > patterns within a percentage of the support, or to mine only
> >> > patterns > 3 items in length. These are tricky but could be done.
> >> >
> >> > Robin
> >> >
> >> >
> >> > On Mon, Feb 15, 2010 at 9:34 PM, Neal Richter <[email protected]>
> >> wrote:
> >> >
> >> >> Grant:  Chapter 5 of Han and Kamber (Data Mining: Concepts and
> >> >> Techniques) details itemset mining and the fpgrowth alg.  Han is a
> >> >> co-inventor of it.
> >> >>
> >> >> There is a bit of repetition in the output compared to other itemset
> >> >> mining packages, though this structure is convenient for relational
> >> >> indexing by key.
> >> >>
> >> >> - Neal
> >> >>
> >> >> On Mon, Feb 15, 2010 at 6:49 AM, Robin Anil <[email protected]>
> >> wrote:
> >> >> > Ok.. A bit more background..
> >> >> >
> >> >> > An itemset is a subset of the items I1, I2, I3, ..., In,
> >> >> > so [I2, I4, I7] is an itemset, and its support (the number of
> >> >> > times it appears in the dataset) is, say, Y.
> >> >> >
> >> >> > A Pattern is Pair<Itemset, support>
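
To make the definitions concrete, here is a small sketch that counts the support of an itemset and pairs the two into a pattern. The transactions are made-up toy data and the `support` helper is hypothetical; neither comes from Mahout or from the dataset discussed in this thread.

```python
from typing import Iterable, List, Set


def support(itemset: Iterable[int], transactions: List[Set[int]]) -> int:
    """Count the transactions that contain every item of the itemset."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= t)


# Toy transactions, invented for illustration only.
transactions = [{12, 17, 68}, {17, 68}, {12, 68}, {18, 31}]

# A pattern is the pair (itemset, support).
pattern = ([12, 68], support([12, 68], transactions))
print(pattern)
```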
> >> >> >
> >> >> > Take a look at it in this format:
> >> >> >
> >> >> > 68:
> >> >> >     ([68],90692),
> >> >> >     ([17, 68],90683),
> >> >> >     ([12, 68],90490),
> >> >> >     ([17, 12, 68],90481),
> >> >> >     ([18, 68],90291)
> >> >> >
> >> >> > These are the top patterns containing 68, with their support, in
> >> >> > descending order. For example, 68 occurs together with 12 a total
> >> >> > of 90490 times.
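
The per-feature layout, and the repetition it causes, can be sketched as below. This only illustrates the idea of grouping patterns by each item they contain and keeping the K with the highest support; it is not Mahout's actual code, and `top_k_by_feature` is a hypothetical name. The patterns are the five listed above.

```python
import heapq
from collections import defaultdict
from typing import Dict, List, Tuple

Pattern = Tuple[List[int], int]  # (itemset, support)


def top_k_by_feature(patterns: List[Pattern], k: int) -> Dict[int, List[Pattern]]:
    """Group patterns under every item they contain, keeping the top k by support."""
    by_feature: Dict[int, List[Pattern]] = defaultdict(list)
    for items, support in patterns:
        for item in items:
            # The same pattern is stored once per item: hence the repetition.
            by_feature[item].append((items, support))
    return {
        item: heapq.nlargest(k, plist, key=lambda p: p[1])
        for item, plist in by_feature.items()
    }


patterns = [
    ([68], 90692),
    ([17, 68], 90683),
    ([12, 68], 90490),
    ([17, 12, 68], 90481),
    ([18, 68], 90291),
]
top = top_k_by_feature(patterns, k=3)
print(top[68])
print(top[17])
```

With this layout, [17, 12, 68] shows up under keys 17, 12 and 68, which is convenient for lookup by key but repeats the pattern in the output.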
> >> >> >
> >> >> > Robin
> >> >> >
> >> >> >
> >> >> > On Mon, Feb 15, 2010 at 6:27 PM, Grant Ingersoll <[email protected]> wrote:
> >> >> >
> >> >> >>
> >> >> >> On Feb 14, 2010, at 11:37 PM, Robin Anil wrote:
> >> >> >>
> >> >> >> > Each key is a feature and each attribute is the topK frequent
> >> >> >> > patterns where the feature exists
> >> >> >>
> >> >> >> Still a bit confused.
> >> >> >> Given:
> >> >> >> Key: 68: Value: ([68],90692), ([17, 68],90683), ([12, 68],90490),
> >> >> >> ([17, 12, 68],90481), ([18, 68],90291), ([17, 18, 68],90282),
> >> >> >> ([12, 18, 68],90229), ([17, 12, 18, 68],90220), ([31, 68],89071),
> >> >> >> ([17, 31, 68],89062), ([12, 31, 68],88874),
> >> >> >> ([17, 12, 31, 68],88865), ([18, 31, 68],88681),
> >> >> >> ([17, 18, 31, 68],88672), ([12, 18, 31, 68],88619),
> >> >> >> ([17, 12, 18, 31, 68],88610), ([16, 68],87933),
> >> >> >>
> >> >> >> So, 68 is the feature in question.  That makes sense.  Then, what
> >> >> >> is the significance of the [] areas, as in [68],90692 or
> >> >> >> [17,12,68],90481?  Why all the repetition?
> >> >> >>
> >> >> >> -Grant
> >> >> >
> >> >>
> >> >
> >>
> >
>
