It already does; check the Mahout wiki.

On Mon, Feb 15, 2010 at 10:08 PM, Neal Richter <[email protected]> wrote:
> Note that there is a sort-of standard input and output spec for itemset
> mining that was defined for the FIMI'03 and FIMI'04 workshops.
>
> http://fimi.cs.helsinki.fi/
> http://fimi.cs.helsinki.fi/fimi04/rules.html
>
> Having a switch to adhere to that simple standard could be useful as well.
>
> Code submitted to that workshop also implemented open, closed and
> maximal itemsets.
>
> - Neal
>
> On Mon, Feb 15, 2010 at 9:25 AM, Robin Anil <[email protected]> wrote:
> > Cool. Thanks for sharing this. I will file a jira issue over this.
> >
> > Robin
> >
> > On Mon, Feb 15, 2010 at 9:52 PM, Neal Richter <[email protected]> wrote:
> >
> >> I have no problem with the repetition!
> >>
> >> I'll have to poke at this a bit more, but I like the switches ideas.
> >> I often use Christian Borgelt's itemset implementations for playing
> >> with data. He's implemented a nice set of switches, see below.
> >> Setting a minimum support threshold and minimum itemset size are both
> >> convenient and tend to make the algorithm run a bit faster.
> >>
> >> http://www.borgelt.net/software.html
> >>
> >> ne...@nrichter-laptop:~$ fpgrowth_fim
> >> usage: fpgrowth_fim [options] infile outfile
> >> find frequent item sets with the fpgrowth algorithm
> >> version 1.13 (2008.05.02) (c) 2004-2008 Christian Borgelt
> >> -m#      minimal number of items per item set (default: 1)
> >> -n#      maximal number of items per item set (default: no limit)
> >> -s#      minimal support of an item set (default: 10%)
> >>          (positive: percentage, negative: absolute number)
> >> -d#      minimal binary logarithm of support quotient (default: none)
> >> -p#      output format for the item set support (default: "%.1f")
> >> -a       print absolute support (number of transactions)
> >> -g       write output in scanable form (quote certain characters)
> >> -q#      sort items w.r.t. their frequency (default: -2)
> >>          (1: ascending, -1: descending, 0: do not sort,
> >>          2: ascending, -2: descending w.r.t. transaction size sum)
> >> -u       use alternative tree projection method
> >> -z       do not prune tree projections to bonsai
> >> -j       use quicksort to sort the transactions (default: heapsort)
> >> -i#      ignore records starting with a character in the given string
> >> -b/f/r#  blank characters, field and record separators
> >>          (default: " \t\r", " \t", "\n")
> >> infile   file to read transactions from
> >> outfile  file to write frequent item se
> >>
> >> On Mon, Feb 15, 2010 at 9:14 AM, Robin Anil <[email protected]> wrote:
> >> > Hi Neal,
> >> > I know there is repetition. I tried sticking true to the original
> >> > algorithm, that is, finding closed patterns and using the longest
> >> > one.
> >> >
> >> > Say 68 and 12 occur together 1000 times and 68 12 17 also occurs
> >> > 1000 times. Then there is no information that the former pattern
> >> > gives you, so you can remove it. Therefore you say that 68 12 17 is
> >> > a closed pattern and all the patterns it is enclosing are removed.
> >> >
> >> > had 68 alone occurred 2000 times. It no longer becomes a closed
> >> > pattern..
> >> >
> >> > Things could be made configurable by having a flag to remove closed
> >> > patterns within a percentage of the support, or to mine only
> >> > patterns > 3 items in length. These are tricky but could be done.
> >> >
> >> > Robin
> >> >
> >> > On Mon, Feb 15, 2010 at 9:34 PM, Neal Richter <[email protected]> wrote:
> >> >
> >> >> Grant: Chapter 5 of Han and Kamber (Data Mining: Concepts and
> >> >> Techniques) details itemset mining and the fpgrowth alg. Han is a
> >> >> co-inventor of it.
> >> >>
> >> >> There is a bit of repetition in the output compared to other itemset
> >> >> mining packages, though this structure is convenient for relational
> >> >> indexing by key.
> >> >>
> >> >> - Neal
> >> >>
> >> >> On Mon, Feb 15, 2010 at 6:49 AM, Robin Anil <[email protected]> wrote:
> >> >> > Ok.. A bit more background..
> >> >> >
> >> >> > An Itemset is a subset of I1, I2, I3... In
> >> >> >
> >> >> > so [I2, I4, I7] is an itemset and the support (no. of times it is
> >> >> > visible in the dataset) is, say, Y
> >> >> >
> >> >> > A Pattern is Pair<Itemset, support>
> >> >> >
> >> >> > Take a look at it in this format
> >> >> >
> >> >> > 68:
> >> >> > ([68],90692),
> >> >> > ([17, 68],90683),
> >> >> > ([12, 68],90490),
> >> >> > ([17, 12, 68],90481),
> >> >> > ([18, 68],90291)
> >> >> >
> >> >> > these are the top patterns containing 68 and their support in
> >> >> > descending order
> >> >> > 68 occurs with 12, 90490 times
> >> >> >
> >> >> > Robin
> >> >> >
> >> >> > On Mon, Feb 15, 2010 at 6:27 PM, Grant Ingersoll <[email protected]> wrote:
> >> >> >
> >> >> >> On Feb 14, 2010, at 11:37 PM, Robin Anil wrote:
> >> >> >>
> >> >> >> > Each key is a feature and each attribute is the topK frequent
> >> >> >> > patterns where the feature exists
> >> >> >>
> >> >> >> Still a bit confused.
> >> >> >> Given:
> >> >> >> Key: 68: Value: ([68],90692), ([17, 68],90683), ([12, 68],90490),
> >> >> >> ([17, 12, 68],90481), ([18, 68],90291), ([17, 18, 68],90282),
> >> >> >> ([12, 18, 68],90229), ([17, 12, 18, 68],90220), ([31, 68],89071),
> >> >> >> ([17, 31, 68],89062), ([12, 31, 68],88874), ([17, 12, 31, 68],88865),
> >> >> >> ([18, 31, 68],88681), ([17, 18, 31, 68],88672), ([12, 18, 31, 68],88619),
> >> >> >> ([17, 12, 18, 31, 68],88610), ([16, 68],87933),
> >> >> >>
> >> >> >> So, 68 is the feature in question. That makes sense. Then, what is
> >> >> >> the significance of the [] areas, as in [68],90692 or [17,12,68],
> >> >> >> 90481. Why all the repetition?
> >> >> >>
> >> >> >> -Grant
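
For anyone following the closed-pattern discussion quoted above, here is a rough Python sketch of the rule as Robin describes it: a pattern is dropped when some proper superset has exactly the same support. This is only an illustration, not Mahout's implementation. The function name `closed_patterns`, the `min_size` parameter (loosely modelled on the "minimum itemset size" switch idea), and the toy supports of 1000 and 2000 are all assumptions taken from or added to the example in the thread.

```python
from typing import FrozenSet, List, Tuple

# A pattern is a pair (itemset, support), as in Pair<Itemset, support> above.
Pattern = Tuple[FrozenSet[int], int]


def closed_patterns(patterns: List[Pattern], min_size: int = 1) -> List[Pattern]:
    """Keep patterns that have no proper superset with the same support.

    min_size is a hypothetical filter on itemset length, applied after the
    closedness check; it is not an existing Mahout option.
    """
    closed = []
    for items, support in patterns:
        # `items < other` is Python's proper-subset test on frozensets.
        subsumed = any(
            items < other and support == other_support
            for other, other_support in patterns
        )
        if not subsumed and len(items) >= min_size:
            closed.append((items, support))
    return closed


# Toy data from the thread: 68 alone occurs 2000 times, while
# {12, 68} and {12, 17, 68} both occur 1000 times.
patterns = [
    (frozenset({68}), 2000),
    (frozenset({12, 68}), 1000),
    (frozenset({12, 17, 68}), 1000),
]

for items, support in closed_patterns(patterns):
    print(sorted(items), support)
```

With this toy input, [12, 68] is dropped because [12, 17, 68] has the same support, while [68] survives because none of its supersets reaches 2000. The pairwise comparison is quadratic in the number of patterns, which is fine for a top-K list per key but would need a smarter index for a full pattern set.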
