h1. Frequent Pattern Mining
We have a Top-K Parallel FPGrowth implementation. Given a huge transaction list, it finds all unique features (field values) and eliminates those features whose frequency in the whole dataset is less than minSupport. Using the N remaining features, it finds the top K closed patterns for each of them, generating N*K patterns. The FPGrowth algorithm is a generic implementation: we can use any Object type to denote a feature. The current implementation requires you to use a String as the object type. You may implement a version for any object by creating the Iterators, Convertors and TopKPatternWritable for that particular object. For more information please refer to the package org.apache.mahout.fpm.pfpgrowth.convertors.string:
{code}
// e.g.: mine the top K patterns per feature from a file of transactions.
// input, encoding, pattern, minSupport, maxHeapSize and writer are
// assumed to be defined by the surrounding code.
FPGrowth<String> fp = new FPGrowth<String>();
Set<String> features = new HashSet<String>();
fp.generateTopKStringFrequentPatterns(
    new StringRecordIterator(
        new FileLineIterable(new File(input), encoding, false), pattern),
    fp.generateFList(
        new StringRecordIterator(
            new FileLineIterable(new File(input), encoding, false), pattern),
        minSupport),
    minSupport,
    maxHeapSize,
    features,
    new StringOutputConvertor(
        new SequenceFileOutputCollector<Text, TopKStringPatterns>(writer)));
{code}
* The first argument is the iterator over the transactions; in this case it is an Iterator<List<String>>
* The second argument is the output of the generateFList function, which returns the frequent items and their frequencies from the given transaction database iterator
* The third argument is the minimum support a pattern must have in order to be generated
* The fourth argument is the maximum number of patterns to be mined for each feature
* The fifth argument is the set of features for which the frequent patterns have to be mined
* The last argument is an output collector which takes [key, value] pairs of a feature and its top K patterns, of the format [String, List<Pair<List<String>, Long>>], and writes them to the appropriate writer class that takes care of storing the object, in this case in a SequenceFile output format (a sketch of the writer setup follows below)
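
The example above assumes a writer has already been opened. A minimal sketch of how such a writer might be created, assuming Hadoop's SequenceFile API and a local output path named patterns (the path name is an assumption for illustration):

{code}
// Minimal sketch (assumption, not from this page): open the
// SequenceFile writer that the output collector above writes to.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
    new Path("patterns"), Text.class, TopKStringPatterns.class);
{code}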

h2. Running Frequent Pattern Growth via command line

The command line launcher for string transaction data, org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver, has other features, including specifying the regex pattern for splitting a string line of a transaction into its constituent features.

Input files have to be in the following format:

<optional document id>TAB<TOKEN1>SPACE<TOKEN2>SPACE.....

Instead of a tab you could use , or |, as the default tokenization is done using the Java regex pattern [ ,\t]*[,|\t][ ,\t]*. You can override this parameter to parse your own log files or transaction files.
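
As a quick illustration of what the default splitter does (an illustrative snippet, not part of the driver; the sample line is made up):

{code}
import java.util.regex.Pattern;

// Illustrative snippet (assumption, not driver code): the default
// splitter breaks a line on comma, pipe, or tab, absorbing any
// surrounding spaces, commas, or tabs into the separator.
Pattern splitter = Pattern.compile("[ ,\t]*[,|\t][ ,\t]*");
String[] tokens = splitter.split("doc1\tcat, dog|fish");
// tokens => ["doc1", "cat", "dog", "fish"]
{code}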

Each line is a transaction; the FPGrowth algorithm mines the top K frequently occurring sets of items and their counts from the given input data. To try it out, pick one of the example datasets from http://fimi.cs.helsinki.fi/data/, say accidents.dat.gz, and download it to your mahout folder.
{code}
mvn -e exec:java -Dexec.mainClass=org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver -Dexec.args="-i accidents.dat.gz -o patterns -k 50 -method sequential -regex [\ ]"
{code}

Note that the input to the algorithm can be an uncompressed file, a gzip-compressed file, or even a directory containing any number of such files. We modified the regex so that tokens are split on a space; note that the input regex string is escaped. The output will be dumped to a SequenceFile named patterns in Text=>TopKStringPatterns format. You can write a small program to read this sequence file with Text as the key class and TopKStringPatterns as the value class.
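
A minimal sketch of such a reader (an assumption built on Hadoop's SequenceFile.Reader API, reading the patterns file produced above):

{code}
// Minimal sketch (assumption, not from this page): dump each feature
// and its top K patterns from the SequenceFile written above.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.fpm.pfpgrowth.convertors.string.TopKStringPatterns;

public class PatternDumper {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path("patterns"), conf);
    Text feature = new Text();
    TopKStringPatterns patterns = new TopKStringPatterns();
    while (reader.next(feature, patterns)) {
      System.out.println(feature + "\t" + patterns);
    }
    reader.close();
  }
}
{code}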

h2. Running Parallel FPGrowth 

Running parallel FPGrowth is as easy as changing the flag to -method mapreduce and adding the number-of-groups parameter, e.g. -g 20 for 20 groups. Put accidents.dat.gz on HDFS in a folder named accidents.

To run on a Hadoop cluster:

{code}
<HADOOP-BIN>/hadoop jar mahout-examples-xx.job org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver -i accidents.dat.gz -o patterns -k 50 -method mapreduce -g 10 -regex [\ ]
{code}

Or, to demo the algorithm on a single machine:

{code}
mvn -e exec:java -Dexec.mainClass=org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver -Dexec.args="-i accidents.dat.gz -o patterns -k 50 -method mapreduce -g 10 -regex [\ ]"
{code}

Note that the accidents dataset has 340 unique features, so we chose -g 10 to split the transactions across 10 shards, with 34 features handled by each shard. The number of features need not be exactly divisible by g; the algorithm takes care of calculating the split, as sketched below. For better performance on large datasets, try not to mine more than 20-25 features per shard, and adjust the number of groups accordingly.
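
As a back-of-the-envelope illustration of the split (an assumption for clarity, not the exact PFPGrowth code), the per-shard feature count is a ceiling division:

{code}
// Illustrative sketch (assumption, not the exact PFPGrowth code):
// spreading the frequent features over g groups.
int numFeatures = 340;  // unique features in accidents.dat
int numGroups = 10;     // the -g parameter
int featuresPerGroup = (numFeatures + numGroups - 1) / numGroups; // = 34
{code}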

The numGroups parameter in FPGrowthJob specifies the number of groups into which the transactions have to be decomposed.
The numTreeCacheEntries parameter specifies the number of generated conditional FP-Trees to be kept in memory so that subsequent operations do not have to regenerate them. Increasing this number increases memory consumption but might improve speed up to a certain point; this depends entirely on the dataset in question. A value of 5-10 is recommended for mining up to the top 100 patterns for each feature.
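
As a rough sketch of driving the parallel job programmatically (the parameter key names follow the prose above and are assumptions to verify against the PFPGrowth source):

{code}
// Rough sketch; the key names are assumptions taken from the text
// above and should be checked against
// org.apache.mahout.fpm.pfpgrowth.PFPGrowth before relying on them.
Parameters params = new Parameters();
params.set("input", "accidents");         // HDFS input folder
params.set("output", "patterns");         // HDFS output folder
params.set("minSupport", "3");
params.set("maxHeapSize", "50");          // top K patterns per feature
params.set("numGroups", "10");            // transaction shards
params.set("numTreeCacheEntries", "5");   // cached conditional FP-Trees
PFPGrowth.runPFPGrowth(params);
{code}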
 
