h1. Parallel Frequent Pattern Mining
(https://cwiki.apache.org/confluence/display/MAHOUT/Parallel+Frequent+Pattern+Mining)

Mahout has a Top-K Parallel FPGrowth implementation. It is based on the paper 
[http://infolab.stanford.edu/~echang/recsys08-69.pdf], with some optimisations 
in mining the data.

Given a huge transaction list, the algorithm finds all unique features (sets of 
field values) and eliminates those features whose frequency in the whole 
dataset is less than minSupport. Using the remaining N features, we find the 
top K closed patterns for each of them, generating a total of NxK patterns. 
FPGrowth is a generic implementation: any Object type can be used to denote a 
feature. The current implementation requires you to use a String as the object 
type. You may implement a version for any other object by creating Iterators, 
Convertors and TopKPatternWritable for that particular object. For more 
information, please refer to the package 
org.apache.mahout.fpm.pfpgrowth.convertors.string.
{code}
// Example usage:
FPGrowth<String> fp = new FPGrowth<String>();
Set<String> features = new HashSet<String>();
fp.generateTopKStringFrequentPatterns(
    new StringRecordIterator(
        new FileLineIterable(new File(input), encoding, false), pattern),
    fp.generateFList(
        new StringRecordIterator(
            new FileLineIterable(new File(input), encoding, false), pattern),
        minSupport),
    minSupport,
    maxHeapSize,
    features,
    new StringOutputConvertor(
        new SequenceFileOutputCollector<Text, TopKStringPatterns>(writer)));
{code}
* The first argument is the transaction iterator; in this case an 
Iterator<List<String>>.
* The second argument is the output of the generateFList function, which 
returns the frequent items and their frequencies from the given transaction 
database iterator.
* The third argument is the minimum support a pattern must have in order to be 
generated.
* The fourth argument is the maximum number of patterns to be mined for each 
feature.
* The fifth argument is the set of features for which the frequent patterns 
have to be mined.
* The last argument is an output collector which takes \[key, value\] pairs of 
feature and top-K patterns, in the format \[String, List<Pair<List<String>, 
Long>>\], and writes them to the appropriate writer class, which takes care of 
storing the object, in this case in a SequenceFile output format (see the 
sketch below for reading these back).
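
The stored patterns can then be read back with the plain Hadoop SequenceFile 
API. The following is a minimal sketch, not part of the Mahout example above; 
the "output" path variable is a hypothetical name:
{code}
// Sketch: read back the [Text => TopKStringPatterns] entries written by the
// SequenceFileOutputCollector above. Uses org.apache.hadoop.conf.Configuration,
// org.apache.hadoop.fs.*, org.apache.hadoop.io.*, org.apache.mahout.common.Pair
// and org.apache.mahout.fpm.pfpgrowth.convertors.string.TopKStringPatterns.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
SequenceFile.Reader reader =
    new SequenceFile.Reader(fs, new Path(output), conf); // "output" is hypothetical
Text feature = new Text();
TopKStringPatterns patterns = new TopKStringPatterns();
while (reader.next(feature, patterns)) {
  // Each value holds the top-K patterns for one feature, with their counts
  for (Pair<List<String>, Long> p : patterns.getPatterns()) {
    System.out.println(feature + " => " + p.getFirst() + ": " + p.getSecond());
  }
}
reader.close();
{code}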

h2. Running Frequent Pattern Growth via command line

The command line launcher for string transaction data, 
org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver, has other features, including 
specifying the regex pattern for splitting a string line of a transaction into 
its constituent features.

Input files have to be in the following format.

<optional document id>TAB<TOKEN1>SPACE<TOKEN2>SPACE....

Instead of a tab you could use , or \| as the separator, since the default 
tokenization is done using the Java regex pattern {code}[,\t]*[,|\t][ ,\t]*{code}
You can override this parameter to parse your log files or transaction files 
(each line is a transaction). The FPGrowth algorithm mines the top K frequently 
occurring sets of items and their counts from the given input data.
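
For example, a small transaction file in this format might look like the 
following (hypothetical data; with space-separated tokens like these you would 
override the default regex, e.g. \-regex '[\ ]' as in the examples below):
{code}
1	milk bread butter
2	bread jam
3	milk bread jam butter
{code}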

$MAHOUT_HOME/core/src/test/resources/retail.dat is a sample dataset in this 
format. As the FPGrowth algorithm has non-trivial memory requirements, first we 
will create a small dataset and analyze it without bringing up Hadoop:

{code}
head -1000 < core/src/test/resources/retail.dat > /tmp/retail_1k.dat
bin/mahout fpg \
     -i /tmp/retail_1k.dat \
     -o patterns \
     -k 50 \
     -method sequential \
     -regex '[\ ]' \
     -s 2
{code}

The minimum support parameter (\-s) is the minimum number of times a pattern or 
a feature needs to occur in the dataset for it to be included in the generated 
patterns. You can speed up the process by using a larger value of \-s. There 
are cases where you will get fewer than k patterns for a particular feature 
because the rest do not qualify under the minimum support criterion.

Note that the input to the algorithm can be an uncompressed file, a 
gz-compressed file, or even a directory containing any number of such files.
We modified the regex to split tokens on spaces. Note that the input regex 
string is escaped.

h2. Running Parallel FPGrowth
Bring up Hadoop for the rest of these tests.

Running parallel FPGrowth is as easy as changing the flag to \-method 
mapreduce and adding the number-of-groups parameter, e.g. \-g 20 for 20 groups. 
First, let's run the above sample test in map-reduce mode:
{code}
bin/mahout fpg \
     -i /tmp/retail_1k.dat \
     -o patterns \
     -k 50 \
     -method mapreduce \
     -g 20 \
     -regex '[\ ]' \
     -s 2
{code}
The above test takes 20 seconds on the author's laptop, vs. 30 seconds in the 
sequential mode above.

Now, let's run some full-size datasets. Use the complete retail.dat file in the 
Mahout source. Then, run the Hadoop version of the FPGrowth job:
{code}
bin/mahout fpg \
     -i core/src/test/resources/retail.dat \
     -o patterns \
     -k 50 \
     -method mapreduce \
     -g 10 \
     -regex '[\ ]' \
     -s 2
{code}

Or, to run a dataset of this size in sequential mode on a single machine, let's 
give Mahout a lot more memory and only keep features that occur more than 300 
times:
{code}
export MAHOUT_HEAPSIZE=-Xmx5000m
bin/mahout fpg \
     -i core/src/test/resources/retail.dat \
     -o patterns \
     -k 50 \
     -method sequential \
     -regex '[\ ]' \
     -s 300
{code}
This job in sequential mode took 30 minutes, with 1.5G allocated to the 
process.

A much more ambitious task is a file called accidents.dat.gz from 
[http://fimi.cs.helsinki.fi/data/]. Working with this complete dataset will 
require a multi-server Hadoop cluster. Get accidents.dat.gz and place it on 
your HDFS in a folder named accidents. Note that accidents.dat has 340 unique 
features, so we chose \-g 10 to split the transactions across 10 shards, with 
34 features mined from each shard. (Note: the number of features does not need 
to be exactly divisible by g; the algorithm takes care of calculating the 
split.) For better performance on large datasets, try not to mine more than 
20-25 features per shard, and adjust the number of groups accordingly.
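
Following that guideline, a run over the accidents data might look like this. 
This is a sketch: the output path, \-k and \-s values are illustrative choices 
rather than recommendations from this page, and \-g 17 is picked so that 
340/17 = 20 features are mined per shard:
{code}
bin/mahout fpg \
     -i accidents \
     -o accidents-patterns \
     -k 50 \
     -method mapreduce \
     -g 17 \
     -regex '[\ ]' \
     -s 2
{code}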

The numGroups parameter in FPGrowthJob specifies the number of groups into 
which the transactions have to be decomposed.
The numTreeCacheEntries parameter specifies the number of generated conditional 
FP-Trees to be kept in memory so that subsequent operations do not need to 
regenerate them. Increasing this number increases memory consumption but may 
improve speed up to a point; this depends entirely on the dataset in question. 
A value of 5-10 is recommended for mining up to the top 100 patterns for each 
feature.
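
If your Mahout build exposes the tree cache size as the \-tc option of the fpg 
launcher (check bin/mahout fpg --help; the flag name is an assumption here), it 
can be set alongside the other options:
{code}
bin/mahout fpg \
     -i accidents \
     -o accidents-patterns \
     -k 100 \
     -method mapreduce \
     -g 17 \
     -regex '[\ ]' \
     -s 2 \
     -tc 10
{code}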

h2. Viewing the results
The output will be dumped to a SequenceFile in the frequentpatterns directory 
in Text => TopKStringPatterns format. Run this command to see a few of the 
frequent patterns:
{code}
bin/mahout seqdumper \
     -s patterns/frequentpatterns/part-?-00000 \
     -n 4
{code}
or replace -n 4 with -c for the count of patterns.
 
