[jira] Updated: (MAHOUT-157) Frequent Pattern Mining using Parallel FP-Growth

Robin Anil (JIRA) Mon, 12 Oct 2009 07:09:56 -0700

     [ 
https://issues.apache.org/jira/browse/MAHOUT-157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Robin Anil updated MAHOUT-157:
------------------------------

    Attachment: MAHOUT-157-final.patch

Changes in this patch:
    * Improved FPGrowth mining speed by 1.5-2x by caching recently generated 
conditional FP-Trees. The cache size is now a configurable parameter, which is 
useful on large-memory systems.
    * Added comments and a package summary.
    * Test coverage is above 98%.
    * A custom regex splitter pattern can be provided via a parameter to split 
each input line into items (words, groups of words, etc.). This should help 
when parsing various text formats.
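
As a rough illustration of what a regex splitter pattern does to an input 
line, here is a minimal sketch that uses only java.util.regex. The class name 
SplitterSketch, the helper splitLine and the sample pattern are assumptions 
made for illustration; they are not taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class SplitterSketch {

    // Hypothetical helper: splits one input line into items using a
    // caller-supplied regex, trimming whitespace and dropping empty tokens.
    static List<String> splitLine(String line, Pattern splitter) {
        List<String> items = new ArrayList<>();
        for (String token : splitter.split(line)) {
            String item = token.trim();
            if (!item.isEmpty()) {
                items.add(item);
            }
        }
        return items;
    }

    public static void main(String[] args) {
        // Assumed example pattern: items separated by commas or tabs.
        Pattern commaOrTab = Pattern.compile("[,\\t]");
        System.out.println(splitLine("milk, bread\tbutter", commaOrTab));
        // prints [milk, bread, butter]
    }
}
```

A different pattern (for example one that splits on whitespace) would turn the 
same line into a different set of items, which is why the pattern is exposed 
as a parameter.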

Incorporated Isabel's comments.

Example: current usage for String objects
{noformat}

FPGrowth fp = new FPGrowth();
Set<String> features = new HashSet<String>();
SequenceFile.Writer writer =
    new SequenceFile.Writer(fs, conf, path, Text.class, TopKStringPatterns.class);
fp.generateTopKStringFrequentPatterns(
    // transactions to mine, one List<String> per input line
    new StringRecordIterator(
        new FileLineIterable(new File(input), encoding, false), pattern),
    // frequent items and their frequencies, from a second pass over the input
    fp.generateFList(
        new StringRecordIterator(
            new FileLineIterable(new File(input), encoding, false), pattern),
        minSupport),
    minSupport,
    maxHeapSize,
    features,
    // writes [feature, top-K patterns] pairs to the sequence file
    new StringOutputConvertor(new SequenceFileOutputCollector(writer))
);
 {noformat}

    * The first argument is the iterator over transactions; in this case it is 
an Iterator<List<String>>.
    * The second argument is the output of the generateFList function, which 
returns the frequent items and their frequencies from the given transaction 
database iterator.
    * The third argument is the minimum support of the patterns to be generated.
    * The fourth argument is the maximum number of patterns to be mined for 
each feature.
    * The fifth argument is the set of features for which frequent patterns 
have to be mined.
    * The last argument is an output collector that takes [key, value] pairs of 
feature and top-K patterns, in the format [String, List<Pair<List<String>, 
Long>>], and passes them to the appropriate writer class, which takes care of 
storing the object, in this case in a SequenceFile output format.
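
To make the second argument more concrete, the sketch below shows one way to 
compute frequent items and their frequencies from a transaction iterator. It 
is written in plain Java and is not the code from the patch: the class name 
FListSketch, the method frequentItems, the choice to count an item once per 
transaction, and the descending-frequency ordering are all assumptions made 
for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class FListSketch {

    // Hypothetical stand-in for an F-list computation: returns the items whose
    // support is at least minSupport, paired with their frequencies.
    static List<Map.Entry<String, Long>> frequentItems(
            Iterator<List<String>> transactions, long minSupport) {
        Map<String, Long> counts = new HashMap<>();
        while (transactions.hasNext()) {
            // Assumption: an item counts once per transaction.
            for (String item : new HashSet<>(transactions.next())) {
                counts.merge(item, 1L, Long::sum);
            }
        }
        List<Map.Entry<String, Long>> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            if (e.getValue() >= minSupport) {
                result.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
            }
        }
        // Most frequent first; ties broken alphabetically to keep output stable.
        result.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> db = Arrays.asList(
            Arrays.asList("a", "b"),
            Arrays.asList("a", "c"),
            Arrays.asList("a", "b", "d"));
        System.out.println(frequentItems(db.iterator(), 2));
        // prints [a=3, b=2]
    }
}
```

Note that the usage example constructs the input iterator twice, once for 
generateFList and once for the mining call, since an iterator can only be 
consumed once.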


The numGroups parameter in FPGrowthJob specifies the number of groups into 
which the transactions are decomposed. 
The numTreeCacheEntries parameter specifies how many generated conditional 
FP-Trees are kept in memory so that they do not have to be regenerated. 
Increasing this number increases memory consumption but may improve speed up 
to a certain point; where that point lies depends entirely on the dataset in 
question. A value of 5-10 is recommended for mining up to the top 100 patterns 
for each feature.
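
The memory/speed trade-off behind numTreeCacheEntries can be pictured with a 
small bounded cache. The sketch below is a generic least-recently-used cache 
built on java.util.LinkedHashMap. It is not the cache from the patch, whose 
eviction policy is not described here; the class name TreeCacheSketch and the 
String placeholders standing in for conditional FP-Trees are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bounded cache: keeps at most maxEntries values and evicts the
// least recently accessed one when a new entry pushes it over the limit.
public class TreeCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    private final int maxEntries;

    public TreeCacheSketch(int maxEntries) {
        // accessOrder = true makes iteration order follow recency of access.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by LinkedHashMap after each insertion.
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        // Strings stand in for conditional FP-Trees in this sketch.
        TreeCacheSketch<String, String> cache = new TreeCacheSketch<>(2);
        cache.put("item-a", "tree for a");
        cache.put("item-b", "tree for b");
        cache.get("item-a");               // touch a, so b is now the eldest
        cache.put("item-c", "tree for c"); // evicts b
        System.out.println(cache.keySet());
        // prints [item-a, item-c]
    }
}
```

A larger maxEntries keeps more trees alive at once, which costs memory but 
avoids rebuilding trees that are requested again soon.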

> Frequent Pattern Mining using Parallel FP-Growth
> ------------------------------------------------
>
>                 Key: MAHOUT-157
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-157
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.2
>            Reporter: Robin Anil
>            Assignee: Robin Anil
>             Fix For: 0.2
>
>         Attachments: MAHOUT-157-August-17.patch, MAHOUT-157-August-24.patch, 
> MAHOUT-157-August-31.patch, MAHOUT-157-August-6.patch, 
> MAHOUT-157-Combinations-BSD-License.patch, 
> MAHOUT-157-Combinations-BSD-License.patch, MAHOUT-157-final.patch, 
> MAHOUT-157-inProgress-August-5.patch, MAHOUT-157-Oct-1.patch, 
> MAHOUT-157-Oct-10.pfpgrowth.patch, MAHOUT-157-Oct-8.pfpgrowth.patch, 
> MAHOUT-157-Oct-8.TestedMapReducePipeline.patch, 
> MAHOUT-157-Oct-9.StreamingDBRead-Inprogress.patch, 
> MAHOUT-157-September-10.patch, MAHOUT-157-September-18.patch, 
> MAHOUT-157-September-5.patch
>
>
> Implement: http://infolab.stanford.edu/~echang/recsys08-69.pdf

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
