[ https://issues.apache.org/jira/browse/MAHOUT-157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robin Anil updated MAHOUT-157:
------------------------------
Attachment: MAHOUT-157-final.patch
Improved FPGrowth mining speed by 1.5-2x by caching recently generated
conditional FP-Trees (the number of cached trees is now a configurable
parameter, which can be raised on large-memory systems).
Added comments and a package summary.
Test coverage > 98%.
A custom regex splitter pattern can be provided via a parameter to split each
input line into items (words, groups of words, etc.). This should prove helpful
for parsing various text formats.
Included Isabel's comments.
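To illustrate the regex splitter idea, here is a minimal sketch in plain Java. It is not the code from the patch: the class SplitterSketch and its split method are hypothetical names, and dropping empty tokens is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of the MAHOUT-157 patch.
final class SplitterSketch {

  /** Splits one input line into items using the caller-supplied pattern. */
  static List<String> split(String line, Pattern splitter) {
    List<String> items = new ArrayList<>();
    for (String token : splitter.split(line)) {
      String item = token.trim();
      // Assumption: empty tokens carry no information and are dropped.
      if (!item.isEmpty()) {
        items.add(item);
      }
    }
    return items;
  }

  public static void main(String[] args) {
    // Items separated by commas or semicolons.
    System.out.println(split("milk, bread;eggs", Pattern.compile("[,;]")));
  }
}
```

Changing the pattern (for example to "\\s+" for whitespace-separated words) changes what counts as an item without touching the rest of the pipeline.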
Example: current usage for String objects
{noformat}
FPGrowth fp = new FPGrowth();
Set<String> features = new HashSet<String>();
SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path, Text.class,
    TopKStringPatterns.class);
fp.generateTopKStringFrequentPatterns(
    // transactions to mine
    new StringRecordIterator(
        new FileLineIterable(new File(input), encoding, false), pattern),
    // frequent items and their counts, computed from a separate iterator
    // over the same input
    fp.generateFList(
        new StringRecordIterator(
            new FileLineIterable(new File(input), encoding, false), pattern),
        minSupport),
    minSupport,
    maxHeapSize,
    features,
    new StringOutputConvertor(new SequenceFileOutputCollector(writer))
);
{noformat}
* The first argument is the iterator over transactions; in this case it is an
Iterator<List<String>>.
* The second argument is the output of the generateFList function, which
returns the frequent items and their frequencies from the given transaction
database iterator.
* The third argument is the minimum support of the patterns to be generated.
* The fourth argument is the maximum number of patterns to be mined for
each feature.
* The fifth argument is the set of features for which frequent patterns
have to be mined.
* The last argument is an output collector that takes [key, value] pairs of
feature and top-K patterns, in the format [String, List<Pair<List<String>,
Long>>], and writes them to the appropriate writer class, which takes care of
storing the object, in this case in the SequenceFile output format.
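To make the second argument more concrete, the following is a rough sketch of what a frequent-item list contains: each item's support (the number of transactions containing it), filtered by the minimum support and ordered from most to least frequent. It is plain Java written for illustration only. FListSketch and frequentItems are hypothetical names, and the tie-breaking order is an assumption; the actual generateFList in the patch may differ.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not the Mahout implementation.
final class FListSketch {

  /** Returns (item, support) pairs with support >= minSupport, most frequent first. */
  static List<Map.Entry<String, Long>> frequentItems(
      Iterable<List<String>> transactions, long minSupport) {
    Map<String, Long> counts = new HashMap<>();
    for (List<String> transaction : transactions) {
      // Count each item at most once per transaction.
      for (String item : new HashSet<>(transaction)) {
        counts.merge(item, 1L, Long::sum);
      }
    }
    List<Map.Entry<String, Long>> result = new ArrayList<>();
    for (Map.Entry<String, Long> e : counts.entrySet()) {
      if (e.getValue() >= minSupport) {
        result.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
      }
    }
    // Descending support; ties broken by item name (an assumption, for determinism).
    result.sort((a, b) -> {
      int bySupport = Long.compare(b.getValue(), a.getValue());
      return bySupport != 0 ? bySupport : a.getKey().compareTo(b.getKey());
    });
    return result;
  }

  public static void main(String[] args) {
    List<List<String>> db = Arrays.asList(
        Arrays.asList("a", "b", "c"),
        Arrays.asList("a", "b"),
        Arrays.asList("a", "d"));
    System.out.println(frequentItems(db, 2));
  }
}
```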
The numGroups parameter in FPGrowthJob specifies the number of groups into
which transactions have to be decomposed.
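One way to picture the role of numGroups is sketched below. It assumes that frequent items are numbered by their rank in the frequency-sorted list and split into contiguous, roughly equal groups, and that a transaction is sent to every group that owns one of its items. This partitioning scheme is an assumption made for illustration, and GroupAssignmentSketch, groupOf and groupsFor are hypothetical names; the scheme used by FPGrowthJob may differ.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration; class and method names are not from Mahout.
final class GroupAssignmentSketch {

  /** Maps an item's rank (0 = most frequent) to a group id in [0, numGroups). */
  static int groupOf(int itemRank, int numItems, int numGroups) {
    if (numItems <= 0 || numGroups <= 0 || itemRank < 0 || itemRank >= numItems) {
      throw new IllegalArgumentException("invalid rank or sizes");
    }
    int perGroup = (numItems + numGroups - 1) / numGroups; // ceiling division
    return itemRank / perGroup;
  }

  /** Returns the distinct groups a transaction (given as item ranks) is sent to. */
  static SortedSet<Integer> groupsFor(List<Integer> itemRanks, int numItems, int numGroups) {
    SortedSet<Integer> groups = new TreeSet<>();
    for (int rank : itemRanks) {
      groups.add(groupOf(rank, numItems, numGroups));
    }
    return groups;
  }

  public static void main(String[] args) {
    // 10 frequent items in 3 groups: ranks 0-3, 4-7 and 8-9.
    System.out.println(groupsFor(Arrays.asList(0, 1, 9), 10, 3));
  }
}
```

Under this assumed scheme, a larger numGroups means fewer items per group, so each group has a smaller set of items to mine, at the cost of sending each transaction to more groups.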
The numTreeCacheEntries parameter specifies the number of generated conditional
FP-Trees to keep in memory so that they need not be regenerated. Increasing this
number increases memory consumption but may improve speed up to a certain
point; the effect depends entirely on the dataset in question. A value of 5-10 is
recommended for mining up to the top 100 patterns for each feature.
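The trade-off behind numTreeCacheEntries can be sketched with a small bounded cache. The example below uses java.util.LinkedHashMap in access order so that the least recently used entry is evicted once the limit is exceeded. Whether the patch evicts in exactly this way is an assumption; LruTreeCache is a hypothetical name, and the String values stand in for conditional FP-Trees.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of a bounded cache; not the Mahout implementation.
final class LruTreeCache<K, V> extends LinkedHashMap<K, V> {
  private static final long serialVersionUID = 1L;

  private final int maxEntries;

  LruTreeCache(int maxEntries) {
    super(16, 0.75f, true); // true = iterate in access order (least recent first)
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    // Called after each insertion; evict once the limit is exceeded.
    return size() > maxEntries;
  }

  public static void main(String[] args) {
    LruTreeCache<Integer, String> cache = new LruTreeCache<>(2);
    cache.put(1, "tree-1");
    cache.put(2, "tree-2");
    cache.get(1);           // touch entry 1, so entry 2 becomes the eldest
    cache.put(3, "tree-3"); // exceeds the limit and evicts entry 2
    System.out.println(cache.keySet());
  }
}
```

A larger limit keeps more trees available for reuse but holds more memory, which matches the guidance above that the best value depends on the dataset.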
> Frequent Pattern Mining using Parallel FP-Growth
> ------------------------------------------------
>
> Key: MAHOUT-157
> URL: https://issues.apache.org/jira/browse/MAHOUT-157
> Project: Mahout
> Issue Type: New Feature
> Components: Frequent Itemset/Association Rule Mining
> Affects Versions: 0.2
> Reporter: Robin Anil
> Assignee: Robin Anil
> Fix For: 0.2
>
> Attachments: MAHOUT-157-August-17.patch, MAHOUT-157-August-24.patch,
> MAHOUT-157-August-31.patch, MAHOUT-157-August-6.patch,
> MAHOUT-157-Combinations-BSD-License.patch,
> MAHOUT-157-Combinations-BSD-License.patch, MAHOUT-157-final.patch,
> MAHOUT-157-inProgress-August-5.patch, MAHOUT-157-Oct-1.patch,
> MAHOUT-157-Oct-10.pfpgrowth.patch, MAHOUT-157-Oct-8.pfpgrowth.patch,
> MAHOUT-157-Oct-8.TestedMapReducePipeline.patch,
> MAHOUT-157-Oct-9.StreamingDBRead-Inprogress.patch,
> MAHOUT-157-September-10.patch, MAHOUT-157-September-18.patch,
> MAHOUT-157-September-5.patch
>
>
> Implement: http://infolab.stanford.edu/~echang/recsys08-69.pdf