[
https://issues.apache.org/jira/browse/MAHOUT-157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robin Anil updated MAHOUT-157:
------------------------------
Attachment: MAHOUT-157-September-18.patch
MAHOUT-157-September-10.patch
This is the story of optimizing the FPGrowth implementation, step by step, and
how near-linear scaling with the number of transactions was achieved.
A dataset of 300K transactions with 100 features now takes just 5 seconds to
mine the top 1000 patterns (ignoring the time taken to read the data).
Background
The FPGrowth algorithm takes a list of itemsets (an itemset is a set of items,
e.g. one transaction) and a support value (the minimum count a mined pattern
must satisfy to be considered a frequent pattern).
Our aim here was to mine the top K frequent patterns.
Algorithm Description
An FPTree is a compact tree representation of all itemsets where each branch
denotes one itemset; duplicate itemsets are represented by a single branch with
its counts increased. Each itemset is sorted by the count of each item in the
whole database, so the most frequent items sit near the root of the tree and a
branch splits off only where the next most frequent item differs.
At every recursive stage, a unique item is taken from the tree, a conditional
tree is generated for it, and that tree is mined for the rest of the patterns.
Whatever patterns the recursion returns, the current item is added to each of
those patterns.
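A compact sketch of that recursive step (FPTree, its methods and the append
helper are illustrative names for this write-up, not the actual patch API):
{code:java}
// Illustrative sketch of the recursive mining step; FPTree, its methods and
// append() are assumed names, not the actual classes in the patch.
private void fpGrowth(FPTree tree, int[] prefix, long minSupport,
                      List<int[]> frequentPatterns) {
  for (int item : tree.items()) {                      // every unique item in this tree
    long count = tree.count(item);
    if (count < minSupport) {
      continue;                                        // not frequent, skip it
    }
    int[] pattern = append(prefix, item);              // current item added to the prefix
    frequentPatterns.add(pattern);
    // conditional tree: only the paths of the current tree that contain 'item'
    FPTree conditional = tree.conditionalTreeOf(item);
    fpGrowth(conditional, pattern, minSupport, frequentPatterns);
  }
}
{code}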
Implementation
I started like every other Java programmer, with a node class representing the
tree via a parent reference, a next reference (to the next node at the same
level) and a list of children. I was never able to go past 50K transactions
with a minSupport of 10K, which is quite large.
Diligently, I started replacing HashMaps with ArrayLists, Queue with ArrayDeque,
and PriorityQueue with TreeMap, and with YourKit I found the bottlenecks and
brought the whole run down to 15 seconds for 50K transactions with minSupport
10K, down from around 1:40 (min:sec).
You can find that patch as September-10.patch.
Memory usage was still huge, close to 1GB for the 50K dataset. Time to find the
top 1000 patterns for 200K transactions was 50 minutes with a heap size of
1.6GB (half the time I felt the program was swapping in and out).
After reading the IBM presentation, I went on a crusade to use primitives as
much as possible.
The first problem was that TreeNode was still a class and one of the major
bottlenecks:
1. Converted the tree into multiple int[] arrays holding the fixed per-node
properties: parent[node], next[node], value[node], childCount[node].
2. Converted the children into an int[][], where a node's child array is
resized and copied dynamically when the number of children exceeds the
allocated limit.
3. Converted Pattern from a List<Integer> to an int[] with automatic resizing.
GROWTH_FACTOR was set to 1.5.
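A rough, hypothetical sketch of this layout (the class and field names are
mine, mirroring the parent[node]/next[node]/value[node]/childCount[node]
description above, not the patch's actual FPTree):
{code:java}
import java.util.Arrays;

// Hypothetical sketch of the array-backed layout from steps 1-3 above.
final class ArrayFPTree {
  private static final float GROWTH_FACTOR = 1.5f;

  private int[] parent;      // parent[node]
  private int[] next;        // next[node]: next node carrying the same item
  private int[] attribute;   // value[node]: the item stored at this node
  private int[] childCount;  // childCount[node]
  private int[][] children;  // children[node]: ragged array of child node ids
  private int nodes;

  ArrayFPTree(int initialCapacity) {
    parent = new int[initialCapacity];
    next = new int[initialCapacity];
    attribute = new int[initialCapacity];
    childCount = new int[initialCapacity];
    children = new int[initialCapacity][];
    children[0] = new int[2];                          // node 0 is the root
    nodes = 1;
  }

  int addChild(int parentNode, int item) {
    if (nodes == parent.length) {
      growNodeArrays();                                // fixed properties grow together
    }
    int node = nodes++;
    parent[node] = parentNode;
    attribute[node] = item;
    children[node] = new int[2];

    int n = childCount[parentNode];
    if (n == children[parentNode].length) {            // child slots full: resize and copy
      int newLength = (int) (n * GROWTH_FACTOR) + 1;
      children[parentNode] = Arrays.copyOf(children[parentNode], newLength);
    }
    children[parentNode][n] = node;
    childCount[parentNode] = n + 1;
    return node;
  }

  private void growNodeArrays() {
    int newLength = (int) (parent.length * GROWTH_FACTOR) + 1;
    parent = Arrays.copyOf(parent, newLength);
    next = Arrays.copyOf(next, newLength);
    attribute = Arrays.copyOf(attribute, newLength);
    childCount = Arrays.copyOf(childCount, newLength);
    children = Arrays.copyOf(children, newLength);
  }
}
{code}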
Instantly the memory for the 50K dataset came down to 400MB, and the time came
down to 8 seconds.
Secondly, I noticed that at every recursive step a new conditional tree has to
be generated, and Java was reallocating space for an FPTree each time.
I implemented FPTree.clear(), which just sets the node count to zero, and
wherever array allocation was done, I added a check for whether the array was
already allocated.
Then I kept K trees, where K is the maximum recursion depth, i.e. the largest
transaction size, and reused those K trees (whose space is already allocated)
by passing the List<FPTree> down to each recursive step.
Space for the 50K dataset came down to 180MB and runtime to 5 seconds.
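A minimal sketch of that reuse, under assumed names (FPTree here stands in for
the array-backed tree above, with a clear() that only resets the node count):
{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the per-recursion-depth tree cache described above.
final class ConditionalTreeCache {
  private final List<FPTree> trees = new ArrayList<FPTree>();

  // level is the current recursion depth; each depth gets one FPTree that is
  // allocated once and then only cleared, never reallocated.
  FPTree treeForLevel(int level) {
    while (trees.size() <= level) {
      trees.add(new FPTree());
    }
    FPTree tree = trees.get(level);
    tree.clear();               // resets the node count; keeps the allocated arrays
    return tree;
  }
}
{code}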
I tried to do the same for the function that merges two top-K pattern sets and
returns the top K, but that didn't give much saving, so I went back to using a
TreeMap.
Earlier, Ted had suggested checking the least element before inserting into the
heap to optimise the merge.
Yesterday it hit me that if the heap is already full, I not only don't need to
add a pattern with less support than the least element, I also don't need to
generate such patterns at all.
So, using a MutableInt shared across the recursion: at every recursive step, if
the heap is already full, the minimum support is raised to the least support in
the heap, and subsequent recursions prune away that portion of the tree.
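A hedged sketch of the idea (assuming Commons Lang's MutableInt and, for
brevity, at most one pattern per support value in the TreeMap):
{code:java}
import java.util.TreeMap;
import org.apache.commons.lang.mutable.MutableInt;

// Hypothetical sketch: a TreeMap-backed top-K collector that raises the
// shared minimum-support bound once it is full, so deeper recursions can
// skip generating anything below that bound.
final class TopKCollector {
  private final TreeMap<Long, int[]> patterns = new TreeMap<Long, int[]>();
  private final int k;
  private final MutableInt minSupport;   // shared with every recursive call

  TopKCollector(int k, MutableInt minSupport) {
    this.k = k;
    this.minSupport = minSupport;
  }

  void offer(int[] pattern, long support) {
    if (support < minSupport.longValue()) {
      return;                            // below the (possibly already raised) bound
    }
    patterns.put(support, pattern);
    if (patterns.size() > k) {
      patterns.pollFirstEntry();         // drop the least frequent entry
    }
    if (patterns.size() == k) {
      // heap is full: raise the bound so cheaper patterns are never generated
      minSupport.setValue(patterns.firstKey().intValue());
    }
  }
}
{code}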
And to make this pruning more aggressive, instead of the standard bottom-up
order of FPGrowth (where the long, low-support patterns are generated first), I
mine in a top-down order (where the high-support items are processed first), so
the minSupport bound gets raised quickly and we never even consider the
low-support patterns.
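Roughly, the traversal then looks like this (hypothetical names; the
header-table items are assumed to be sorted by descending frequency, and
MutableInt is the same assumption as in the previous sketch):
{code:java}
import org.apache.commons.lang.mutable.MutableInt;

// Hypothetical sketch of the top-down order: items are visited from most to
// least frequent, so once the shared bound has been raised past the current
// item's frequency, everything that remains can be skipped outright.
final class TopDownMiner {
  void mine(int[] headerTableAttributes, long[] attributeFrequency,
            MutableInt minSupport) {
    for (int attribute : headerTableAttributes) {      // most frequent item first
      if (attributeFrequency[attribute] < minSupport.longValue()) {
        break;       // the rest are even rarer; prune them without building trees
      }
      // build the conditional tree for this attribute and recurse (omitted)
    }
  }
}
{code}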
The runtimes I am seeing are amazing:
For 50K transactions, even if I set minSupport to 1000, it gets raised quickly
to 24K, and it takes a second to completely mine the top 1000 patterns with no
change in the output.
For 100K, the support gets raised to 61K; time taken is around 2 seconds.
For 200K, the support gets raised to 93K; time taken is around 3-4 seconds.
For 300K, the time taken is around 5 seconds.
(Database reading time not included.)
Memory utilisation also decreased: we use the space for the entire FPTree, plus
very little space for each conditional tree.
All of this can be found in September-18.patch.
I was thinking that if we can figure out a way to converge on the right
minSupport at the very beginning, the tree could be pruned up front and we
would have a very fast FPGrowth implementation.
More next week.
PS: Please don't review these patches yet (they may or may not have
documentation, sometimes no license headers, and tons of commented-out code).
> Frequent Pattern Mining using Parallel FP-Growth
> ------------------------------------------------
>
> Key: MAHOUT-157
> URL: https://issues.apache.org/jira/browse/MAHOUT-157
> Project: Mahout
> Issue Type: New Feature
> Affects Versions: 0.2
> Reporter: Robin Anil
> Assignee: Robin Anil
> Fix For: 0.2
>
> Attachments: MAHOUT-157-August-17.patch, MAHOUT-157-August-24.patch,
> MAHOUT-157-August-31.patch, MAHOUT-157-August-6.patch,
> MAHOUT-157-Combinations-BSD-License.patch,
> MAHOUT-157-Combinations-BSD-License.patch,
> MAHOUT-157-inProgress-August-5.patch, MAHOUT-157-September-10.patch,
> MAHOUT-157-September-18.patch, MAHOUT-157-September-5.patch
>
>
> Implement: http://infolab.stanford.edu/~echang/recsys08-69.pdf
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.