[ 
https://issues.apache.org/jira/browse/MAHOUT-157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-157:
------------------------------

    Attachment: MAHOUT-157-September-18.patch
                MAHOUT-157-September-10.patch

This is the story of optimizing the FPGrowth implementation, step by step, and 
how a linear speedup was achieved.

A dataset of 300K transactions with 100 features takes just 5 seconds to mine 
the top 1000 patterns (ignoring the time taken to read the data).

Background
      The FPGrowth algorithm takes a list of itemsets (an itemset contains 
multiple items) and a support value (the minimum count a mined pattern must 
satisfy to be considered a frequent pattern).
      Our aim here was to implement mining of the top-K frequent patterns.

Algorithm Description
      An FPTree is a compact tree representation of all itemsets, where each 
branch denotes one itemset; duplicate itemsets are represented in a single 
branch with their counts incremented. Each itemset is sorted according to the 
counts of its items in the whole database, so the most frequent items appear 
at the beginning (near the root) of the tree, and a branch splits wherever the 
next-highest item differs.
      At every recursive stage, a unique item is taken from the tree, a 
conditional tree is generated, and it is mined for the rest of the patterns. 
The current item is then prepended to every pattern returned by the recursion.
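As a minimal, hypothetical sketch of the insertion side of this (illustrative names, not the patch's actual classes): items in each transaction are sorted by descending global frequency before insertion, so shared prefixes collapse into shared branches with summed counts.

```java
import java.util.*;

// Minimal node-based FP-tree sketch (illustrative only, not the patch's classes).
class FPTreeSketch {
    static class Node {
        final int item;
        int count;
        final Map<Integer, Node> children = new HashMap<>();
        Node(int item) { this.item = item; }
    }

    final Node root = new Node(-1);
    final Map<Integer, Integer> freq = new HashMap<>();

    // First pass: count global item frequencies across the whole database.
    void count(int[][] transactions) {
        for (int[] t : transactions)
            for (int item : t)
                freq.merge(item, 1, Integer::sum);
    }

    // Second pass: insert each transaction with items sorted by descending
    // global frequency, so duplicate prefixes share a branch.
    void insert(int[] transaction) {
        Integer[] items = Arrays.stream(transaction).boxed().toArray(Integer[]::new);
        Arrays.sort(items, (a, b) -> freq.get(b) - freq.get(a));
        Node cur = root;
        for (int item : items) {
            cur = cur.children.computeIfAbsent(item, Node::new);
            cur.count++;
        }
    }
}
```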

Implementation
    I started like every other Java programmer: a Node class representing the 
tree, with parent, next (at the same level), and a list of children. I was 
never able to get past 50K transactions with a minSupport of 10K, which is 
quite large.

Diligently, I started replacing HashMaps with ArrayLists, Queues with 
ArrayDeques, and PriorityQueues with TreeMaps; with YourKit I found the 
bottlenecks and brought the whole run down to 15 seconds for 50K transactions 
with minSupport 10K, down from about 1:40.

You can find that patch as September-10.patch

Memory usage was huge: nearly 1GB for the 50K dataset. Finding the top 1000 
patterns for 200K transactions took 50 minutes with a 1.6GB heap (half the 
time I felt the program was swapping in and out).

After reading the IBM presentation, I went on a crusade to use primitives as 
much as possible.

The first problem was that TreeNode was still a class, and one of the major 
bottlenecks.
1. Converted the tree into multiple int[] arrays holding the fixed per-node 
properties: parent[node], next[node], value[node], childCount[node].
2. Converted the children into an int[][], resizing and copying a child array 
dynamically when the number of children exceeds the allocated limit.
3. Converted Pattern from a List<Integer> to an automatically resizing int[].
GROWTH_FACTOR was set to 1.5.
Instantly, the memory for the 50K dataset came down to 400MB and the time came 
down to 8 seconds.
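A hedged sketch of the struct-of-arrays idea (field and method names here are hypothetical, not the patch's): a "node" is just an index into parallel int[] arrays, and both the fixed arrays and the per-node child arrays grow by a factor of 1.5 on demand.

```java
// Illustrative struct-of-arrays tree: a "node" is an index into parallel
// int[] arrays (names are hypothetical, not the patch's actual fields).
class ArrayTree {
    static final double GROWTH_FACTOR = 1.5;

    int nodes = 0;
    int[] parent = new int[8];
    int[] value = new int[8];
    int[] childCount = new int[8];
    int[][] children = new int[8][];

    int createNode(int parentId, int item) {
        if (nodes == parent.length) {            // grow all fixed arrays together
            int cap = (int) (parent.length * GROWTH_FACTOR) + 1;
            parent = java.util.Arrays.copyOf(parent, cap);
            value = java.util.Arrays.copyOf(value, cap);
            childCount = java.util.Arrays.copyOf(childCount, cap);
            children = java.util.Arrays.copyOf(children, cap);
        }
        int id = nodes++;
        parent[id] = parentId;
        value[id] = item;
        if (parentId >= 0) addChild(parentId, id);
        return id;
    }

    private void addChild(int p, int child) {
        if (children[p] == null) children[p] = new int[2];
        else if (childCount[p] == children[p].length)   // resize-and-copy on overflow
            children[p] = java.util.Arrays.copyOf(
                children[p], (int) (children[p].length * GROWTH_FACTOR) + 1);
        children[p][childCount[p]++] = child;
    }
}
```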

Secondly, I noticed that at every recursive step a new conditional tree had to 
be generated, and Java was reallocating space for an FPTree each time. 
I implemented FPTree.clear(), which just sets the node count to zero, and 
wherever array allocation was done I added a check for whether the array was 
already allocated.

Then I kept K trees, where K is the recursion depth, i.e. the largest size of 
a transaction. I reused these K trees, whose space is already allocated, by 
passing the List<FPTree> down to each recursive step.

The space for the 50K transactions came down to 180MB, and the runtime to 5 
seconds.
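A hypothetical sketch of this reuse trick (class names are mine, not the patch's): clear() only resets the node count in O(1), and a pool holds one pre-allocated tree per recursion depth, so no conditional tree is ever reallocated.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: reuse one pre-allocated tree per recursion depth
// instead of allocating a fresh conditional tree at every recursive step.
class ReusableTree {
    int nodeCount = 0;
    int[] value = new int[16];

    int addNode(int item) {
        if (nodeCount == value.length)           // allocate only when actually needed
            value = java.util.Arrays.copyOf(value, value.length * 2);
        value[nodeCount] = item;
        return nodeCount++;
    }

    // O(1) "deallocation": the arrays stay allocated, only the count resets.
    void clear() { nodeCount = 0; }
}

class TreePool {
    private final List<ReusableTree> trees = new ArrayList<>();

    // One tree per recursion depth; the tree for a depth is cleared and reused.
    ReusableTree treeForDepth(int depth) {
        while (trees.size() <= depth) trees.add(new ReusableTree());
        ReusableTree t = trees.get(depth);
        t.clear();
        return t;
    }
}
```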

I tried to do the same for the function that merges two top-K pattern sets 
into one top-K set, but that didn't give much saving, so I went back to using 
a TreeMap.

Earlier, Ted had suggested checking the least element before inserting into 
the heap to optimize the merge. 
Yesterday it hit me that if the heap is already full, I not only don't need to 
add patterns with lower support than the least one, I also don't need to 
generate such patterns at all.

So, with a MutableInt, at every recursive step, if the heap is already full, 
the minimum support is raised, and subsequent recursions prune away that 
portion of the tree.
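A hedged sketch of the pruning idea (using a plain java.util.PriorityQueue and an int field standing in for the MutableInt; all names are assumptions): once the top-K heap is full, minSupport jumps to the heap's smallest support, so later recursions can skip whole subtrees below that threshold without ever generating their patterns.

```java
import java.util.PriorityQueue;

// Sketch of dynamic support raising: once the top-K heap is full, minSupport
// is raised to the heap's smallest support, pruning all later low-support work.
class TopKSupport {
    final int k;
    int minSupport;                               // shared, raised as mining proceeds
    final PriorityQueue<Integer> heap = new PriorityQueue<>(); // min-heap of supports

    TopKSupport(int k, int initialMinSupport) {
        this.k = k;
        this.minSupport = initialMinSupport;
    }

    // Returns false when the pattern (and the subtree below it) can be skipped.
    boolean offer(int support) {
        if (support < minSupport) return false;   // prune: don't even generate it
        heap.offer(support);
        if (heap.size() > k) heap.poll();         // drop the weakest pattern
        if (heap.size() == k)
            minSupport = Math.max(minSupport, heap.peek()); // raise the bar
        return true;
    }
}
```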

And to make all this more aggressive, instead of the standard bottom-up order 
of FPGrowth (where long patterns are generated first), I started mining in 
top-down order (where high-support items are generated first), so minSupport 
gets raised quickly and we never consider low-support patterns.
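The ordering change itself is just a different iteration order over the header items; a tiny illustrative helper (hypothetical, not the patch's code) would visit items by descending support so the heap fills with strong patterns first:

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative: process items in descending-support (top-down) order so
// high-support patterns fill the top-K heap early and raise minSupport fast.
class MiningOrder {
    static Integer[] topDownOrder(int[] support) {
        Integer[] order = new Integer[support.length];
        for (int i = 0; i < support.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingInt((Integer i) -> -support[i]));
        return order; // item indices, highest support first
    }
}
```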

The runtimes I am seeing are amazing:

For 50K, even if I set minSupport to 1000, it quickly gets raised to 24K, and 
mining the top 1000 patterns takes one second, with no change in the output.
For 100K, support gets raised to 61K; time taken is around 2 seconds.
For 200K, support gets raised to 93K; time taken is around 3-4 seconds.
For 300K, the time taken is around 5 seconds.
(Database reading time not included.)

Memory utilisation also decreased: we use space for the entire FPTree, but 
very little space for each conditional tree.
All of this can be found in September-18.patch.


I was thinking that if we could figure out a way to converge on the minSupport 
at the very beginning, the tree could be pruned up front and we would have a 
very fast FPGrowth implementation.

More next week.
PS: Please don't review these patches (they may or may not have documentation, 
sometimes no license, and tons of commented-out code).



> Frequent Pattern Mining using Parallel FP-Growth
> ------------------------------------------------
>
>                 Key: MAHOUT-157
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-157
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.2
>            Reporter: Robin Anil
>            Assignee: Robin Anil
>             Fix For: 0.2
>
>         Attachments: MAHOUT-157-August-17.patch, MAHOUT-157-August-24.patch, 
> MAHOUT-157-August-31.patch, MAHOUT-157-August-6.patch, 
> MAHOUT-157-Combinations-BSD-License.patch, 
> MAHOUT-157-Combinations-BSD-License.patch, 
> MAHOUT-157-inProgress-August-5.patch, MAHOUT-157-September-10.patch, 
> MAHOUT-157-September-18.patch, MAHOUT-157-September-5.patch
>
>
> Implement: http://infolab.stanford.edu/~echang/recsys08-69.pdf

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
