Github user zhangyouhua2014 commented on the pull request:

    https://github.com/apache/spark/pull/2847#issuecomment-71163811
  
    @mengxr I am working with Jacky to develop and test this algorithm, so I
will answer this question:
    We follow the PFP paper, but we omit the FP-tree construction step; the
time saved there can be used for other work. The specific steps are as follows:
    1. The transaction database DB is distributed across multiple worker
nodes; after two scans of the database we obtain the conditional pattern
sequences.
       1.1. The first scan of DB produces the frequent-item list L1. For
example: (a, 6), (b, 5), (c, 3).
       1.2. Using the L1 from 1.1, a second scan of DB filters out the
non-frequent items and yields the conditional pattern sequences conditionSEQ.
For example: (c, (a, b)), (b, (a)).
       After these two scans of DB, conditionSEQ carries much less
information than DB itself.
    2. A reduce operation using the groupByKey operator shuffles conditionSEQ
so that all sequences with the same key are merged on the same worker. The
frequent itemsets are then mined from these grouped conditionSEQ.
    3. On each worker, the Apriori principle is applied to its local
collection of conditionSEQ to find the frequent itemsets.
    4. Finally, the collect operator aggregates the results.
      In short, the algorithm spreads DB across multiple worker nodes, needs
only two scans to obtain the compact conditional pattern sequences
conditionSEQ, and mines the frequent itemsets with a single reduce over
conditionSEQ; the network interaction is small, so it is fast.
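
    The steps above can be sketched in plain Python (no Spark): a `Counter`
stands in for the first-scan item counting, a dict-of-lists stands in for
groupByKey, and brute-force subset counting stands in for the Apriori-style
local mining. The sample transactions and the min_support value are made-up
illustrative inputs, not values from the PR.

```python
from collections import Counter, defaultdict
from itertools import combinations

def mine(db, min_support):
    """Mine frequent itemsets from db following the four steps above."""
    # Step 1.1: first scan -- item counts; frequent-item list L1 in
    # descending-frequency order.
    counts = Counter(item for t in db for item in t)
    l1 = [i for i, c in counts.most_common() if c >= min_support]
    rank = {item: r for r, item in enumerate(l1)}

    # Step 1.2: second scan -- drop non-frequent items, order each
    # transaction by L1 rank, emit (item, prefix) conditional sequences.
    condition_seq = []
    for t in db:
        filtered = sorted((i for i in t if i in rank), key=rank.get)
        for pos, item in enumerate(filtered):
            condition_seq.append((item, tuple(filtered[:pos])))

    # Step 2: groupByKey stand-in -- all sequences with the same key
    # land together.
    grouped = defaultdict(list)
    for item, prefix in condition_seq:
        grouped[item].append(prefix)

    # Step 3: local mining per key -- count the non-empty subsets of the
    # prefixes (brute-force Apriori-style enumeration for this sketch).
    # Frequent singletons come straight from the first-scan counts.
    frequent = {frozenset([i]): c for i, c in counts.items()
                if c >= min_support}
    for item, prefixes in grouped.items():
        subset_counts = Counter()
        for p in prefixes:
            for k in range(1, len(p) + 1):
                for s in combinations(p, k):
                    subset_counts[frozenset(s)] += 1
        for s, c in subset_counts.items():
            if c >= min_support:
                frequent[s | {item}] = c

    # Step 4: collect stand-in -- return the merged result.
    return frequent

db = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["a"], ["b"]]
for itemset, count in sorted(mine(db, 2).items(),
                             key=lambda kv: (-kv[1], len(kv[0]))):
    print(sorted(itemset), count)
```

    The subset counting is correct because every transaction contributes
exactly one prefix per item it contains, so the number of prefixes of `item`
containing a set S equals the support of S together with `item`.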



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]