[GitHub] spark pull request: [SPARK-4001][MLlib] adding parallel FP-Growth ...

mengxr Mon, 26 Jan 2015 11:55:51 -0800

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2847#issuecomment-71526153
  
    > 1 I mean I use step 1(that Equivalent to create FPTree and condition 
FPTree (only include frequently item not transition data), when using condition 
FPTree mining frequently item set, it is have a small candidate set.
    
    The advantage of FP-Growth over Apriori is the tree structure it uses to 
represent the candidate set. Both algorithms take advantage of the fact that 
the candidate set is small. I'm asking whether the current implementation uses 
the tree structure to save communication.
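
    To illustrate why the tree can save communication, here is a minimal 
plain-Python sketch of an FP-tree as a prefix tree over frequency-ordered 
transactions. It is not the code in this pull request; `Node`, 
`build_fp_tree`, `count_nodes`, and the toy transactions are hypothetical 
names and data chosen for the example.

```python
from collections import Counter


class Node:
    """One FP-tree node: an item, its count, and child nodes keyed by item."""

    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_count):
    """Insert each transaction, with its frequent items sorted by descending
    frequency, into a prefix tree so that shared prefixes share nodes."""
    freq = Counter(item for t in transactions for item in set(t))
    root = Node(None)
    for t in transactions:
        items = [i for i in set(t) if freq[i] >= min_count]
        items.sort(key=lambda i: (-freq[i], i))  # ties broken by item name
        node = root
        for item in items:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root, freq


def count_nodes(node):
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())


transactions = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c"],
    ["a", "b", "c"],
    ["b", "d"],
]
tree, freq = build_fp_tree(transactions, min_count=2)

# Frequent item occurrences in the raw transactions vs. nodes in the tree.
raw_items = sum(1 for t in transactions for i in set(t) if freq[i] >= 2)
tree_nodes = count_nodes(tree)
print(raw_items, tree_nodes)
```

    On this toy input the frequent items occur 11 times in the raw 
transactions, while the tree needs only 5 nodes, because transactions that 
share a prefix share nodes and only bump the counts.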
    
    > 2 I have test it and compared mahout pfp, it is a good performance that 
about 10 time.
    
    I'm not surprised by the 10x speed-up. But that does not mean the 
current implementation is correct and high-performance. I believe that we can 
be much faster.
    
    > 3 afer use groupByKey,ming frequently item set in each node that include 
Specified key, so it is not network communication overhead.
    
    `groupByKey` collects everything to reducers. `aggregateByKey` does part of 
the aggregation on mappers. There is definitely room for improvement.
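
    The difference can be sketched with a plain-Python simulation (this is 
not Spark code, and the function names and data below are hypothetical). Each 
inner list stands for one mapper-side partition, and the second return value 
counts the records that would cross the shuffle.

```python
from collections import defaultdict


def shuffle_group_by_key(partitions):
    """Ship every (key, value) record to the reducer side, then group."""
    shuffled = [record for part in partitions for record in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    return dict(grouped), len(shuffled)


def shuffle_aggregate_by_key(partitions, zero, seq_op, comb_op):
    """Fold values per key inside each partition first (map-side combine),
    then merge the per-partition partial results on the reducer side."""
    shuffled = []
    for part in partitions:
        partial = {}
        for key, value in part:
            partial[key] = seq_op(partial.get(key, zero), value)
        shuffled.extend(partial.items())  # one record per key per partition
    merged = {}
    for key, acc in shuffled:
        merged[key] = comb_op(merged[key], acc) if key in merged else acc
    return merged, len(shuffled)


partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

grouped, n_group = shuffle_group_by_key(partitions)
counts, n_agg = shuffle_aggregate_by_key(
    partitions, 0, lambda acc, v: acc + v, lambda x, y: x + y
)
print(n_group, n_agg, counts)
```

    Here the grouping variant moves all 7 records, while the aggregating 
variant moves 4 partial results (one per key per partition) and still arrives 
at the same per-key totals.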
    
    > 4 is there have aggregateByKey operator in new spark version?
    
    
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions


