Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/2847#issuecomment-71526153
> 1 I mean that step 1 (which is equivalent to creating the FP-tree and the
> conditional FP-tree) has already reduced the data size and created the
> conditional FP-tree (it includes only frequent items, not the transaction
> data), so mining frequent itemsets from the conditional FP-tree works with a
> small candidate set.
The advantage of FP-Growth over Apriori is the tree structure used to
represent the candidate set. Both algorithms take advantage of the fact that
the candidate set is small. I'm asking whether the current implementation uses
the tree structure to save communication.
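To illustrate why the tree representation is compact, here is a minimal sketch in plain Python (not the code in this PR, and not a Spark API). The function names, the sample transactions, and the `min_count` threshold are all hypothetical. It builds a prefix tree over frequent items only and compares the number of tree nodes with the number of item occurrences in the filtered transactions.

```python
from collections import Counter

def build_prefix_tree(transactions, min_count):
    """Build an FP-tree-like prefix tree as nested dicts: item -> [count, children]."""
    # Count in how many transactions each item appears.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i for i, c in counts.items() if c >= min_count}
    root = {}
    for t in transactions:
        # Drop infrequent items, then order by descending frequency
        # (ties broken by item name) so that common prefixes are shared.
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            entry = node.setdefault(item, [0, {}])
            entry[0] += 1
            node = entry[1]
    return root, frequent

def count_nodes(tree):
    """Number of nodes in the prefix tree (excluding the implicit root)."""
    return sum(1 + count_nodes(children) for _, children in tree.values())

# Hypothetical sample data, for illustration only.
transactions = [["a", "b", "c"], ["a", "b"], ["a", "b", "d"], ["a", "c"], ["b", "c"]]
tree, frequent = build_prefix_tree(transactions, min_count=2)

tree_nodes = count_nodes(tree)
raw_items = sum(1 for t in transactions for i in set(t) if i in frequent)
print(tree_nodes, raw_items)
```

On this sample the tree has fewer nodes than the filtered transactions have item occurrences, because transactions that share a prefix share nodes. Whether shipping such trees instead of raw transactions saves communication in practice depends on how much the real data overlaps.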
> 2 I have tested it and compared it with Mahout PFP; the performance is
> good, about 10 times faster.
I'm not surprised by the 10x speed-up. But that is not the same as saying the
current implementation is correct and high-performance. I believe that we can
be much faster.
> 3 After groupByKey, frequent itemsets are mined on each node that holds the
> specified key, so there is no network communication overhead.
`groupByKey` collects everything to reducers. `aggregateByKey` does part of
the aggregation on mappers. There is definitely room for improvement.
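The difference can be sketched with a small plain-Python simulation. This is not Spark code: the helper names and the sample partitions are hypothetical, and the "shuffle" is just a list whose length stands in for the number of records sent to reducers. The `zero`, `seq_op`, and `comb_op` parameters mirror the roles of the arguments to Spark's `aggregateByKey`.

```python
from collections import defaultdict

def shuffle_group_by_key(partitions):
    """Simulate groupByKey followed by a sum: every record is shuffled as-is."""
    shuffled = [rec for part in partitions for rec in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    result = {key: sum(values) for key, values in grouped.items()}
    return result, len(shuffled)

def shuffle_aggregate_by_key(partitions, zero, seq_op, comb_op):
    """Simulate aggregateByKey: combine within each partition before the shuffle."""
    shuffled = []
    for part in partitions:
        local = {}
        for key, value in part:
            # Map-side step: fold each value into the per-partition accumulator.
            local[key] = seq_op(local.get(key, zero), value)
        # Only one record per key per partition is shuffled.
        shuffled.extend(local.items())
    result = {}
    for key, acc in shuffled:
        # Reduce-side step: merge the per-partition accumulators.
        result[key] = comb_op(result[key], acc) if key in result else acc
    return result, len(shuffled)

# Hypothetical input: two partitions of (item, 1) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("a", 1), ("b", 1)],
]

grouped_result, grouped_records = shuffle_group_by_key(partitions)
agg_result, agg_records = shuffle_aggregate_by_key(
    partitions, 0, lambda acc, v: acc + v, lambda x, y: x + y)
print(grouped_records, agg_records)
```

Both paths produce the same counts, but the aggregated path shuffles one record per key per partition rather than one per input record. The saving grows with the number of repeated keys inside each partition.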
> 4 Is there an aggregateByKey operator in the new Spark version?
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions