Github user chouqin commented on the pull request:
https://github.com/apache/spark/pull/2868#issuecomment-59893259
@codedeft Thanks for your nice work. I have added some comments inline.
Here are some high level comments:
1. Have you tested the performance after this change?As discussed in
[SPARK-3161](https://issues.apache.org/jira/browse/SPARK-3161), This will help
little for shallow trees. Then how much performance gain will this change give
for deep trees? If it gives much gain, I think we should add more unit test for
this option and refactor the code to address code reuse(for example, there are
some duplication between `binSeqOp` and `binSeqOpWithNodeIdCache`)
2. Is checkpoint really necessary to avoid long lineage? Maybe my
understanding is not right, to my knowledge, each time we do aggregation, the
`nodeIdCache` will be computed. If we persist it into disk(using
`persist(StorageLevel.MEMORY_AND_DISK)`) and unpersist it if it is not needed
anymore, then is will persist in the disk each time it gets computed. We can
remove checkpoint and make the code cleaner and faster(checkpoint will persist
RDD into a distributed file system).
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]