Github user chouqin commented on the pull request:

    https://github.com/apache/spark/pull/2868#issuecomment-59893259
  
    @codedeft Thanks for your nice work. I have added some comments inline. 
Here are some high level comments:
    
    1. Have you tested the performance after this change?As discussed in 
[SPARK-3161](https://issues.apache.org/jira/browse/SPARK-3161), This will help 
little for shallow trees. Then how much performance gain will this change give 
for deep trees? If it gives much gain, I think we should add more unit test for 
this option and refactor the code to address code reuse(for example, there are 
some duplication between `binSeqOp` and `binSeqOpWithNodeIdCache`)
    
    2. Is checkpoint really necessary to avoid long lineage? Maybe my 
understanding is not right, to my knowledge, each time we do aggregation, the 
`nodeIdCache` will be computed. If we persist it into disk(using 
`persist(StorageLevel.MEMORY_AND_DISK)`) and unpersist it if it is not needed 
anymore, then is will persist in the disk each time it gets computed. We can 
remove checkpoint and make the code cleaner and faster(checkpoint will persist 
RDD into a distributed file system).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to