Github user feynmanliang commented on the pull request:

    https://github.com/apache/spark/pull/7258#issuecomment-119356868
  
    That's all for now! Overall:
    
     * I am concerned about the scalability of grouping all the prefix-projected 
databases by prefix (Prefixspan.scala#109) and then processing them locally. This 
may cause the entire dataset to be sent to a single worker (see inline 
comments).
         * The algorithm is not truly distributed: only the first iteration 
(length-1 prefixes) is performed in a distributed manner on `RDD`s; all 
remaining iterations are done locally.
     * There is a lot of duplication between 
`getPatternsInLocal`/`getPatternsWithPrefix` and the other methods. The former 
two appear to do the same thing, only on local `Array[Array[_]]` rather than 
`RDD[_]`. If we can refactor them into a common method operating on `RDD[_]`, 
I think we can both solve the scalability problem and DRY up the code.
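
    To make the skew concern concrete, here is a toy Python sketch (hypothetical data, not the PR's Scala code) of why grouping prefix-projected postfixes by prefix can concentrate nearly the whole dataset under one key:

```python
# Toy illustration of the scalability concern: after projecting every
# sequence by each of its length-1 prefixes, grouping the projected
# postfixes by prefix (a groupByKey-style shuffle) sends all postfixes
# for a very frequent prefix to a single worker.

from collections import defaultdict

def project(sequence, prefix):
    """Postfix after the first occurrence of `prefix`, or None if absent."""
    if prefix in sequence:
        i = sequence.index(prefix)
        return sequence[i + 1:]
    return None

# Hypothetical database: 'a' occurs in every sequence, so its projected
# database is roughly the whole dataset.
db = [['a', 'b', 'c'], ['a', 'c', 'b'], ['a', 'b', 'b'], ['a', 'c', 'c']]

projected = defaultdict(list)  # stand-in for grouping (prefix, postfix) pairs
for seq in db:
    for prefix in set(seq):
        postfix = project(seq, prefix)
        if postfix:  # drop empty postfixes
            projected[prefix].append(postfix)

# Every sequence contributes a postfix under 'a': the worker owning that
# key would receive (and locally mine) nearly the entire dataset.
print(len(projected['a']))  # -> 4, i.e. one postfix per sequence in db
```

    A fully distributed variant would keep the projected postfixes as an `RDD` keyed by prefix and extend prefixes iteratively with distributed counting, rather than collecting each prefix's projected database for local mining.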

