GitHub user koertkuipers commented on the pull request:

    https://github.com/apache/spark/pull/1698#issuecomment-50900320
  
    I can see your point about 10M columns.
    
    It would be really nice if we had a lazy and efficient allReduce(RDD[T],
    (T, T) => T): RDD[T].
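
    A minimal sketch of what such an allReduce could look like, built only on
    existing RDD operations (allReduce is not part of Spark's API; this is an
    assumed emulation, and the reduce call below is an eager action, which is
    exactly the laziness concern raised next):

        import org.apache.spark.rdd.RDD
        import scala.reflect.ClassTag

        // Emulated all-reduce: combine every element with op and make the
        // result available in every partition. NOT lazy: reduce() is an eager
        // Spark action, and the combined value passes through the driver.
        def allReduce[T: ClassTag](rdd: RDD[T], op: (T, T) => T): RDD[T] = {
          val combined = rdd.reduce(op)                  // eager action
          val bc = rdd.sparkContext.broadcast(combined)  // ship result to executors
          rdd.mapPartitions(_ => Iterator(bc.value))     // one copy per partition
        }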
    
    An RDD transform that is not lazy, and that triggers multiple Spark actions
    the user did not explicitly start, is tricky to me. It is already difficult
    enough to get the cache and unpersist logic right without unexpected actions.
    
    
    On Fri, Aug 1, 2014 at 11:43 AM, Xiangrui Meng <notificati...@github.com>
    wrote:
    
    > What if you have 10M columns? I agree that not sending data to the driver
    > is a good practice. But the current operations reduceByKey and cartesian
    > are not optimized for very big data. Please test it on a cluster with many
    > partitions and you should see the bottleneck.
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/1698#issuecomment-50899568>.
    >
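
    For context on the quoted bottleneck: cartesian pairs every element with
    every element, so both the record count and the partition count grow
    quadratically. A minimal illustration, assuming an existing SparkContext
    sc and nothing about the actual code in this PR:

        // Illustrative only; not the code under review in this PR.
        val rdd = sc.parallelize(1 to 1000, 100)  // n = 1000 elements, p = 100 partitions
        val pairs = rdd.cartesian(rdd)            // n * n = 1,000,000 pairs
        println(pairs.partitions.length)          // p * p = 10,000 partitions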

