Re: GraphX implementation of ALS?
On 5/26/15 5:45 PM, Ankur Dave wrote: This is the latest GraphX-based ALS implementation that I'm aware of: https://github.com/ankurdave/spark/blob/GraphXALS/graphx/src/main/scala/org/apache/spark/graphx/lib/ALS.scala When I benchmarked it last year, it was about twice as slow as MLlib's ALS, and I think the latter has gotten faster since then. The performance gap is because the MLlib version implements some ALS-specific optimizations that are hard to do using GraphX, such as storing the edges twice (partitioned by source and by destination) to reduce communication. Ankur http://www.ankurdave.com/ Great, thanks for the link and explanation!
Re: SparkR and RDDs
From the changes to the NAMESPACE file, that appears to be correct: all methods of the RDD API have been made private, which in R means that you may still access them by using the namespace prefix SparkR with three colons, e.g. SparkR:::func(foo, bar). So a starting place for porting old SparkR scripts from before the merge could be to identify the methods in the script that belong to the RDD class and make sure they have the namespace identifier tacked on the front. I hope that helps. Regards, Alek Eskilson From: Andrew Psaltis psaltis.and...@gmail.com Date: Monday, May 25, 2015 at 6:25 PM To: dev@spark.apache.org Subject: SparkR and RDDs Hi, I understand from SPARK-6799 [1] and the respective merge commit [2] that the RDD class is private in Spark 1.4. If I wanted to modify the old Kmeans and/or LR examples so that the computation happened in Spark, what is the best direction to go? Sorry if I am missing something obvious, but based on the NAMESPACE file [3] in the SparkR codebase I am having trouble seeing the obvious direction to go. Thanks in advance, Andrew [1] https://issues.apache.org/jira/browse/SPARK-6799 [2] https://github.com/apache/spark/commit/4b91e18d9b7803dbfe1e1cf20b46163d8cb8716c [3] https://github.com/apache/spark/blob/branch-1.4/R/pkg/NAMESPACE
Re: [VOTE] Release Apache Spark 1.4.0 (RC2)
+1 Tested the SparkR binaries using a Standalone Hadoop 1 cluster, a YARN Hadoop 2.4 cluster, and on my Mac. A minor thing I noticed is that on the Amazon Linux AMI the R version is 3.1.1 while the binaries seem to have been built with R 3.1.3. This leads to a warning when we load the package but does not affect any functionality. FWIW, 3.1.1 is a year old and 3.2 is the current stable version. Thanks Shivaram On Tue, May 26, 2015 at 8:58 AM, Iulian Dragoș iulian.dra...@typesafe.com wrote: I tried the 1.4.0-rc2 binaries on a 3-node Mesos cluster; everything seemed to work fine, both spark-shell and spark-submit. Cluster mode deployment also worked. +1 (non-binding) iulian On Tue, May 26, 2015 at 4:44 AM, jameszhouyi yiaz...@gmail.com wrote: Compiled: git clone https://github.com/apache/spark.git git checkout tags/v1.4.0-rc2 ./make-distribution.sh --tgz --skip-java-test -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0 -Phive -Phive-0.13.1 -Phive-thriftserver -DskipTests Blocker issue in RC1/RC2: https://issues.apache.org/jira/browse/SPARK-7119 -- Iulian Dragos -- Reactive Apps on the JVM www.typesafe.com
Re: SparkR and RDDs
You definitely don't want to implement kmeans in R, since it would be very slow. Just providing R wrappers for the MLlib implementation is the way to go, and I believe one of the major next items for SparkR is the MLlib wrappers. On Tue, May 26, 2015 at 7:46 AM, Andrew Psaltis psaltis.and...@gmail.com wrote: Hi Alek, Thanks for the info. You are correct that using the three colons does work. Admittedly I am an R novice, but since the three colons are used to access hidden methods, it seems pretty dirty. Can someone shed light on the design direction being taken with SparkR? Should I really be accessing hidden methods, or will a better approach prevail? For instance, it feels like the k-means sample should really use MLlib and not just be a port of the k-means sample using hidden methods. Am I looking at this incorrectly? Thanks, Andrew On Tue, May 26, 2015 at 6:56 AM, Eskilson,Aleksander alek.eskil...@cerner.com wrote: From the changes to the NAMESPACE file, that appears to be correct: all methods of the RDD API have been made private, which in R means that you may still access them by using the namespace prefix SparkR with three colons, e.g. SparkR:::func(foo, bar). So a starting place for porting old SparkR scripts from before the merge could be to identify the methods in the script that belong to the RDD class and make sure they have the namespace identifier tacked on the front. I hope that helps. Regards, Alek Eskilson From: Andrew Psaltis psaltis.and...@gmail.com Date: Monday, May 25, 2015 at 6:25 PM To: dev@spark.apache.org Subject: SparkR and RDDs Hi, I understand from SPARK-6799 [1] and the respective merge commit [2] that the RDD class is private in Spark 1.4. If I wanted to modify the old Kmeans and/or LR examples so that the computation happened in Spark, what is the best direction to go? Sorry if I am missing something obvious, but based on the NAMESPACE file [3] in the SparkR codebase I am having trouble seeing the obvious direction to go. Thanks in advance, Andrew [1] https://issues.apache.org/jira/browse/SPARK-6799 [2] https://github.com/apache/spark/commit/4b91e18d9b7803dbfe1e1cf20b46163d8cb8716c [3] https://github.com/apache/spark/blob/branch-1.4/R/pkg/NAMESPACE
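For comparison, PySpark already follows the pattern Shivaram describes: a thin wrapper hands the clustering work to MLlib rather than reimplementing the algorithm in the driver language. Here is a rough Python sketch of that pattern (the input path and data layout are hypothetical, purely for illustration):

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="kmeans-via-mllib-sketch")
# hypothetical input: one whitespace-separated feature vector per line
points = sc.textFile("data/kmeans_data.txt").map(lambda line: [float(x) for x in line.split()])
# the clustering itself runs inside MLlib on the cluster, not in driver-side Python
model = KMeans.train(points, k=2, maxIterations=10)
print(model.clusterCenters)
sc.stop()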
Re: [VOTE] Release Apache Spark 1.4.0 (RC2)
I tried the 1.4.0-rc2 binaries on a 3-node Mesos cluster; everything seemed to work fine, both spark-shell and spark-submit. Cluster mode deployment also worked. +1 (non-binding) iulian On Tue, May 26, 2015 at 4:44 AM, jameszhouyi yiaz...@gmail.com wrote: Compiled: git clone https://github.com/apache/spark.git git checkout tags/v1.4.0-rc2 ./make-distribution.sh --tgz --skip-java-test -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0 -Phive -Phive-0.13.1 -Phive-thriftserver -DskipTests Blocker issue in RC1/RC2: https://issues.apache.org/jira/browse/SPARK-7119 -- Iulian Dragos -- Reactive Apps on the JVM www.typesafe.com
Re: Spark 1.4.0 pyspark and pylint breaking
I think relative imports cannot help in this case. When you run scripts directly in pyspark/sql, Python doesn't know anything about the pyspark.sql package; it just sees types.py as a standalone module. On Tue, May 26, 2015 at 12:44 PM, Punyashloka Biswal punya.bis...@gmail.com wrote: Davies: Can we use relative imports (import .types) in the unit tests in order to disambiguate between the global and local module? Punya On Tue, May 26, 2015 at 3:09 PM Justin Uang justin.u...@gmail.com wrote: Thanks for clarifying! I don't understand Python package and module names that well, but I thought that the package namespacing would've helped, since you are in pyspark.sql.types. I guess not? On Tue, May 26, 2015 at 3:03 PM Davies Liu dav...@databricks.com wrote: There is a module called 'types' in Python 3: davies@localhost:~/work/spark$ python3 Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 00:54:21) [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> import types >>> types <module 'types' from '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/types.py'> Without renaming, our `types.py` will conflict with it when you run unittests in pyspark/sql/. On Tue, May 26, 2015 at 11:57 AM, Justin Uang justin.u...@gmail.com wrote: In commit 04e44b37 (the migration to Python 3), pyspark/sql/types.py was renamed to pyspark/sql/_types.py, and then some magic in pyspark/sql/__init__.py dynamically renamed the module back to types. I imagine that this is some naming conflict with Python 3, but what was the error that showed up? The reason I'm asking about this is that it's messing with pylint, since pylint can no longer statically find the module. I also tried importing the package so that __init__ would be run in an init-hook, but that isn't what the discovery mechanism is using; I imagine it's probably just crawling the directory structure. One way to work around this would be something akin to this (http://stackoverflow.com/questions/9602811/how-to-tell-pylint-to-ignore-certain-imports), where I would have to create a fake module, but I would probably be missing a ton of pylint features for users of that module, and it's pretty hacky.
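A small illustration (my own, not from the Spark tree) of why Davies' point at the top of this message holds: a relative import only works when the file is imported as part of its package, not when it is executed directly as a script, which is how the pyspark/sql files are run here.

# hypothetical layout: pkg/__init__.py, pkg/types.py, pkg/mod.py
# contents of pkg/mod.py:
from . import types as local_types  # fine when this module is imported as pkg.mod

# python -c "import pkg.mod"   -> works: Python knows 'pkg' is the parent package
# python pkg/mod.py            -> fails with an error about a relative import
#                                 outside a package (the exact message varies by
#                                 Python version), because the file is executed
#                                 as a standalone top-level script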
Re: Spark 1.4.0 pyspark and pylint breaking
When you run the tests in python/pyspark/sql/ via bin/spark-submit python/pyspark/sql/dataframe.py, the directory containing the script is the first item in sys.path, so sql/types.py will have higher priority than python3.4/types.py and the tests will fail. On Tue, May 26, 2015 at 12:08 PM, Justin Uang justin.u...@gmail.com wrote: Thanks for clarifying! I don't understand Python package and module names that well, but I thought that the package namespacing would've helped, since you are in pyspark.sql.types. I guess not? On Tue, May 26, 2015 at 3:03 PM Davies Liu dav...@databricks.com wrote: There is a module called 'types' in Python 3: davies@localhost:~/work/spark$ python3 Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 00:54:21) [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> import types >>> types <module 'types' from '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/types.py'> Without renaming, our `types.py` will conflict with it when you run unittests in pyspark/sql/. On Tue, May 26, 2015 at 11:57 AM, Justin Uang justin.u...@gmail.com wrote: In commit 04e44b37 (the migration to Python 3), pyspark/sql/types.py was renamed to pyspark/sql/_types.py, and then some magic in pyspark/sql/__init__.py dynamically renamed the module back to types. I imagine that this is some naming conflict with Python 3, but what was the error that showed up? The reason I'm asking about this is that it's messing with pylint, since pylint can no longer statically find the module. I also tried importing the package so that __init__ would be run in an init-hook, but that isn't what the discovery mechanism is using; I imagine it's probably just crawling the directory structure. One way to work around this would be something akin to this (http://stackoverflow.com/questions/9602811/how-to-tell-pylint-to-ignore-certain-imports), where I would have to create a fake module, but I would probably be missing a ton of pylint features for users of that module, and it's pretty hacky.
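To make the sys.path behaviour Davies describes concrete, here is a minimal stand-alone toy (my own example, not Spark code): run it from a directory that also contains a file named types.py and the local file shadows the standard-library module.

import sys

# the directory containing the executed script is put at the front of sys.path,
# ahead of the standard-library directories
print(sys.path[0])

# so if a sibling types.py exists, this binds the local file rather than the
# stdlib module -- the same collision that breaks the pyspark/sql tests
import types
print(types.__file__)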
Spark 1.4.0 pyspark and pylint breaking
In commit 04e44b37 (the migration to Python 3), pyspark/sql/types.py was renamed to pyspark/sql/_types.py, and then some magic in pyspark/sql/__init__.py dynamically renamed the module back to types. I imagine that this is some naming conflict with Python 3, but what was the error that showed up? The reason I'm asking about this is that it's messing with pylint, since pylint can no longer statically find the module. I also tried importing the package so that __init__ would be run in an init-hook, but that isn't what the discovery mechanism is using; I imagine it's probably just crawling the directory structure. One way to work around this would be something akin to this (http://stackoverflow.com/questions/9602811/how-to-tell-pylint-to-ignore-certain-imports), where I would have to create a fake module, but I would probably be missing a ton of pylint features for users of that module, and it's pretty hacky.
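Not a fix for the underlying issue, but one blunt workaround sketch for user code hitting this: silence the specific pylint checks on the affected import lines. The message names below are the usual ones, though they can differ slightly between pylint versions, so treat this as an assumption to verify.

# pylint cannot resolve pyspark.sql.types statically, so disable only these checks here
from pyspark.sql.types import StructType, StructField, StringType  # pylint: disable=no-name-in-module,import-error

schema = StructType([
    StructField("name", StringType(), True),
    StructField("city", StringType(), True),
])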
Re: Spark 1.4.0 pyspark and pylint breaking
Thanks for clarifying! I don't understand Python package and module names that well, but I thought that the package namespacing would've helped, since you are in pyspark.sql.types. I guess not? On Tue, May 26, 2015 at 3:03 PM Davies Liu dav...@databricks.com wrote: There is a module called 'types' in Python 3: davies@localhost:~/work/spark$ python3 Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 00:54:21) [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> import types >>> types <module 'types' from '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/types.py'> Without renaming, our `types.py` will conflict with it when you run unittests in pyspark/sql/. On Tue, May 26, 2015 at 11:57 AM, Justin Uang justin.u...@gmail.com wrote: In commit 04e44b37 (the migration to Python 3), pyspark/sql/types.py was renamed to pyspark/sql/_types.py, and then some magic in pyspark/sql/__init__.py dynamically renamed the module back to types. I imagine that this is some naming conflict with Python 3, but what was the error that showed up? The reason I'm asking about this is that it's messing with pylint, since pylint can no longer statically find the module. I also tried importing the package so that __init__ would be run in an init-hook, but that isn't what the discovery mechanism is using; I imagine it's probably just crawling the directory structure. One way to work around this would be something akin to this (http://stackoverflow.com/questions/9602811/how-to-tell-pylint-to-ignore-certain-imports), where I would have to create a fake module, but I would probably be missing a ton of pylint features for users of that module, and it's pretty hacky.
Re: Spark 1.4.0 pyspark and pylint breaking
Davies: Can we use relative imports (import .types) in the unit tests in order to disambiguate between the global and local module? Punya On Tue, May 26, 2015 at 3:09 PM Justin Uang justin.u...@gmail.com wrote: Thanks for clarifying! I don't understand Python package and module names that well, but I thought that the package namespacing would've helped, since you are in pyspark.sql.types. I guess not? On Tue, May 26, 2015 at 3:03 PM Davies Liu dav...@databricks.com wrote: There is a module called 'types' in Python 3: davies@localhost:~/work/spark$ python3 Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 00:54:21) [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> import types >>> types <module 'types' from '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/types.py'> Without renaming, our `types.py` will conflict with it when you run unittests in pyspark/sql/. On Tue, May 26, 2015 at 11:57 AM, Justin Uang justin.u...@gmail.com wrote: In commit 04e44b37 (the migration to Python 3), pyspark/sql/types.py was renamed to pyspark/sql/_types.py, and then some magic in pyspark/sql/__init__.py dynamically renamed the module back to types. I imagine that this is some naming conflict with Python 3, but what was the error that showed up? The reason I'm asking about this is that it's messing with pylint, since pylint can no longer statically find the module. I also tried importing the package so that __init__ would be run in an init-hook, but that isn't what the discovery mechanism is using; I imagine it's probably just crawling the directory structure. One way to work around this would be something akin to this (http://stackoverflow.com/questions/9602811/how-to-tell-pylint-to-ignore-certain-imports), where I would have to create a fake module, but I would probably be missing a ton of pylint features for users of that module, and it's pretty hacky.
Re: [VOTE] Release Apache Spark 1.4.0 (RC2)
-1 Found a new blocker, SPARK-7864 (https://issues.apache.org/jira/browse/SPARK-7864), that is being resolved by https://github.com/apache/spark/pull/6419. 2015-05-26 11:32 GMT-07:00 Shivaram Venkataraman shiva...@eecs.berkeley.edu: +1 Tested the SparkR binaries using a Standalone Hadoop 1 cluster, a YARN Hadoop 2.4 cluster, and on my Mac. A minor thing I noticed is that on the Amazon Linux AMI the R version is 3.1.1 while the binaries seem to have been built with R 3.1.3. This leads to a warning when we load the package but does not affect any functionality. FWIW, 3.1.1 is a year old and 3.2 is the current stable version. Thanks Shivaram On Tue, May 26, 2015 at 8:58 AM, Iulian Dragoș iulian.dra...@typesafe.com wrote: I tried the 1.4.0-rc2 binaries on a 3-node Mesos cluster; everything seemed to work fine, both spark-shell and spark-submit. Cluster mode deployment also worked. +1 (non-binding) iulian On Tue, May 26, 2015 at 4:44 AM, jameszhouyi yiaz...@gmail.com wrote: Compiled: git clone https://github.com/apache/spark.git git checkout tags/v1.4.0-rc2 ./make-distribution.sh --tgz --skip-java-test -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0 -Phive -Phive-0.13.1 -Phive-thriftserver -DskipTests Blocker issue in RC1/RC2: https://issues.apache.org/jira/browse/SPARK-7119 -- Iulian Dragos -- Reactive Apps on the JVM www.typesafe.com
Re: SparkR and RDDs
Hi Alek, Thanks for the info. You are correct that using the three colons does work. Admittedly I am an R novice, but since the three colons are used to access hidden methods, it seems pretty dirty. Can someone shed light on the design direction being taken with SparkR? Should I really be accessing hidden methods, or will a better approach prevail? For instance, it feels like the k-means sample should really use MLlib and not just be a port of the k-means sample using hidden methods. Am I looking at this incorrectly? Thanks, Andrew On Tue, May 26, 2015 at 6:56 AM, Eskilson,Aleksander alek.eskil...@cerner.com wrote: From the changes to the NAMESPACE file, that appears to be correct: all methods of the RDD API have been made private, which in R means that you may still access them by using the namespace prefix SparkR with three colons, e.g. SparkR:::func(foo, bar). So a starting place for porting old SparkR scripts from before the merge could be to identify the methods in the script that belong to the RDD class and make sure they have the namespace identifier tacked on the front. I hope that helps. Regards, Alek Eskilson From: Andrew Psaltis psaltis.and...@gmail.com Date: Monday, May 25, 2015 at 6:25 PM To: dev@spark.apache.org Subject: SparkR and RDDs Hi, I understand from SPARK-6799 [1] and the respective merge commit [2] that the RDD class is private in Spark 1.4. If I wanted to modify the old Kmeans and/or LR examples so that the computation happened in Spark, what is the best direction to go? Sorry if I am missing something obvious, but based on the NAMESPACE file [3] in the SparkR codebase I am having trouble seeing the obvious direction to go. Thanks in advance, Andrew [1] https://issues.apache.org/jira/browse/SPARK-6799 [2] https://github.com/apache/spark/commit/4b91e18d9b7803dbfe1e1cf20b46163d8cb8716c [3] https://github.com/apache/spark/blob/branch-1.4/R/pkg/NAMESPACE
GraphX implementation of ALS?
Hi all, I've heard in a number of presentations that Spark's ALS implementation was going to be moved over to a GraphX version. For example, this presentation on GraphX at the Spark Summit (https://databricks-training.s3.amazonaws.com/slides/graphx@sparksummit_2014-07.pdf, slide #23) mentioned a 40 LOC version using the Pregel API. Looking at the ALS source on master (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala) it looks like the original implementation is still being used and no use of GraphX can be seen. Other algorithms mentioned in the GraphX presentation can already be found in the repo (https://github.com/apache/spark/tree/master/graphx/src/main/scala/org/apache/spark/graphx/lib) but I don't see ALS. Could someone link me to the GraphX version for comparison purposes? Also, could someone comment on why the newer version isn't in use yet (i.e. are there tradeoffs with using the GraphX version that make it less desirable)? Thanks, Ben
Re: GraphX implementation of ALS?
This is the latest GraphX-based ALS implementation that I'm aware of: https://github.com/ankurdave/spark/blob/GraphXALS/graphx/src/main/scala/org/apache/spark/graphx/lib/ALS.scala When I benchmarked it last year, it was about twice as slow as MLlib's ALS, and I think the latter has gotten faster since then. The performance gap is because the MLlib version implements some ALS-specific optimizations that are hard to do using GraphX, such as storing the edges twice (partitioned by source and by destination) to reduce communication. Ankur http://www.ankurdave.com/ On Tue, May 26, 2015 at 3:36 PM, Ben Mabey b...@benmabey.com wrote: I've heard in a number of presentations that Spark's ALS implementation was going to be moved over to a GraphX version. For example, this presentation on GraphX at the Spark Summit (https://databricks-training.s3.amazonaws.com/slides/graphx@sparksummit_2014-07.pdf, slide #23) mentioned a 40 LOC version using the Pregel API. Looking at the ALS source on master (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala) it looks like the original implementation is still being used and no use of GraphX can be seen. Other algorithms mentioned in the GraphX presentation can already be found in the repo (https://github.com/apache/spark/tree/master/graphx/src/main/scala/org/apache/spark/graphx/lib) but I don't see ALS. Could someone link me to the GraphX version for comparison purposes? Also, could someone comment on why the newer version isn't in use yet (i.e. are there tradeoffs with using the GraphX version that make it less desirable)?
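To illustrate the "store the edges twice" idea Ankur mentions — the same ratings kept grouped by user and again by item, so each half of the alternating solve reads data that is already laid out for it — here is a toy single-machine numpy sketch (my own illustration; it is not MLlib's or GraphX's code):

import numpy as np
from collections import defaultdict

# toy ratings as (user, item, rating) edges
ratings = [(0, 0, 4.0), (0, 1, 2.0), (1, 0, 5.0), (1, 2, 1.0), (2, 1, 3.0), (2, 2, 4.0)]

# store the same edges twice: once grouped by user, once grouped by item
by_user, by_item = defaultdict(list), defaultdict(list)
for u, i, r in ratings:
    by_user[u].append((i, r))
    by_item[i].append((u, r))

rank, lam = 2, 0.1
np.random.seed(0)
X = np.random.normal(size=(3, rank))  # user factors
Y = np.random.normal(size=(3, rank))  # item factors

def solve(nbrs, factors):
    # regularized least squares against the factors on the other side
    ids = [j for j, _ in nbrs]
    A = factors[ids]
    b = np.array([r for _, r in nbrs])
    return np.linalg.solve(A.T @ A + lam * np.eye(rank), A.T @ b)

for _ in range(10):
    for u, nbrs in by_user.items():  # fix Y, update user factors from by_user
        X[u] = solve(nbrs, Y)
    for i, nbrs in by_item.items():  # fix X, update item factors from by_item
        Y[i] = solve(nbrs, X)

print(X @ Y.T)  # approximate reconstruction of the rating matrix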