Re: GraphX implementation of ALS?

2015-05-26 Thread Ben Mabey

On 5/26/15 5:45 PM, Ankur Dave wrote:
This is the latest GraphX-based ALS implementation that I'm aware of: 
https://github.com/ankurdave/spark/blob/GraphXALS/graphx/src/main/scala/org/apache/spark/graphx/lib/ALS.scala


When I benchmarked it last year, it was about twice as slow as MLlib's 
ALS, and I think the latter has gotten faster since then. The 
performance gap is because the MLlib version implements some 
ALS-specific optimizations that are hard to do using GraphX, such as 
storing the edges twice (partitioned by source and by destination) to 
reduce communication.


Ankur http://www.ankurdave.com/


Great, thanks for the link and explanation!


Re: SparkR and RDDs

2015-05-26 Thread Eskilson,Aleksander
From the changes to the NAMESPACE file, that appears to be correct: all 
methods of the RDD API have been made private. In R this means you may 
still access them by using the SparkR namespace prefix with three colons, e.g. 
SparkR:::func(foo, bar).

So a starting place for porting old SparkR scripts from before the merge could 
be to identify those methods in the script that belong to the RDD class and 
make sure they have the namespace prefix tacked on the front. I hope that helps.

Regards,
Alek Eskilson

From: Andrew Psaltis psaltis.and...@gmail.com
Date: Monday, May 25, 2015 at 6:25 PM
To: dev@spark.apache.org dev@spark.apache.org
Subject: SparkR and RDDs

Hi,
I understand from SPARK-6799 [1] and the respective merge commit [2] that the 
RDD class is private in Spark 1.4. If I wanted to modify the old Kmeans and/or 
LR examples so that the computation happened in Spark, what is the best 
direction to go? Sorry if I am missing something obvious, but based on the 
NAMESPACE file [3] in the SparkR codebase I am having trouble seeing the 
obvious direction to go.

Thanks in advance,
Andrew

[1] https://issues.apache.org/jira/browse/SPARK-6799
[2] https://github.com/apache/spark/commit/4b91e18d9b7803dbfe1e1cf20b46163d8cb8716c
[3] https://github.com/apache/spark/blob/branch-1.4/R/pkg/NAMESPACE



Re: [VOTE] Release Apache Spark 1.4.0 (RC2)

2015-05-26 Thread Shivaram Venkataraman
+1

Tested the SparkR binaries using a Standalone Hadoop 1 cluster, a YARN
Hadoop 2.4 cluster and on my Mac.

A minor thing I noticed is that on the Amazon Linux AMI, the R version is 3.1.1
while the binaries seem to have been built with R 3.1.3. This leads to a
warning when we load the package but does not affect any functionality.
FWIW 3.1.1 is a year old and 3.2 is the current stable version.

Thanks
Shivaram

On Tue, May 26, 2015 at 8:58 AM, Iulian Dragoș iulian.dra...@typesafe.com
wrote:

 I tried 1.4.0-rc2 binaries on a 3-node Mesos cluster, everything seemed to
 work fine, both spark-shell and spark-submit. Cluster mode deployment also
 worked.

 +1 (non-binding)

 iulian

 On Tue, May 26, 2015 at 4:44 AM, jameszhouyi yiaz...@gmail.com wrote:

 Compiled:
 git clone https://github.com/apache/spark.git
 git checkout tags/v1.4.0-rc2
 ./make-distribution.sh --tgz --skip-java-test -Pyarn -Phadoop-2.4
 -Dhadoop.version=2.5.0 -Phive -Phive-0.13.1 -Phive-thriftserver
 -DskipTests

 Block issue in RC1/RC2:
 https://issues.apache.org/jira/browse/SPARK-7119









 --

 --
 Iulian Dragos

 --
 Reactive Apps on the JVM
 www.typesafe.com




Re: SparkR and RDDs

2015-05-26 Thread Reynold Xin
You definitely don't want to implement kmeans in R, since it would be very
slow. Just providing R wrappers for the MLlib implementation is the way to
go. I believe one of the next major items for SparkR is the MLlib wrappers.



On Tue, May 26, 2015 at 7:46 AM, Andrew Psaltis psaltis.and...@gmail.com
wrote:

 Hi Alek,
 Thanks for the info. You are correct ,that using the three colons does
 work. Admittedly I am a R novice, but since the three colons is used to
 access hidden methods, it seems pretty dirty.

 Can someone shed light on the design direction being taken with SparkR?
 Should I really be accessing hidden methods or will better approach
 prevail? For instance, it feels like the k-means sample should really use
 MLlib and not just be a port the k-means sample using hidden methods. Am I
 looking at this incorrectly?

 Thanks,
 Andrew

 On Tue, May 26, 2015 at 6:56 AM, Eskilson,Aleksander 
 alek.eskil...@cerner.com wrote:

  From the changes to the namespace file, that appears to be correct, all
 methods of the RDD API have been made private, which in R means that you
 may still access them by using the namespace prefix SparkR with three
 colons, e.g. SparkR:::func(foo, bar).

  So a starting place for porting old SparkR scripts from before the
 merge could be to identify those methods in the script belonging to the RDD
 class and be sure they have the namespace identifier tacked on the front. I
 hope that helps.

  Regards,
 Alek Eskilson

   From: Andrew Psaltis psaltis.and...@gmail.com
 Date: Monday, May 25, 2015 at 6:25 PM
 To: dev@spark.apache.org dev@spark.apache.org
 Subject: SparkR and RDDs

   Hi,
 I understand from SPARK-6799[1] and the respective merge commit [2]  that
 the RDD class is private in Spark 1.4 . If I wanted to modify the old
 Kmeans and/or LR examples so that the computation happened in Spark what is
 the best direction to go? Sorry if I am missing something obvious, but
 based on the NAMESPACE file [3] in the SparkR codebase I am having trouble
 seeing the obvious direction to go.

  Thanks in advance,
 Andrew

  [1] https://issues.apache.org/jira/browse/SPARK-6799
 [2]
 https://github.com/apache/spark/commit/4b91e18d9b7803dbfe1e1cf20b46163d8cb8716c
 [3] https://github.com/apache/spark/blob/branch-1.4/R/pkg/NAMESPACE






Re: [VOTE] Release Apache Spark 1.4.0 (RC2)

2015-05-26 Thread Iulian Dragoș
I tried 1.4.0-rc2 binaries on a 3-node Mesos cluster, everything seemed to
work fine, both spark-shell and spark-submit. Cluster mode deployment also
worked.

+1 (non-binding)

iulian

On Tue, May 26, 2015 at 4:44 AM, jameszhouyi yiaz...@gmail.com wrote:

 Compiled:
 git clone https://github.com/apache/spark.git
 git checkout tags/v1.4.0-rc2
 ./make-distribution.sh --tgz --skip-java-test -Pyarn -Phadoop-2.4
 -Dhadoop.version=2.5.0 -Phive -Phive-0.13.1 -Phive-thriftserver -DskipTests

 Block issue in RC1/RC2:
 https://issues.apache.org/jira/browse/SPARK-7119









-- 

--
Iulian Dragos

--
Reactive Apps on the JVM
www.typesafe.com


Re: Spark 1.4.0 pyspark and pylint breaking

2015-05-26 Thread Davies Liu
I think relative imports cannot help in this case.

When you run scripts in pyspark/sql, Python doesn't know anything about
the pyspark.sql package; it just sees types.py as a separate module.

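A minimal illustration of why (hypothetical file names, not Spark's actual
layout) -- relative imports are written "from . import types", and they only
resolve when the file is executed as part of its package, not when it is run
directly as a script:

# Hypothetical layout: pkg/__init__.py (empty), pkg/types.py, pkg/dataframe.py
# pkg/dataframe.py:
from . import types          # resolves when run as:  python3 -m pkg.dataframe

if __name__ == "__main__":
    print(types.__file__)    # prints the path to pkg/types.py

# Running the file directly instead (python3 pkg/dataframe.py) fails with an
# "attempted relative import" error (exact wording varies by Python version),
# because the script is loaded with no package context.
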
On Tue, May 26, 2015 at 12:44 PM, Punyashloka Biswal
punya.bis...@gmail.com wrote:
 Davies: Can we use relative imports (import .types) in the unit tests in
 order to disambiguate between the global and local module?

 Punya

 On Tue, May 26, 2015 at 3:09 PM Justin Uang justin.u...@gmail.com wrote:

 Thanks for clarifying! I don't understand python package and modules names
 that well, but I thought that the package namespacing would've helped, since
 you are in pyspark.sql.types. I guess not?

 On Tue, May 26, 2015 at 3:03 PM Davies Liu dav...@databricks.com wrote:

 There is a module called 'types' in python 3:

 davies@localhost:~/work/spark$ python3
 Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 00:54:21)
 [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import types
 >>> types
 <module 'types' from '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/types.py'>

 Without renaming, our `types.py` will conflict with it when you run
 unittests in pyspark/sql/ .

 On Tue, May 26, 2015 at 11:57 AM, Justin Uang justin.u...@gmail.com
 wrote:
  In commit 04e44b37, the migration to Python 3, pyspark/sql/types.py was
  renamed to pyspark/sql/_types.py and then some magic in
  pyspark/sql/__init__.py dynamically renamed the module back to types. I
  imagine that this is some naming conflict with Python 3, but what was
  the
  error that showed up?
 
  The reason why I'm asking about this is because it's messing with
  pylint,
  since pylint cannot now statically find the module. I tried also
  importing
  the package so that __init__ would be run in a init-hook, but that
  isn't
  what the discovery mechanism is using. I imagine it's probably just
  crawling
  the directory structure.
 
  One way to work around this would be something akin to this
 
  (http://stackoverflow.com/questions/9602811/how-to-tell-pylint-to-ignore-certain-imports),
  where I would have to create a fake module, but I would probably be
  missing
  a ton of pylint features on users of that module, and it's pretty
  hacky.




Re: Spark 1.4.0 pyspark and pylint breaking

2015-05-26 Thread Davies Liu
When you run the tests in python/pyspark/sql/ via

bin/spark-submit python/pyspark/sql/dataframe.py

the current directory is the first item in sys.path, so sql/types.py
will have higher priority than python3.4/types.py and the tests will
fail.
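
A quick stand-alone way to see the shadowing outside of Spark (a toy sketch
with made-up file names):

# shadow_demo.py -- put an empty file named types.py next to it, then run:
#   python3 shadow_demo.py
import sys
print(sys.path[0])       # the directory containing this script (may be '' for the cwd)

import types             # picks up the local ./types.py, not the stdlib module
print(types.__file__)    # points at the local types.py, not .../lib/python3.x/types.py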

On Tue, May 26, 2015 at 12:08 PM, Justin Uang justin.u...@gmail.com wrote:
 Thanks for clarifying! I don't understand python package and modules names
 that well, but I thought that the package namespacing would've helped, since
 you are in pyspark.sql.types. I guess not?

 On Tue, May 26, 2015 at 3:03 PM Davies Liu dav...@databricks.com wrote:

 There is a module called 'types' in python 3:

 davies@localhost:~/work/spark$ python3
 Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 00:54:21)
 [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import types
 >>> types
 <module 'types' from '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/types.py'>

 Without renaming, our `types.py` will conflict with it when you run
 unittests in pyspark/sql/ .

 On Tue, May 26, 2015 at 11:57 AM, Justin Uang justin.u...@gmail.com
 wrote:
  In commit 04e44b37, the migration to Python 3, pyspark/sql/types.py was
  renamed to pyspark/sql/_types.py and then some magic in
  pyspark/sql/__init__.py dynamically renamed the module back to types. I
  imagine that this is some naming conflict with Python 3, but what was
  the
  error that showed up?
 
  The reason why I'm asking about this is because it's messing with
  pylint,
  since pylint cannot now statically find the module. I tried also
  importing
  the package so that __init__ would be run in a init-hook, but that isn't
  what the discovery mechanism is using. I imagine it's probably just
  crawling
  the directory structure.
 
  One way to work around this would be something akin to this
 
  (http://stackoverflow.com/questions/9602811/how-to-tell-pylint-to-ignore-certain-imports),
  where I would have to create a fake module, but I would probably be
  missing
  a ton of pylint features on users of that module, and it's pretty hacky.




Spark 1.4.0 pyspark and pylint breaking

2015-05-26 Thread Justin Uang
In commit 04e44b37 (the migration to Python 3), pyspark/sql/types.py was
renamed to pyspark/sql/_types.py, and then some magic in
pyspark/sql/__init__.py dynamically renamed the module back to types. I
imagine this was done to avoid a naming conflict with Python 3, but what was
the error that showed up?

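For anyone curious what that kind of dynamic rename looks like, here is a
rough sketch of the general technique (my own illustration, not the actual
contents of pyspark/sql/__init__.py):

# pyspark/sql/__init__.py -- sketch of the idea only
import sys
from . import _types                 # the file on disk is _types.py

# Re-register the module under its public dotted name so that
# "import pyspark.sql.types" and "from pyspark.sql import types" keep working.
_types.__name__ = __name__ + ".types"
sys.modules[_types.__name__] = _types
types = _types
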
The reason I'm asking about this is that it's messing with pylint,
since pylint can no longer statically find the module. I also tried importing
the package so that __init__ would be run in an init-hook, but that isn't
what the discovery mechanism is using. I imagine it's probably just
crawling the directory structure.

One way to work around this would be something akin to this (
http://stackoverflow.com/questions/9602811/how-to-tell-pylint-to-ignore-certain-imports),
where I would have to create a fake module, but I would probably lose a ton
of pylint features for users of that module, and it's pretty hacky.


Re: Spark 1.4.0 pyspark and pylint breaking

2015-05-26 Thread Justin Uang
Thanks for clarifying! I don't understand Python package and module names
that well, but I thought that the package namespacing would've helped,
since you are in pyspark.sql.types. I guess not?

On Tue, May 26, 2015 at 3:03 PM Davies Liu dav...@databricks.com wrote:

 There is a module called 'types' in python 3:

 davies@localhost:~/work/spark$ python3
 Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 00:54:21)
 [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import types
 >>> types
 <module 'types' from '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/types.py'>

 Without renaming, our `types.py` will conflict with it when you run
 unittests in pyspark/sql/ .

 On Tue, May 26, 2015 at 11:57 AM, Justin Uang justin.u...@gmail.com
 wrote:
  In commit 04e44b37, the migration to Python 3, pyspark/sql/types.py was
  renamed to pyspark/sql/_types.py and then some magic in
  pyspark/sql/__init__.py dynamically renamed the module back to types. I
  imagine that this is some naming conflict with Python 3, but what was the
  error that showed up?
 
  The reason why I'm asking about this is because it's messing with pylint,
  since pylint cannot now statically find the module. I tried also
 importing
  the package so that __init__ would be run in a init-hook, but that isn't
  what the discovery mechanism is using. I imagine it's probably just
 crawling
  the directory structure.
 
  One way to work around this would be something akin to this
  (
 http://stackoverflow.com/questions/9602811/how-to-tell-pylint-to-ignore-certain-imports
 ),
  where I would have to create a fake module, but I would probably be
 missing
  a ton of pylint features on users of that module, and it's pretty hacky.



Re: Spark 1.4.0 pyspark and pylint breaking

2015-05-26 Thread Punyashloka Biswal
Davies: Can we use relative imports (from . import types) in the unit tests in
order to disambiguate between the global and local module?

Punya

On Tue, May 26, 2015 at 3:09 PM Justin Uang justin.u...@gmail.com wrote:

 Thanks for clarifying! I don't understand python package and modules names
 that well, but I thought that the package namespacing would've helped,
 since you are in pyspark.sql.types. I guess not?

 On Tue, May 26, 2015 at 3:03 PM Davies Liu dav...@databricks.com wrote:

 There is a module called 'types' in python 3:

 davies@localhost:~/work/spark$ python3
 Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 00:54:21)
 [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import types
 >>> types
 <module 'types' from '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/types.py'>

 Without renaming, our `types.py` will conflict with it when you run
 unittests in pyspark/sql/ .

 On Tue, May 26, 2015 at 11:57 AM, Justin Uang justin.u...@gmail.com
 wrote:
  In commit 04e44b37, the migration to Python 3, pyspark/sql/types.py was
  renamed to pyspark/sql/_types.py and then some magic in
  pyspark/sql/__init__.py dynamically renamed the module back to types. I
  imagine that this is some naming conflict with Python 3, but what was
 the
  error that showed up?
 
  The reason why I'm asking about this is because it's messing with
 pylint,
  since pylint cannot now statically find the module. I tried also
 importing
  the package so that __init__ would be run in a init-hook, but that isn't
  what the discovery mechanism is using. I imagine it's probably just
 crawling
  the directory structure.
 
  One way to work around this would be something akin to this
  (
 http://stackoverflow.com/questions/9602811/how-to-tell-pylint-to-ignore-certain-imports
 ),
  where I would have to create a fake module, but I would probably be
 missing
  a ton of pylint features on users of that module, and it's pretty hacky.




Re: [VOTE] Release Apache Spark 1.4.0 (RC2)

2015-05-26 Thread Andrew Or
-1

Found a new blocker, SPARK-7864
(https://issues.apache.org/jira/browse/SPARK-7864), that is being resolved
by https://github.com/apache/spark/pull/6419.

2015-05-26 11:32 GMT-07:00 Shivaram Venkataraman shiva...@eecs.berkeley.edu
:

 +1

 Tested the SparkR binaries using a Standalone Hadoop 1 cluster, a YARN
 Hadoop 2.4 cluster and on my Mac.

 Minor thing I noticed is that on Amazon Linux AMI, the R version is 3.1.1
 while the binaries seem to have been built with R 3.1.3. This leads to a
 warning when we load the package but does not affect any functionality.
 FWIW 3.1.1 is a year old and 3.2 is the current stable version,.

 Thanks
 Shivaram

 On Tue, May 26, 2015 at 8:58 AM, Iulian Dragoș iulian.dra...@typesafe.com
  wrote:

 I tried 1.4.0-rc2 binaries on a 3-node Mesos cluster, everything seemed
 to work fine, both spark-shell and spark-submit. Cluster mode deployment
 also worked.

 +1 (non-binding)

 iulian

 On Tue, May 26, 2015 at 4:44 AM, jameszhouyi yiaz...@gmail.com wrote:

 Compiled:
 git clone https://github.com/apache/spark.git
 git checkout tags/v1.4.0-rc2
 ./make-distribution.sh --tgz --skip-java-test -Pyarn -Phadoop-2.4
 -Dhadoop.version=2.5.0 -Phive -Phive-0.13.1 -Phive-thriftserver
 -DskipTests

 Block issue in RC1/RC2:
 https://issues.apache.org/jira/browse/SPARK-7119









 --

 --
 Iulian Dragos

 --
 Reactive Apps on the JVM
 www.typesafe.com





Re: SparkR and RDDs

2015-05-26 Thread Andrew Psaltis
Hi Alek,
Thanks for the info. You are correct that using the three colons does
work. Admittedly I am an R novice, but since the three colons are used to
access hidden methods, it seems pretty dirty.

Can someone shed light on the design direction being taken with SparkR?
Should I really be accessing hidden methods, or will a better approach
prevail? For instance, it feels like the k-means sample should really use
MLlib and not just be a port of the existing k-means sample using hidden
methods. Am I looking at this incorrectly?

Thanks,
Andrew

On Tue, May 26, 2015 at 6:56 AM, Eskilson,Aleksander 
alek.eskil...@cerner.com wrote:

  From the changes to the namespace file, that appears to be correct, all
 methods of the RDD API have been made private, which in R means that you
 may still access them by using the namespace prefix SparkR with three
 colons, e.g. SparkR:::func(foo, bar).

  So a starting place for porting old SparkR scripts from before the merge
 could be to identify those methods in the script belonging to the RDD class
 and be sure they have the namespace identifier tacked on the front. I hope
 that helps.

  Regards,
 Alek Eskilson

   From: Andrew Psaltis psaltis.and...@gmail.com
 Date: Monday, May 25, 2015 at 6:25 PM
 To: dev@spark.apache.org dev@spark.apache.org
 Subject: SparkR and RDDs

   Hi,
 I understand from SPARK-6799[1] and the respective merge commit [2]  that
 the RDD class is private in Spark 1.4 . If I wanted to modify the old
 Kmeans and/or LR examples so that the computation happened in Spark what is
 the best direction to go? Sorry if I am missing something obvious, but
 based on the NAMESPACE file [3] in the SparkR codebase I am having trouble
 seeing the obvious direction to go.

  Thanks in advance,
 Andrew

  [1] https://issues.apache.org/jira/browse/SPARK-6799
 [2]
 https://github.com/apache/spark/commit/4b91e18d9b7803dbfe1e1cf20b46163d8cb8716c
 [3] https://github.com/apache/spark/blob/branch-1.4/R/pkg/NAMESPACE




GraphX implementation of ALS?

2015-05-26 Thread Ben Mabey

Hi all,
I've heard in a number of presentations that Spark's ALS implementation was 
going to be moved over to a GraphX version. For example, this presentation 
on GraphX at the Spark Summit 
(https://databricks-training.s3.amazonaws.com/slides/graphx@sparksummit_2014-07.pdf, 
slide #23) mentioned a 40 LOC version using the Pregel API. Looking at the 
ALS source on master 
(https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala) 
it looks like the original implementation is still being used and no use 
of GraphX can be seen. Other algorithms mentioned in the GraphX 
presentation can already be found in the repo 
(https://github.com/apache/spark/tree/master/graphx/src/main/scala/org/apache/spark/graphx/lib), 
but I don't see ALS. Could someone link me to the GraphX version 
for comparison purposes? Also, could someone comment on why the 
newer version isn't in use yet (i.e. are there tradeoffs with using the 
GraphX version that make it less desirable)?


Thanks,
Ben



Re: GraphX implementation of ALS?

2015-05-26 Thread Ankur Dave
This is the latest GraphX-based ALS implementation that I'm aware of:
https://github.com/ankurdave/spark/blob/GraphXALS/graphx/src/main/scala/org/apache/spark/graphx/lib/ALS.scala

When I benchmarked it last year, it was about twice as slow as MLlib's ALS,
and I think the latter has gotten faster since then. The performance gap is
because the MLlib version implements some ALS-specific optimizations that
are hard to do using GraphX, such as storing the edges twice (partitioned
by source and by destination) to reduce communication.
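
To make the layout idea concrete, here is a toy single-machine sketch (my own
illustration, not MLlib's code): the ratings are stored twice, once grouped by
user and once by item, so each half of an ALS iteration only scans the grouping
it needs, and in a cluster each copy can be partitioned so the factors it joins
against are co-located.

from collections import defaultdict

ratings = [("u1", "i1", 4.0), ("u1", "i2", 3.0), ("u2", "i1", 5.0)]

# Store each rating edge twice: grouped by user (source) and by item (destination).
by_user = defaultdict(list)   # user -> [(item, rating), ...]
by_item = defaultdict(list)   # item -> [(user, rating), ...]
for user, item, r in ratings:
    by_user[user].append((item, r))
    by_item[item].append((user, r))

# One ALS iteration then alternates over the two groupings, each already laid
# out the way that half of the update needs it.
for user, item_ratings in by_user.items():
    pass  # solve this user's factor from item_ratings and the current item factors
for item, user_ratings in by_item.items():
    pass  # solve this item's factor from user_ratings and the current user factors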

Ankur http://www.ankurdave.com/

On Tue, May 26, 2015 at 3:36 PM, Ben Mabey b...@benmabey.com wrote:

 I've heard in a number of presentations Spark's ALS implementation was
 going to be moved over to a GraphX version. For example, this
 presentation on GraphX
 https://databricks-training.s3.amazonaws.com/slides/graphx@sparksummit_2014-07.pdf(slide
 #23) at the Spark Summit mentioned a 40 LOC version using the Pregel API.
 Looking at the ALS source on master
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
 it looks like the original implementation is still being used and no use of
 GraphX can be seen. Other algorithms mentioned in the GraphX presentation
 can be found in the repo
 https://github.com/apache/spark/tree/master/graphx/src/main/scala/org/apache/spark/graphx/lib
 already but I don't see ALS. Could someone link me to the GraphX version
 for comparison purposes?  Also, could someone comment on why the the newer
 version isn't in use yet (i.e. are there tradeoffs with using the GraphX
 version that makes it less desirable)?