Re: SparkR and RDDs

Andrew Psaltis Wed, 27 May 2015 20:46:29 -0700

Hi Shivaram,
Thanks for the details; they are greatly appreciated.

Thanks


On Wed, May 27, 2015 at 7:25 PM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> Sorry for the delay in getting back on this. So the RDD interface is
> private in the 1.4 release, but as Alek mentioned you can still use it by
> prefixing calls with `SparkR:::`.
>
> Regarding design direction -- there are two JIRAs which cover major
> features we plan to work on for 1.5. SPARK-6805 tracks porting high-level
> machine learning operations like `glm` and `kmeans` to SparkR using the ML
> Pipeline implementation in Scala as the backend.
>
> We are also planning to develop a parallel API where users can run native
> R functions in a distributed setting; SPARK-7264 tracks this effort. If
> you have specific use cases, feel free to chime in on the JIRA or on the dev
> mailing list.
>
> Thanks
> Shivaram
>
> On Tue, May 26, 2015 at 11:40 AM, Reynold Xin <r...@databricks.com> wrote:
>
>> You definitely don't want to implement kmeans in R, since it would be
>> very slow. Just providing R wrappers for the MLlib implementation is the
>> way to go. I believe one of the major items coming next in SparkR is the
>> MLlib wrappers.
>>
>>
>>
>> On Tue, May 26, 2015 at 7:46 AM, Andrew Psaltis <psaltis.and...@gmail.com
>> > wrote:
>>
>>> Hi Alek,
>>> Thanks for the info. You are correct that using the three colons does
>>> work. Admittedly I am an R novice, but since the three colons are used to
>>> access hidden methods, it seems pretty dirty.
>>>
>>> Can someone shed light on the design direction being taken with SparkR?
>>> Should I really be accessing hidden methods, or will a better approach
>>> prevail? For instance, it feels like the k-means sample should really use
>>> MLlib and not just be a port of the k-means sample using hidden methods.
>>> Am I looking at this incorrectly?
>>>
>>> Thanks,
>>> Andrew
>>>
>>> On Tue, May 26, 2015 at 6:56 AM, Eskilson,Aleksander <
>>> alek.eskil...@cerner.com> wrote:
>>>
>>>> From the changes to the namespace file, that appears to be correct:
>>>> all methods of the RDD API have been made private, which in R means that
>>>> you may still access them by using the namespace prefix SparkR with three
>>>> colons, e.g. SparkR:::func(foo, bar).
>>>>
>>>>  So a starting place for porting old SparkR scripts from before the
>>>> merge could be to identify those methods in the script belonging to the RDD
>>>> class and be sure they have the namespace identifier tacked on the front. I
>>>> hope that helps.
>>>>
>>>>  Regards,
>>>> Alek Eskilson
>>>>
>>>>   From: Andrew Psaltis <psaltis.and...@gmail.com>
>>>> Date: Monday, May 25, 2015 at 6:25 PM
>>>> To: "dev@spark.apache.org" <dev@spark.apache.org>
>>>> Subject: SparkR and RDDs
>>>>
>>>>   Hi,
>>>> I understand from SPARK-6799 [1] and the respective merge commit [2]
>>>> that the RDD class is private in Spark 1.4. If I wanted to modify the old
>>>> Kmeans and/or LR examples so that the computation happens in Spark, what
>>>> is the best direction to go? Sorry if I am missing something obvious, but
>>>> based on the NAMESPACE file [3] in the SparkR codebase I am having trouble
>>>> seeing which direction to take.
>>>>
>>>>  Thanks in advance,
>>>> Andrew
>>>>
>>>>  [1] https://issues.apache.org/jira/browse/SPARK-6799
>>>> [2]
>>>> https://github.com/apache/spark/commit/4b91e18d9b7803dbfe1e1cf20b46163d8cb8716c
>>>> [3] https://github.com/apache/spark/blob/branch-1.4/R/pkg/NAMESPACE
>>>>
>>>>
>>>
>>>
>>
>
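
As a rough illustration of the `:::` access pattern discussed in this thread, here is a minimal base-R sketch. It uses `t.test.default` from the `stats` package purely as a stand-in for a private function (an assumption made so the example runs without Spark; it is not part of SparkR), and the variable names are hypothetical.

```r
# Stand-in example: the 'stats' package plays the role of SparkR here.
ns <- asNamespace("stats")

# t.test.default is defined inside the namespace but is not exported.
is_defined  <- exists("t.test.default", envir = ns, inherits = FALSE)
is_exported <- "t.test.default" %in% getNamespaceExports("stats")

# `::` only reaches exported objects, so this lookup raises an error,
# which is caught and turned into NULL.
double_colon <- tryCatch(stats::t.test.default, error = function(e) NULL)

# `:::` reaches any object in the namespace, exported or not.
triple_colon <- stats:::t.test.default

# Once retrieved, the private function is called like any other function.
result <- triple_colon(c(1, 2, 3, 4, 5), mu = 3)
```

The same mechanism is what makes `SparkR:::func(foo, bar)` work in 1.4: the RDD functions still live in the SparkR namespace, they are just no longer listed as exports in the NAMESPACE file. Because private functions carry no stability guarantee, code written this way may break between releases.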
