Re: [ANNOUNCE] New SAMBA Package = Spark + AWS Lambda

2016-02-02 Thread David Russell
Hi Ben,

> My company uses Lambda to do simple data moving and processing using Python
> scripts. I can see that using Spark instead for the data processing would make
> it a real production-level platform.

That may be true. Spark has first-class support for Python, which
should make your life easier if you do go this route. Once you've
fleshed out your ideas, I'm sure folks on this mailing list can provide
helpful guidance based on their real-world experience with Spark.

> Does this pave the way into replacing
> the need of a pre-instantiated cluster in AWS or bought hardware in a
> datacenter?

In a word, no. SAMBA is designed to extend, not replace, the traditional
Spark computation and deployment model. At its most basic, the
traditional Spark computation model distributes data and computations
across worker nodes in the cluster.

SAMBA simply allows some of those computations to be performed by AWS
Lambda rather than locally on your worker nodes. There are, I believe, a
number of potential benefits to using SAMBA in some circumstances (a
rough sketch follows the list):

1. It can help reduce the workload on your Spark cluster by moving
some of that workload onto AWS Lambda, an on-demand compute service.

2. It allows Spark applications written in Java or Scala to make use
of libraries and features offered by Python and JavaScript (Node.js)
today and, potentially, by additional languages in the future as AWS
Lambda language support evolves.

3. It provides a simple, clean API for integration with REST APIs,
which may benefit Spark applications that form part of a broader
data pipeline or solution.
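To make the delegation model a little more concrete, here is a minimal,
self-contained Scala sketch of the idea. Note this is not the SAMBA API
itself: the lambdaCall helper and the endpoint URL are hypothetical
stand-ins that simply POST each record to a Lambda-backed REST endpoint
(for example, one fronted by Amazon API Gateway). The SAMBA README shows
the actual classes and methods.

// Sketch only, NOT the SAMBA API: lambdaCall is a hypothetical stand-in
// that POSTs a record to a Lambda-backed REST endpoint.
import java.net.{HttpURLConnection, URL}
import org.apache.spark.{SparkConf, SparkContext}

object LambdaDelegationSketch {

  // Hypothetical helper: send one record to an HTTP endpoint, return the body.
  def lambdaCall(endpoint: String, payload: String): String = {
    val conn = new URL(endpoint).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setDoOutput(true)
    conn.getOutputStream.write(payload.getBytes("UTF-8"))
    scala.io.Source.fromInputStream(conn.getInputStream).mkString
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lambda-delegation-sketch"))
    // Placeholder URL for a Lambda function exposed through API Gateway.
    val endpoint = "https://example.execute-api.us-east-1.amazonaws.com/prod/score"

    val records = sc.parallelize(1 to 100)

    // Each record is processed by the Lambda function behind the endpoint,
    // not by compute on the worker node itself; the worker only does the I/O.
    val scored = records.map(r => lambdaCall(endpoint, s"""{"value": $r}"""))

    scored.collect().foreach(println)
    sc.stop()
  }
}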

> If so, then this would be a great efficiency and make an easier
> entry point for Spark usage. I hope the vision is to get rid of all cluster
> management when using Spark.

You might find one of the hosted Spark platforms that handle cluster
management for you, such as Databricks or Amazon EMR, a good place to
start. At least in my experience, they got me up and running without
difficulty.

David




Re: Guidelines for writing SPARK packages

2016-02-01 Thread David Russell
Hi Praveen,

The basic requirements for releasing a Spark package on
spark-packages.org are as follows:

1. The package content must be hosted by GitHub in a public repo under
the owner's account.
2. The repo name must match the package name.
3. The master branch of the repo must contain "README.md" and "LICENSE".

Per the docs on the spark-packages.org site, an example package that
meets those requirements can be found at
https://github.com/databricks/spark-avro. My own recently released
SAMBA package also meets these requirements:
https://github.com/onetapbeyond/lambda-spark-executor.

As you can see, there is nothing in this list of requirements that
demands the implementation of specific interfaces. What you'll need to
implement will depend entirely on what you want to accomplish. If you
want to register a release for your package, you will also need to push
the package artifacts to Maven Central.
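For what it's worth, if you build with sbt, the Databricks
sbt-spark-package plugin (added via project/plugins.sbt) can take care of
much of the packaging and release mechanics. A rough build.sbt sketch
follows; the setting keys shown (spName, sparkVersion, sparkComponents)
are assumptions to verify against the plugin's own documentation.

// Rough build.sbt sketch, assuming the sbt-spark-package plugin is enabled;
// verify the exact setting names against the plugin docs.
name := "my-spark-package"                      // should match the GitHub repo name
organization := "com.example"                   // placeholder
version := "0.1.0"
scalaVersion := "2.10.6"

// Assumed sbt-spark-package settings:
spName := "your-github-user/my-spark-package"   // owner/repo as registered on spark-packages.org
sparkVersion := "1.6.0"
sparkComponents += "core"                       // add "sql", "streaming", etc. as needed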

David


On Mon, Feb 1, 2016 at 7:03 AM, Praveen Devarao  wrote:
> Hi,
>
> Are there any guidelines or specs for writing a Spark package? I would
> like to implement a Spark package and would like to know how it needs to
> be structured (implement some interfaces, etc.) so that it can plug into Spark
> for extended functionality.
>
> Could anyone point me to docs or links on the above?
>
> Thanking You
>
> Praveen Devarao



-- 
"All that is gold does not glitter, Not all those who wander are lost."




[ANNOUNCE] New SAMBA Package = Spark + AWS Lambda

2016-02-01 Thread David Russell
Hi all,

Just sharing news of the release of a new Spark package, SAMBA.


https://github.com/onetapbeyond/lambda-spark-executor

SAMBA is an Apache Spark package offering seamless integration with the
AWS Lambda compute service for Spark batch and streaming applications on
the JVM.

Within traditional Spark deployments, RDD tasks are executed using fixed
compute resources on worker nodes within the Spark cluster. With SAMBA,
application developers can delegate selected RDD tasks to execute using
on-demand AWS Lambda compute infrastructure in the cloud.

Not unlike the recently released ROSE package, which extends the
capabilities of traditional Spark applications with support for CRAN R
analytics, SAMBA provides another (hopefully) useful extension for
Spark application developers on the JVM.

SAMBA Spark Package: https://github.com/onetapbeyond/lambda-spark-executor

ROSE Spark Package: https://github.com/onetapbeyond/opencpu-spark-executor


Questions, suggestions, feedback welcome.

David

-- 
"*All that is gold does not glitter,** Not all those who wander are lost."*


Re: rdd.foreach return value

2016-01-18 Thread David Russell
The foreach operation on an RDD has a Unit (void) return type (see attached), so
there is no return value sent back to the driver.
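
A minimal illustration, using nothing beyond the standard spark-core API:

import org.apache.spark.{SparkConf, SparkContext}

object ForeachReturn {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("foreach-return"))
    val rdd = sc.parallelize(1 to 5)

    // foreach returns Unit; the println runs on the executors, so its output
    // shows up in the executor logs (or the local console in local mode),
    // never as a value on the driver.
    val nothing: Unit = rdd.foreach(x => println(x * 2))

    // To get values back on the driver, use a transformation plus an action
    // that returns data, e.g. map followed by collect.
    val doubled: Array[Int] = rdd.map(_ * 2).collect()
    doubled.foreach(println) // this println runs on the driver

    sc.stop()
  }
}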

David

"All that is gold does not glitter, Not all those who wander are lost."



 Original Message 
Subject: rdd.foreach return value
Local Time: January 18 2016 10:34 pm
UTC Time: January 19 2016 3:34 am
From: charles.up...@gmail.com
To: user@spark.apache.org


code snippet




the 'print' actually prints info on the worker node, but I am confused about
where the 'return' value goes, since I get nothing on the driver node.
--


--
a spark lover, a quant, a developer and a good man.

http://github.com/litaotao

[attachment: foreach.png]


Re: ROSE: Spark + R on the JVM.

2016-01-13 Thread David Russell
Hi Richard,

Thanks for providing the background on your application.

> the user types or copy-pastes his R code,
> the system should then send this R code (using ROSE) to R

Unfortunately this type of ad hoc R analysis is not supported. ROSE supports 
the execution of any R function or script within an existing R package on CRAN, 
Bioconductor, or GitHub. It does not support the direct execution of arbitrary 
blocks of R code as you described.

You may want to look at [DeployR](http://deployr.revolutionanalytics.com/), 
an open source R integration server that provides APIs in Java, JavaScript, 
and .NET and that can easily support your use case. The outputs of your DeployR 
integration could then become inputs to your data processing system.

David

"All that is gold does not glitter, Not all those who wander are lost."



 Original Message 
Subject: Re: ROSE: Spark + R on the JVM.
Local Time: January 13 2016 3:18 am
UTC Time: January 13 2016 8:18 am
From: rsiebel...@gmail.com
To: themarchoffo...@protonmail.com
CC: 
m...@vijaykiran.com,cjno...@gmail.com,user@spark.apache.org,d...@spark.apache.org


Hi David,

the use case is that we're building a data processing system with an intuitive 
user interface where Spark is used as the data processing framework.
We would like to provide an HTML user interface to R where the user types or 
copy-pastes his R code; the system should then send this R code (using ROSE) to 
R, process it, and give the results back to the user. The RDD would be used so 
that the data can be further processed by the system, but we would also like to 
be able to show the messages printed to STDOUT and the images (plots) that are 
generated by R. The plots seem to be available in the OpenCPU API, see below

Inline image 1

So the case is not that we're trying to process millions of images, but rather 
that we would like to show the user the plots (like a regression plot) generated 
in R. There could be several plots generated by the code, but certainly not 
thousands or even hundreds, only a few.

Hope this is possible using ROSE because it seems a really good fit.
thanks in advance,
Richard



On Wed, Jan 13, 2016 at 3:39 AM, David Russell <themarchoffo...@protonmail.com> 
wrote:

Hi Richard,


> Would it be possible to access the session API from within ROSE,
> to get for example the images that are generated by R / openCPU

Technically it would be possible, although there would be some potentially 
significant runtime costs per task in doing so, primarily those related to 
extracting image data from the R session, then serializing and moving that data 
across the cluster for each and every image.

From a design perspective, ROSE was intended to be used within Spark-scale 
applications where R object data is the primary task output: an output in a 
format that can be rapidly serialized and easily processed. Are there real-world 
use cases where Spark-scale applications capable of generating 10k, 100k, or 
even millions of image files would actually need to capture and store those 
images? If so, how, practically speaking, would these images ever be used? I'm 
just not sure. Maybe you could describe your own use case to provide some 
insights?


> and the logging to stdout that is logged by R?

If you are referring to the R console output (generated within the R session 
during the execution of an OCPUTask), then this data could certainly 
(optionally) be captured and returned on an OCPUResult. Again, can you provide 
any details on how you might use this console output in a real-world 
application?

As an aside, for simple standalone Spark applications that will only ever run 
on a single host (no cluster), you could consider using an alternative library 
called fluent-r. This library is also available under my GitHub repo, [see 
here](https://github.com/onetapbeyond/fluent-r). The fluent-r library already 
has support for the retrieval of R objects, R console output, and R graphics 
device images/plots. However, it is not as lightweight as ROSE and is not 
designed to work in a clustered environment. ROSE, on the other hand, is 
designed for scale.


David

"All that is gold does not glitter, Not all those who wander are lost."




 Original Message 
Subject: Re: ROSE: Spark + R on the JVM.


Local Time: January 12 2016 6:56 pm
UTC Time: January 12 2016 11:56 pm
From: rsiebel...@gmail.com
To: m...@vijaykiran.com
CC: 
cjno...@gmail.com,themarchoffo...@protonmail.com,user@spark.apache.org,d...@spark.apache.org



Hi,

this looks great and seems to be very usable.
Would it be possible to access the session API from within ROSE, to get, for 
example, the images that are generated by R / OpenCPU and the logging to stdout 
from R?

thanks in advance,
Richard



On Tue, Jan 12, 2016 at 10:16 PM, Vijay Kiran <m...@vijaykiran.com> wrote:

I think it would be this: https://github.com/onetapbeyond/opencpu-spark-executor

Re: ROSE: Spark + R on the JVM.

2016-01-12 Thread David Russell
Hi Richard,

> Would it be possible to access the session API from within ROSE,
> to get for example the images that are generated by R / openCPU

Technically it would be possible, although there would be some potentially 
significant runtime costs per task in doing so, primarily those related to 
extracting image data from the R session, then serializing and moving that data 
across the cluster for each and every image.

From a design perspective, ROSE was intended to be used within Spark-scale 
applications where R object data is the primary task output: an output in a 
format that can be rapidly serialized and easily processed. Are there real-world 
use cases where Spark-scale applications capable of generating 10k, 100k, or 
even millions of image files would actually need to capture and store those 
images? If so, how, practically speaking, would these images ever be used? I'm 
just not sure. Maybe you could describe your own use case to provide some 
insights?

> and the logging to stdout that is logged by R?

If you are referring to the R console output (generated within the R session 
during the execution of an OCPUTask), then this data could certainly 
(optionally) be captured and returned on an OCPUResult. Again, can you provide 
any details on how you might use this console output in a real-world 
application?

As an aside, for simple standalone Spark applications that will only ever run 
on a single host (no cluster), you could consider using an alternative library 
called fluent-r. This library is also available under my GitHub repo, [see 
here](https://github.com/onetapbeyond/fluent-r). The fluent-r library already 
has support for the retrieval of R objects, R console output, and R graphics 
device images/plots. However, it is not as lightweight as ROSE and is not 
designed to work in a clustered environment. ROSE, on the other hand, is 
designed for scale.

David

"All that is gold does not glitter, Not all those who wander are lost."



 Original Message 
Subject: Re: ROSE: Spark + R on the JVM.
Local Time: January 12 2016 6:56 pm
UTC Time: January 12 2016 11:56 pm
From: rsiebel...@gmail.com
To: m...@vijaykiran.com
CC: 
cjno...@gmail.com,themarchoffo...@protonmail.com,user@spark.apache.org,d...@spark.apache.org



Hi,

this looks great and seems to be very usable.
Would it be possible to access the session API from within ROSE, to get, for 
example, the images that are generated by R / OpenCPU and the logging to stdout 
from R?

thanks in advance,
Richard



On Tue, Jan 12, 2016 at 10:16 PM, Vijay Kiran  wrote:

I think it would be this: https://github.com/onetapbeyond/opencpu-spark-executor

> On 12 Jan 2016, at 18:32, Corey Nolet  wrote:
>


> David,
>
> Thank you very much for announcing this! It looks like it could be very 
> useful. Would you mind providing a link to the github?
>
> On Tue, Jan 12, 2016 at 10:03 AM, David  
> wrote:
> Hi all,
>
> I'd like to share news of the recent release of a new Spark package, ROSE.
>
> ROSE is a Scala library offering access to the full scientific computing 
> power of the R programming language to Apache Spark batch and streaming 
> applications on the JVM. Where Apache SparkR lets data scientists use Spark 
> from R, ROSE is designed to let Scala and Java developers use R from Spark.
>
> The project is available and documented on GitHub and I would encourage you 
> to take a look. Any feedback, questions etc very welcome.
>
> David
>
> "All that is gold does not glitter, Not all those who wander are lost."
>





ROSE: Spark + R on the JVM, now available.

2016-01-12 Thread David Russell
Hi all,

I'd like to share news of the recent release of a new Spark package, 
[ROSE](http://spark-packages.org/package/onetapbeyond/opencpu-spark-executor).

ROSE is a Scala library offering access to the full scientific computing power 
of the R programming language to Apache Spark batch and streaming applications 
on the JVM. Where Apache SparkR lets data scientists use Spark from R, ROSE is 
designed to let Scala and Java developers use R from Spark.

The project is available and documented [on 
GitHub](https://github.com/onetapbeyond/opencpu-spark-executor) and I would 
encourage you to [take a 
look](https://github.com/onetapbeyond/opencpu-spark-executor). Any feedback, 
questions etc very welcome.
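
If it helps to picture what "using R from Spark" means here, below is a
minimal, self-contained Scala sketch of the idea. It is deliberately not
the ROSE API: the callR helper simply posts to a local OpenCPU server
over HTTP (the /ocpu/library/<pkg>/R/<function>/json endpoint form and
the local port are assumptions), standing in for the per-record R calls
that ROSE wraps behind its own classes. The GitHub README shows the real
usage.

// Sketch of the idea only, NOT the ROSE API: callR is a hypothetical
// stand-in that invokes one R function on a local OpenCPU server.
import java.net.{HttpURLConnection, URL}
import org.apache.spark.{SparkConf, SparkContext}

object RoseIdeaSketch {

  // Hypothetical helper: run one R function call via OpenCPU's HTTP API
  // (endpoint form and port assumed; adjust to your OpenCPU setup).
  def callR(pkg: String, fun: String, args: String): String = {
    val url = new URL(s"http://localhost:5656/ocpu/library/$pkg/R/$fun/json")
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setDoOutput(true)
    conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded")
    conn.getOutputStream.write(args.getBytes("UTF-8"))
    scala.io.Source.fromInputStream(conn.getInputStream).mkString
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rose-idea-sketch"))
    val samples = sc.parallelize(Seq(10, 20, 30))

    // One R call per record: here, stats::rnorm(n) executed by R, with the
    // JSON result returned to the Spark task as a string.
    val rResults = samples.map(n => callR("stats", "rnorm", s"n=$n"))

    rResults.collect().foreach(println)
    sc.stop()
  }
}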

David

"All that is gold does not glitter, Not all those who wander are lost."

Re: ROSE: Spark + R on the JVM.

2016-01-12 Thread David Russell
Hi Corey,

> Would you mind providing a link to the github?

Sure, here is the github link you're looking for:

https://github.com/onetapbeyond/opencpu-spark-executor

David

"All that is gold does not glitter, Not all those who wander are lost."



 Original Message 
Subject: Re: ROSE: Spark + R on the JVM.
Local Time: January 12 2016 12:32 pm
UTC Time: January 12 2016 5:32 pm
From: cjno...@gmail.com
To: themarchoffo...@protonmail.com
CC: user@spark.apache.org,d...@spark.apache.org



David,
Thank you very much for announcing this! It looks like it could be very useful. 
Would you mind providing a link to the github?



On Tue, Jan 12, 2016 at 10:03 AM, David  wrote:

Hi all,

I'd like to share news of the recent release of a new Spark package, ROSE.

ROSE is a Scala library offering access to the full scientific computing power 
of the R programming language to Apache Spark batch and streaming applications 
on the JVM. Where Apache SparkR lets data scientists use Spark from R, ROSE is 
designed to let Scala and Java developers use R from Spark.

The project is available and documented on GitHub and I would encourage you to 
take a look. Any feedback, questions etc very welcome.

David

"All that is gold does not glitter, Not all those who wander are lost."