Re: Python to Java object conversion of numpy array

2015-01-12 Thread Meethu Mathew

Hi,

This is the function defined in PythonMLLibAPI.scala
def findPredict(
    data: JavaRDD[Vector],
    wt: Object,
    mu: Array[Object],
    si: Array[Object]): RDD[Array[Double]] = {
}

So the parameter mu should be converted to Array[Object].

mu = (Vectors.dense([0.8786, -0.7855]),Vectors.dense([-0.1863, 0.7799]))

def _py2java(sc, obj):
    if isinstance(obj, RDD):
        ...
    elif isinstance(obj, SparkContext):
        ...
    elif isinstance(obj, dict):
        ...
    elif isinstance(obj, (list, tuple)):
        obj = ListConverter().convert(obj, sc._gateway._gateway_client)
    elif isinstance(obj, JavaObject):
        pass
    elif isinstance(obj, (int, long, float, bool, basestring)):
        pass
    else:
        bytes = bytearray(PickleSerializer().dumps(obj))
        obj = sc._jvm.SerDe.loads(bytes)
    return obj

Since it's a tuple of DenseVectors, _py2java() enters the
isinstance(obj, (list, tuple)) branch and throws an exception (this happens
because the dimension of the tuple is > 1). However, the conversion works
correctly if the pickle conversion is used instead (the last else branch).
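
A possible workaround, sketched below on the assumption that each element
converts cleanly through the pickle branch (the helper name
_to_java_object_array is illustrative, not part of PySpark), is to convert
the elements one by one and build a java.lang.Object[] with py4j instead of
relying on ListConverter:

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.common import _py2java

def _to_java_object_array(sc, values):
    # Build a java.lang.Object[] of the right length via the py4j gateway.
    arr = sc._gateway.new_array(sc._jvm.java.lang.Object, len(values))
    for i, v in enumerate(values):
        # Each DenseVector falls into the final else branch of _py2java,
        # so it is pickled and rebuilt on the JVM side by SerDe.loads.
        arr[i] = _py2java(sc, v)
    return arr

mu = (Vectors.dense([0.8786, -0.7855]), Vectors.dense([-0.1863, 0.7799]))
# java_mu = _to_java_object_array(sc, mu)  # matches Array[Object] on the Scala side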


Hope it's clear now.

Regards,
Meethu

On Monday 12 January 2015 11:35 PM, Davies Liu wrote:

On Sun, Jan 11, 2015 at 10:21 PM, Meethu Mathew
 wrote:

Hi,

This is the code I am running.

mu = (Vectors.dense([0.8786, -0.7855]),Vectors.dense([-0.1863, 0.7799]))

membershipMatrix = callMLlibFunc("findPredict", rdd.map(_convert_to_vector),
mu)

What does the Java API look like? All the arguments of findPredict should be
converted into Java objects, so what should `mu` be converted to?


Regards,
Meethu
On Monday 12 January 2015 11:46 AM, Davies Liu wrote:

Could you post a piece of code here?

On Sun, Jan 11, 2015 at 9:28 PM, Meethu Mathew 
wrote:

Hi,
Thanks Davies .

I added a new class GaussianMixtureModel in clustering.py along with a
predict method, and I am trying to pass a numpy array from that method. I
converted it to a DenseVector and it's solved now.
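
For reference, the kind of wrapper described above might look roughly like
this (the class shape and argument names are assumptions based on this
thread, not the code that was actually committed to Spark):

from pyspark.mllib.common import callMLlibFunc
from pyspark.mllib.linalg import _convert_to_vector

class GaussianMixtureModel(object):
    """Sketch only: mirrors the wrapper pattern used in pyspark.mllib."""

    def __init__(self, weights, means, sigmas):
        self.weights = weights
        self.means = means
        self.sigmas = sigmas

    def predict(self, rdd):
        # Convert each numpy row into a DenseVector before it crosses the
        # Py4J boundary, then delegate to the Scala findPredict method.
        return callMLlibFunc("findPredict", rdd.map(_convert_to_vector),
                             self.weights, self.means, self.sigmas)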

Similarly, I tried passing a list of more than one dimension to the function
_py2java, but now the exception is

'list' object has no attribute '_get_object_id'

and when I give a tuple input (Vectors.dense([0.8786, -0.7855]),
Vectors.dense([-0.1863, 0.7799])) the exception is

'numpy.ndarray' object has no attribute '_get_object_id'

Regards,



Meethu Mathew

Engineer

Flytxt

www.flytxt.com | Visit our blog | Follow us | Connect on LinkedIn



On Friday 09 January 2015 11:37 PM, Davies Liu wrote:

Hey Meethu,

The Java API accepts only Vector, so you should convert the numpy array into
a pyspark.mllib.linalg.DenseVector.
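
For example (the variable names are illustrative only):

import numpy as np
from pyspark.mllib.linalg import Vectors

features = np.array([0.8786, -0.7855])   # the 1-D numpy array to convert
vec = Vectors.dense(features)            # now a pyspark.mllib.linalg.DenseVector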

BTW, which class are you using? KMeansModel.predict() accepts numpy.array;
it will do the conversion for you.

Davies

On Fri, Jan 9, 2015 at 4:45 AM, Meethu Mathew 
wrote:

Hi,
I am trying to send a numpy array as an argument to a function predict() in
a class in spark/python/pyspark/mllib/clustering.py, which passes it to the
function callMLlibFunc(name, *args) in spark/python/pyspark/mllib/common.py.

Now the value is passed to the function _py2java(sc, obj). Here I am
getting an exception:

Py4JJavaError: An error occurred while calling
z:org.apache.spark.mllib.api.python.SerDe.loads.
: net.razorvine.pickle.PickleException: expected zero arguments for
construction of ClassDict (for numpy.core.multiarray._reconstruct)
 at
net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
 at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
 at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
 at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
 at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)


Why is common._py2java(sc, obj) not handling the numpy array type?

Please help.


--

Regards,

*Meethu Mathew*

*Engineer*

*Flytxt*

www.flytxt.com | Visit our blog | Follow us | Connect on LinkedIn







Re: Re-use scaling means and variances from StandardScalerModel

2015-01-12 Thread Octavian Geagla
Thanks for the suggestions.  

I've opened this JIRA ticket:
https://issues.apache.org/jira/browse/SPARK-5207
Feel free to modify it, assign it to me, kick off a discussion, etc.  

I'd be more than happy to own this feature and PR.

Thanks,
-Octavian



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Re-use-scaling-means-and-variances-from-StandardScalerModel-tp10073p10092.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: Python to Java object conversion of numpy array

2015-01-12 Thread Davies Liu
On Sun, Jan 11, 2015 at 10:21 PM, Meethu Mathew
 wrote:
> Hi,
>
> This is the code I am running.
>
> mu = (Vectors.dense([0.8786, -0.7855]),Vectors.dense([-0.1863, 0.7799]))
>
> membershipMatrix = callMLlibFunc("findPredict", rdd.map(_convert_to_vector),
> mu)

What does the Java API look like? All the arguments of findPredict should be
converted into Java objects, so what should `mu` be converted to?

> Regards,
> Meethu
> On Monday 12 January 2015 11:46 AM, Davies Liu wrote:
>
> Could you post a piece of code here?
>
> On Sun, Jan 11, 2015 at 9:28 PM, Meethu Mathew 
> wrote:
>
> Hi,
> Thanks Davies .
>
> I added a new class GaussianMixtureModel in clustering.py and the method
> predict in it and trying to pass numpy array from this method.I converted it
> to DenseVector and its solved now.
>
> Similarly I tried passing a List  of more than one dimension to the function
> _py2java , but now the exception is
>
> 'list' object has no attribute '_get_object_id'
>
> and when I give a tuple input (Vectors.dense([0.8786,
> -0.7855]),Vectors.dense([-0.1863, 0.7799])) exception is like
>
> 'numpy.ndarray' object has no attribute '_get_object_id'
>
> Regards,
>
>
>
> Meethu Mathew
>
> Engineer
>
> Flytxt
>
> www.flytxt.com | Visit our blog  |  Follow us | Connect on Linkedin
>
>
>
> On Friday 09 January 2015 11:37 PM, Davies Liu wrote:
>
> Hey Meethu,
>
> The Java API accepts only Vector, so you should convert the numpy array into
> pyspark.mllib.linalg.DenseVector.
>
> BTW, which class are you using? the KMeansModel.predict() accept
> numpy.array,
> it will do the conversion for you.
>
> Davies
>
> On Fri, Jan 9, 2015 at 4:45 AM, Meethu Mathew 
> wrote:
>
> Hi,
> I am trying to send a numpy array as an argument to a function predict() in
> a class in spark/python/pyspark/mllib/clustering.py which is passed to the
> function callMLlibFunc(name, *args)  in
> spark/python/pyspark/mllib/common.py.
>
> Now the value is passed to the function  _py2java(sc, obj) .Here I am
> getting an exception
>
> Py4JJavaError: An error occurred while calling
> z:org.apache.spark.mllib.api.python.SerDe.loads.
> : net.razorvine.pickle.PickleException: expected zero arguments for
> construction of ClassDict (for numpy.core.multiarray._reconstruct)
> at
> net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
> at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
> at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
> at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
> at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
>
>
> Why common._py2java(sc, obj) is not handling numpy array type?
>
> Please help..
>
>
> --
>
> Regards,
>
> *Meethu Mathew*
>
> *Engineer*
>
> *Flytxt*
>
> www.flytxt.com | Visit our blog  | Follow us
>  | _Connect on Linkedin
> _
>
>
>




Re: Discussion | SparkContext 's setJobGroup and clearJobGroup should return a new instance of SparkContext

2015-01-12 Thread Erik Erlandson
setJobGroup needs fixing:
https://issues.apache.org/jira/browse/SPARK-4514

I'm interested in any community input on what the semantics or design "ought" 
to be changed to.


- Original Message -
> Hi spark committers
> 
> I would like to discuss the possibility of changing the signature
> of SparkContext 's setJobGroup and clearJobGroup functions to return a
> replica of SparkContext with the job group set/unset instead of mutating
> the original context. I am building a spark job server and I am assigning
> job groups before passing control to user provided logic that uses spark
> context to define and execute a job (very much like job-server). The issue
> is that I can't reliably know when to clear the job group as user defined
> code can use futures to submit multiple tasks in parallel. In fact, I am
> even allowing users to return a future from their function on which spark
> server can register callbacks to know when the user defined job is
> complete. Now, if I set the job group before passing control to user
> function and wait on future to complete so that I can clear the job group,
> I can no longer use that SparkContext for any other job. This means I will
> have to lock on the SparkContext which seems like a bad idea. Therefore, my
> proposal would be to return new instance of SparkContext (a replica with
> just job group set/unset) that can further be used in concurrent
> environment safely. I am also happy mutating the original SparkContext just
> not break backward compatibility as long as the returned SparkContext is
> not affected by set/unset of job groups on original SparkContext.
> 
> Thoughts please?
> 
> Thanks,
> Aniket
> 




Re: Apache Spark client high availability

2015-01-12 Thread Akhil Das
We usually run Spark in HA with the following stack:

-> Apache Mesos
-> Marathon - an init/control system for starting, stopping, and maintaining
always-on applications (mainly Spark Streaming)
-> Chronos - a general-purpose scheduler for Mesos that supports job dependency
graphs
-> Spark Job Server - primarily for its ability to reuse shared contexts
across multiple jobs

This thread has a better discussion:
http://apache-spark-user-list.1001560.n3.nabble.com/How-do-you-run-your-spark-app-td7935.html


Thanks
Best Regards

On Mon, Jan 12, 2015 at 10:08 PM, preeze  wrote:

> Dear community,
>
> I've been searching the internet for quite a while to find out what is the
> best architecture to support HA for a spark client.
>
> We run an application that connects to a standalone Spark cluster and
> caches
> a big chuck of data for subsequent intensive calculations. To achieve HA
> we'll need to run several instances of the application on different hosts.
>
> Initially I explored the option to reuse (i.e. share) the same executors
> set
> between SparkContext instances of all running applications. Found it
> impossible.
>
> So, every application, which creates an instance of SparkContext, has to
> spawn its own executors. Externalizing and sharing executors' memory cache
> with Tachyon is a semi-solution since each application's executors will
> keep
> using their own set of CPU cores.
>
> Spark-jobserver is another possibility. It manages SparkContext itself and
> accepts job requests from multiple clients for the same context which is
> brilliant. However, this becomes a new single point of failure.
>
> Now I am exploring if it's possible to run the Spark cluster in YARN
> cluster
> mode and connect to the driver from multiple clients.
>
> Is there anything I am missing guys?
> Any suggestion is highly appreciated!
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Apache-Spark-client-high-availability-tp10088.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Apache Spark client high availability

2015-01-12 Thread preeze
Dear community,

I've been searching the internet for quite a while to find out what the best
architecture is to support HA for a Spark client.

We run an application that connects to a standalone Spark cluster and caches
a big chunk of data for subsequent intensive calculations. To achieve HA,
we'll need to run several instances of the application on different hosts.

Initially I explored the option of reusing (i.e. sharing) the same set of
executors between the SparkContext instances of all running applications, but
found it impossible.

So every application that creates an instance of SparkContext has to spawn
its own executors. Externalizing and sharing the executors' memory cache with
Tachyon is only a partial solution, since each application's executors will
keep using their own set of CPU cores.

Spark-jobserver is another possibility. It manages the SparkContext itself
and accepts job requests from multiple clients for the same context, which is
brilliant. However, it becomes a new single point of failure.

Now I am exploring whether it's possible to run the Spark cluster in YARN
cluster mode and connect to the driver from multiple clients.

Is there anything I am missing, guys?
Any suggestion is highly appreciated!



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Apache-Spark-client-high-availability-tp10088.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: YARN | SPARK-5164 | Submitting jobs from windows to linux YARN

2015-01-12 Thread Aniket Bhatnagar
Oh right, it is. I will mark my defect as a duplicate and cross-check my
notes against the fixes in the pull request. Thanks for pointing it out, Zsolt :)

On Mon, Jan 12, 2015, 7:42 PM Zsolt Tóth  wrote:

> Hi Aniket,
>
> I think this is a duplicate of SPARK-1825, isn't it?
>
> Zsolt
>
> 2015-01-12 14:38 GMT+01:00 Aniket Bhatnagar :
>
>> Hi Spark YARN maintainers
>>
>> Can anyone please look and comment on SPARK-5164? Basically, this stops
>> users from submitting jobs (or using spark shell) from a windows machine
>> to
>> a a YARN cluster running on linux. I should be able to submit a pull
>> request for this provided the community agrees. This would be a great help
>> for windows users (like me).
>>
>> Thanks,
>> Aniket
>>
>
>


Re: YARN | SPARK-5164 | Submitting jobs from windows to linux YARN

2015-01-12 Thread Zsolt Tóth
Hi Aniket,

I think this is a duplicate of SPARK-1825, isn't it?

Zsolt

2015-01-12 14:38 GMT+01:00 Aniket Bhatnagar :

> Hi Spark YARN maintainers
>
> Can anyone please look and comment on SPARK-5164? Basically, this stops
> users from submitting jobs (or using spark shell) from a windows machine to
> a a YARN cluster running on linux. I should be able to submit a pull
> request for this provided the community agrees. This would be a great help
> for windows users (like me).
>
> Thanks,
> Aniket
>


YARN | SPARK-5164 | Submitting jobs from windows to linux YARN

2015-01-12 Thread Aniket Bhatnagar
Hi Spark YARN maintainers

Can anyone please take a look at and comment on SPARK-5164? Basically, this
stops users from submitting jobs (or using the Spark shell) from a Windows
machine to a YARN cluster running on Linux. I should be able to submit a pull
request for this, provided the community agrees. This would be a great help
for Windows users (like me).

Thanks,
Aniket


Discussion | SparkContext 's setJobGroup and clearJobGroup should return a new instance of SparkContext

2015-01-12 Thread Aniket Bhatnagar
Hi Spark committers

I would like to discuss the possibility of changing the signatures of
SparkContext's setJobGroup and clearJobGroup functions to return a replica of
SparkContext with the job group set/unset, instead of mutating the original
context. I am building a Spark job server and I am assigning job groups
before passing control to user-provided logic that uses the Spark context to
define and execute a job (very much like job-server). The issue is that I
can't reliably know when to clear the job group, as user-defined code can use
futures to submit multiple tasks in parallel. In fact, I am even allowing
users to return a future from their function, on which the job server can
register callbacks to know when the user-defined job is complete. Now, if I
set the job group before passing control to the user function and wait on the
future to complete so that I can clear the job group, I can no longer use
that SparkContext for any other job. This means I would have to lock on the
SparkContext, which seems like a bad idea. Therefore, my proposal is to
return a new instance of SparkContext (a replica with just the job group
set/unset) that can safely be used in a concurrent environment. I am also
happy to keep mutating the original SparkContext so as not to break backward
compatibility, as long as the returned SparkContext is not affected by
setting/unsetting job groups on the original SparkContext.
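
To make the problem concrete, here is a minimal sketch (assuming a shared
SparkContext `sc` and user-supplied callables that submit Spark jobs) of the
locking workaround I would like to avoid:

import threading

_group_lock = threading.Lock()

def submit_with_group(sc, group_id, user_fn):
    # setJobGroup mutates shared state on the context, so every submission
    # has to hold the lock until the user's job finishes, which serializes
    # all jobs on this SparkContext.
    with _group_lock:
        sc.setJobGroup(group_id, "submitted by the job server")
        return user_fn(sc)

With a replica-returning setJobGroup, each call site would get its own tagged
context and the lock (and the serialization it forces) would no longer be
needed.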

Thoughts please?

Thanks,
Aniket