unsubscribe

2020-01-08 Thread MEETHU MATHEW



Thanks & Regards, Meethu M




Filtering based on a float value with more than one decimal place not working correctly in Pyspark dataframe

2018-09-25 Thread Meethu Mathew
Hi all,

I tried the following code and the output was not as expected.

from pyspark.sql.types import StructType, StructField, StringType, FloatType

schema = StructType([StructField('Id', StringType(), False),
                     StructField('Value', FloatType(), False)])
df_test = spark.createDataFrame([('a', 5.0), ('b', 1.236), ('c', -0.31)], schema)

df_test


Output: DataFrame[Id: string, Value: float]
[screenshot omitted]
But when the value is given as a string, it worked.
[screenshot omitted]
I tried again with a floating-point number with one decimal place, and it
worked.
[screenshot omitted]
And when the equals operation is changed to greater than or less than, it
works even with numbers that have more than one decimal place.
[screenshot omitted]
Is this a bug?
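[Editor's note: this looks like the usual 32-bit vs 64-bit float precision behaviour rather than a filtering bug. The exact predicate used in the screenshots is not shown, so the filter below is an assumption; the sketch assumes a normal SparkSession and the workarounds are illustrative only.]

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, FloatType

spark = SparkSession.builder.getOrCreate()
schema = StructType([StructField('Id', StringType(), False),
                     StructField('Value', FloatType(), False)])
df_test = spark.createDataFrame([('a', 5.0), ('b', 1.236), ('c', -0.31)], schema)

# Likely returns no rows: the FloatType column is widened to a double
# (~1.2359999...), which is not equal to the Python double literal 1.236.
df_test.filter(df_test.Value == 1.236).show()

# Workaround 1: cast the literal down to float so both sides round the same way.
df_test.filter(df_test.Value == F.lit(1.236).cast('float')).show()

# Workaround 2: compare within a tolerance instead of using exact equality.
df_test.filter(F.abs(df_test.Value - 1.236) < 1e-6).show()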

Regards,
Meethu Mathew


Re: Failed to run spark jobs on mesos due to "hadoop" not found.

2016-11-18 Thread Meethu Mathew
Hi,

Add HADOOP_HOME=/path/to/hadoop/folder to /etc/default/mesos-slave on all
Mesos agents and restart the Mesos agents.

Regards,
Meethu Mathew


On Thu, Nov 10, 2016 at 4:57 PM, Yu Wei <yu20...@hotmail.com> wrote:

> Hi Guys,
>
> I failed to launch spark jobs on mesos. Actually I submitted the job to
> cluster successfully.
>
> But the job failed to run.
>
> I1110 18:25:11.095507   301 fetcher.cpp:498] Fetcher Info:
> {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/1f8e621b-3cbf-4b86-a1c1-
> 9e2cf77265ee-S7\/root","items":[{"action":"BYPASS_CACHE","
> uri":{"extract":true,"value":"hdfs:\/\/192.168.111.74:9090\/
> bigdata\/package\/spark-examples_2.11-2.0.1.jar"}}],"
> sandbox_directory":"\/var\/lib\/mesos\/agent\/slaves\/
> 1f8e621b-3cbf-4b86-a1c1-9e2cf77265ee-S7\/frameworks\/
> 1f8e621b-3cbf-4b86-a1c1-9e2cf77265ee-0002\/executors\/
> driver-20161110182510-0001\/runs\/b561328e-9110-4583-b740-
> 98f9653e7fc2","user":"root"}
> I1110 18:25:11.099799   301 fetcher.cpp:409] Fetching URI 'hdfs://
> 192.168.111.74:9090/bigdata/package/spark-examples_2.11-2.0.1.jar'
> I1110 18:25:11.099820   301 fetcher.cpp:250] Fetching directly into the
> sandbox directory
> I1110 18:25:11.099862   301 fetcher.cpp:187] Fetching URI 'hdfs://
> 192.168.111.74:9090/bigdata/package/spark-examples_2.11-2.0.1.jar'
> E1110 18:25:11.101842   301 shell.hpp:106] Command 'hadoop version 2>&1'
> failed; this is the output:
> sh: hadoop: command not found
> Failed to fetch 'hdfs://192.168.111.74:9090/bigdata/package/spark-
> examples_2.11-2.0.1.jar': Failed to create HDFS client: Failed to execute
> 'hadoop version 2>&1'; the command was either not found or exited with a
> non-zero exit status: 127
> Failed to synchronize with agent (it's probably exited
>
> Actually I installed hadoop on each agent node.
>
>
> Any advice?
>
>
> Thanks,
>
> Jared, (韦煜)
> Software developer
> Interested in open source software, big data, Linux
>


Re: Please reply if you use Mesos fine grained mode

2015-11-03 Thread MEETHU MATHEW
Hi,
We are using Mesos fine-grained mode because it lets multiple instances of
Spark share machines, with each application getting its resources allocated
dynamically.

Thanks & Regards,
Meethu M


 On Wednesday, 4 November 2015 5:24 AM, Reynold Xin  
wrote:
   

 If you are using Spark with Mesos fine grained mode, can you please respond to 
this email explaining why you use it over the coarse grained mode?
Thanks.


  

Spark 1.6 Release window is not updated in Spark-wiki

2015-10-01 Thread Meethu Mathew
Hi,
On the wiki homepage (https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage),
the current release window has not been changed from 1.5. Can anybody give an
idea of the expected dates for the 1.6 release?

Regards,

Meethu Mathew
Senior Engineer
Flytxt


[MLlib] Contributing algorithm for DP means clustering

2015-06-17 Thread Meethu Mathew
Hi all,

At present, all the clustering algorithms in MLlib require the number of
clusters to be specified in advance.

The Dirichlet process (DP) is a popular non-parametric Bayesian mixture
model that allows for flexible clustering of data without having to specify
the number of clusters a priori. DP-means is a non-parametric clustering
algorithm that uses a scale parameter 'lambda' to control the creation of
new clusters.
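[Editor's note: to make the above concrete, here is a minimal single-machine sketch of DP-means in NumPy. It is illustrative only, not the distributed implementation proposed in the JIRA, and the lambda value in the usage example is arbitrary.]

import numpy as np

def dp_means(X, lam, max_iter=100):
    # Minimal local DP-means (the objective of Kulis and Jordan, 2012);
    # lam penalises opening new clusters.
    centers = [X.mean(axis=0)]                     # start from one global cluster
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        changed = False
        for i, x in enumerate(X):
            d2 = [float(np.sum((x - c) ** 2)) for c in centers]
            k = int(np.argmin(d2))
            if d2[k] > lam:                        # farther than lam from every centre
                centers.append(x.copy())
                k = len(centers) - 1
            if labels[i] != k:
                labels[i] = k
                changed = True
        for k in range(len(centers)):              # recompute means of non-empty clusters
            members = X[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
        if not changed:
            break
    return np.array(centers), labels

# Example: two well-separated blobs come out as two clusters without fixing k.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + 8.0])
centers, labels = dp_means(X, lam=9.0)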

We have followed the distributed implementation of DP means which has been
proposed in the paper titled MLbase: Distributed Machine Learning Made
Easy by Xinghao Pan, Evan R. Sparks, Andre Wibisono.

I have raised a JIRA ticket at
https://issues.apache.org/jira/browse/SPARK-8402

Suggestions and guidance are welcome.

Regards,

Meethu Mathew
Senior Engineer
Flytxt
www.flytxt.com | Visit our blog http://blog.flytxt.com/ | Follow us
http://www.twitter.com/flytxt | Connect on LinkedIn
http://www.linkedin.com/company/22166?goback=%2Efcs_GLHD_flytxt_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2trk=ncsrch_hits


Re: Anyone facing problem in incremental building of individual project

2015-06-04 Thread Meethu Mathew
Hi,
I added createDependencyReducedPom to my pom.xml and the problem is solved.

+  <!-- Work around MSHADE-148 -->
+  <createDependencyReducedPom>false</createDependencyReducedPom>

Thank you @Steve and @Ted.


Regards,

Meethu Mathew
Senior Engineer
Flytxt
On Thu, Jun 4, 2015 at 9:51 PM, Ted Yu yuzhih...@gmail.com wrote:

 Andrew Or put in this workaround :

 diff --git a/pom.xml b/pom.xml
 index 0b1aaad..d03d33b 100644
 --- a/pom.xml
 +++ b/pom.xml
 @@ -1438,6 +1438,8 @@
          <version>2.3</version>
          <configuration>
            <shadedArtifactAttached>false</shadedArtifactAttached>
 +          <!-- Work around MSHADE-148 -->
 +          <createDependencyReducedPom>false</createDependencyReducedPom>
            <artifactSet>
              <includes>
                <!-- At a minimum we must include this to force effective
                     pom generation -->

 FYI

 On Thu, Jun 4, 2015 at 6:25 AM, Steve Loughran ste...@hortonworks.com
 wrote:


  On 4 Jun 2015, at 11:16, Meethu Mathew meethu.mat...@flytxt.com wrote:

  Hi all,

  ​I added some new code to MLlib. When I am trying to build only the
 mllib project using mvn --projects mllib/ -DskipTests clean install, after
 setting export SPARK_PREPEND_CLASSES=true, the build is getting stuck with
 the following message.



  Excluding org.jpmml:pmml-schema:jar:1.1.15 from the shaded jar.
 [INFO] Excluding com.sun.xml.bind:jaxb-impl:jar:2.2.7 from the shaded
 jar.
 [INFO] Excluding com.sun.xml.bind:jaxb-core:jar:2.2.7 from the shaded
 jar.
 [INFO] Excluding javax.xml.bind:jaxb-api:jar:2.2.7 from the shaded jar.
 [INFO] Including org.spark-project.spark:unused:jar:1.0.0 in the shaded
 jar.
 [INFO] Excluding org.scala-lang:scala-reflect:jar:2.10.4 from the shaded
 jar.
 [INFO] Replacing original artifact with shaded artifact.
 [INFO] Replacing
 /home/meethu/git/FlytxtRnD/spark/mllib/target/spark-mllib_2.10-1.4.0-SNAPSHOT.jar
 with
 /home/meethu/git/FlytxtRnD/spark/mllib/target/spark-mllib_2.10-1.4.0-SNAPSHOT-shaded.jar
 [INFO] Dependency-reduced POM written at:
 /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml
 [INFO] Dependency-reduced POM written at:
 /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml
 [INFO] Dependency-reduced POM written at:
 /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml
 [INFO] Dependency-reduced POM written at:
 /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml

.



  I've seen something similar in a different build,

  It looks like MSHADE-148:
 https://issues.apache.org/jira/browse/MSHADE-148
 if you apply Tom White's patch, does your problem go away?





Anyone facing problem in incremental building of individual project

2015-06-04 Thread Meethu Mathew
Hi all,

​I added some new code to MLlib. When I am trying to build only the mllib
project using mvn --projects mllib/ -DskipTests clean install, after setting
export SPARK_PREPEND_CLASSES=true, the build is getting stuck with the
following message.



  Excluding org.jpmml:pmml-schema:jar:1.1.15 from the shaded jar.
 [INFO] Excluding com.sun.xml.bind:jaxb-impl:jar:2.2.7 from the shaded jar.
 [INFO] Excluding com.sun.xml.bind:jaxb-core:jar:2.2.7 from the shaded jar.
 [INFO] Excluding javax.xml.bind:jaxb-api:jar:2.2.7 from the shaded jar.
 [INFO] Including org.spark-project.spark:unused:jar:1.0.0 in the shaded
 jar.
 [INFO] Excluding org.scala-lang:scala-reflect:jar:2.10.4 from the shaded
 jar.
 [INFO] Replacing original artifact with shaded artifact.
 [INFO] Replacing
 /home/meethu/git/FlytxtRnD/spark/mllib/target/spark-mllib_2.10-1.4.0-SNAPSHOT.jar
 with
 /home/meethu/git/FlytxtRnD/spark/mllib/target/spark-mllib_2.10-1.4.0-SNAPSHOT-shaded.jar
 [INFO] Dependency-reduced POM written at:
 /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml
 [INFO] Dependency-reduced POM written at:
 /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml
 [INFO] Dependency-reduced POM written at:
 /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml
 [INFO] Dependency-reduced POM written at:
 /home/meethu/git/FlytxtRnD/spark/mllib/dependency-reduced-pom.xml

   .

​But  a full build completes as usual. Please help if anyone is facing the
same issue.

Regards,

Meethu Mathew
Senior Engineer
Flytxt


Regarding Connecting spark to Mesos documentation

2015-05-20 Thread Meethu Mathew

Hi List,

In the documentation on connecting Spark to Mesos
(http://spark.apache.org/docs/latest/running-on-mesos.html#connecting-spark-to-mesos),
is it possible to expand the step "Create a binary package using
make-distribution.sh --tgz"? When we use a custom-compiled version of Spark,
we usually specify a Hadoop version (which is not the default one). In this
case, make-distribution.sh should be given the same Maven options that we used
for building Spark. This is not specified in the documentation. Please correct
me if I am wrong.


Regards,
Meethu Mathew


Re: Speeding up Spark build during development

2015-05-04 Thread Meethu Mathew

Hi,

Is it really necessary to run mvn --projects assembly/ -DskipTests install?
Could you please explain why this is needed?
I got the changes after running mvn --projects streaming/ -DskipTests package.


Regards,
Meethu

On Monday 04 May 2015 02:20 PM, Emre Sevinc wrote:

Just to give you an example:

When I was trying to make a small change only to the Streaming component of
Spark, first I built and installed the whole Spark project (this took about
15 minutes on my 4-core, 4 GB RAM laptop). Then, after having changed files
only in Streaming, I ran something like (in the top-level directory):

mvn --projects streaming/ -DskipTests package

and then

mvn --projects assembly/ -DskipTests install


This was much faster than trying to build the whole Spark from scratch,
because Maven was only building one component, in my case the Streaming
component, of Spark. I think you can use a very similar approach.

--
Emre Sevinç



On Mon, May 4, 2015 at 10:44 AM, Pramod Biligiri pramodbilig...@gmail.com
wrote:


No, I just need to build one project at a time. Right now SparkSql.

Pramod

On Mon, May 4, 2015 at 12:09 AM, Emre Sevinc emre.sev...@gmail.com
wrote:


Hello Pramod,

Do you need to build the whole project every time? Generally you don't,
e.g., when I was changing some files that belong only to Spark Streaming, I
was building only the streaming (of course after having build and installed
the whole project, but that was done only once), and then the assembly.
This was much faster than trying to build the whole Spark every time.

--
Emre Sevinç

On Mon, May 4, 2015 at 9:01 AM, Pramod Biligiri pramodbilig...@gmail.com

wrote:
Using the inbuilt maven and zinc it takes around 10 minutes for each
build.
Is that reasonable?
My maven opts looks like this:
$ echo $MAVEN_OPTS
-Xmx12000m -XX:MaxPermSize=2048m

I'm running it as build/mvn -DskipTests package

Should I be tweaking my Zinc/Nailgun config?

Pramod

On Sun, May 3, 2015 at 3:40 PM, Mark Hamstra m...@clearstorydata.com
wrote:




https://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn

On Sun, May 3, 2015 at 2:54 PM, Pramod Biligiri 

pramodbilig...@gmail.com

wrote:


This is great. I didn't know about the mvn script in the build directory.

Pramod

On Fri, May 1, 2015 at 9:51 AM, York, Brennon 
brennon.y...@capitalone.com
wrote:


Following what Ted said, if you leverage the `mvn` from within the
`build/` directory of Spark you'll get zinc for free, which should help
speed up build times.

On 5/1/15, 9:45 AM, Ted Yu yuzhih...@gmail.com wrote:


Pramod:
Please remember to run Zinc so that the build is faster.

Cheers

On Fri, May 1, 2015 at 9:36 AM, Ulanov, Alexander
alexander.ula...@hp.com
wrote:


Hi Pramod,

For cluster-like tests you might want to use the same code as in mllib's
LocalClusterSparkContext. You can rebuild only the package that you change
and then run this main class.

Best regards, Alexander

-Original Message-
From: Pramod Biligiri [mailto:pramodbilig...@gmail.com]
Sent: Friday, May 01, 2015 1:46 AM
To: dev@spark.apache.org
Subject: Speeding up Spark build during development

Hi,
I'm making some small changes to the Spark codebase and trying

it out

on a
cluster. I was wondering if there's a faster way to build than

running

the
package target each time.
Currently I'm using: mvn -DskipTests  package

All the nodes have the same filesystem mounted at the same mount

point.

Pramod











--
Emre Sevinc









Mail to u...@spark.apache.org failing

2015-02-09 Thread Meethu Mathew

Hi,

The mail id given on
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark seems
to be failing. Can anyone tell me how to get added to the Powered By Spark list?


--

Regards,

*Meethu*


Test suites in the python wrapper of kmeans failing

2015-01-21 Thread Meethu Mathew

Hi,

The test suites in the KMeans class in clustering.py are not updated to
take the seed value and hence they are failing.
Shall I make the changes and submit them along with my PR (Python API for
Gaussian Mixture Model), or create a JIRA?


Regards,
Meethu




Re: Test suites in the python wrapper of kmeans failing

2015-01-21 Thread Meethu Mathew

Hi,

Sorry it was my mistake. My code was not properly built.

Regards,
Meethu



On Thursday 22 January 2015 10:39 AM, Meethu Mathew wrote:

Hi,

The test suites in the KMeans class in clustering.py are not updated to
take the seed value and hence they are failing.
Shall I make the changes and submit them along with my PR (Python API for
Gaussian Mixture Model), or create a JIRA?

Regards,
Meethu






Use of MapConverter, ListConverter in python to java object conversion

2015-01-13 Thread Meethu Mathew

Hi all,

In the Python-to-Java object conversion done in the method _py2java in
spark/python/pyspark/mllib/common.py, why are we doing individual
conversions using MapConverter and ListConverter? The same can be achieved
using


bytes = bytearray(PickleSerializer().dumps(obj))
obj = sc._jvm.SerDe.loads(bytes)

Is there any performance gain or something in using individual 
converters rather than PickleSerializer?
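[Editor's note: for context, the two conversion paths look roughly like this. It is a sketch only, mirroring what _py2java does; it assumes a live SparkContext `sc`, and the short sc._jvm.SerDe name resolves because pyspark/mllib/common.py java-imports org.apache.spark.mllib.api.python.*.]

from py4j.java_collections import ListConverter
from pyspark.serializers import PickleSerializer

py_list = [1.0, 2.0, 3.0]

# Path 1: the py4j converter builds a java.util.List proxy element by element.
jlist = ListConverter().convert(py_list, sc._gateway._gateway_client)

# Path 2 (the last else branch): pickle the whole object and let MLlib's SerDe
# unpickle it into Java objects on the JVM side.
data = bytearray(PickleSerializer().dumps(py_list))
jobj = sc._jvm.SerDe.loads(data)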


--

Regards,

*Meethu*


Re: Python to Java object conversion of numpy array

2015-01-12 Thread Meethu Mathew

Hi,

This is the function defined in PythonMLLibAPI.scala
def findPredict(
  data: JavaRDD[Vector],
  wt: Object,
  mu: Array[Object],
  si: Array[Object]):  RDD[Array[Double]]  = {
}

So the parameter mu should be converted to Array[Object].

mu = (Vectors.dense([0.8786, -0.7855]),Vectors.dense([-0.1863, 0.7799]))

def _py2java(sc, obj):

if isinstance(obj, RDD):
...
elif isinstance(obj, SparkContext):
  ...
elif isinstance(obj, dict):
   ...
elif isinstance(obj, (list, tuple)):
obj = ListConverter().convert(obj, sc._gateway._gateway_client)
elif isinstance(obj, JavaObject):
pass
elif isinstance(obj, (int, long, float, bool, basestring)):
pass
else:
bytes = bytearray(PickleSerializer().dumps(obj))
obj = sc._jvm.SerDe.loads(bytes)
return obj

Since it's a tuple of DenseVectors, in _py2java() it enters the
isinstance(obj, (list, tuple)) branch and throws an exception (this happens
because the dimension of the tuple is > 1). However, the conversion occurs
correctly if the pickle conversion is done (the last else part).

Hope it's clear now.

Regards,
Meethu

On Monday 12 January 2015 11:35 PM, Davies Liu wrote:

On Sun, Jan 11, 2015 at 10:21 PM, Meethu Mathew
meethu.mat...@flytxt.com wrote:

Hi,

This is the code I am running.

mu = (Vectors.dense([0.8786, -0.7855]),Vectors.dense([-0.1863, 0.7799]))

membershipMatrix = callMLlibFunc("findPredict", rdd.map(_convert_to_vector),
mu)

What does the Java API look like? All the arguments of findPredict should be
converted into Java objects, so what should `mu` be converted to?


Regards,
Meethu
On Monday 12 January 2015 11:46 AM, Davies Liu wrote:

Could you post a piece of code here?

On Sun, Jan 11, 2015 at 9:28 PM, Meethu Mathew meethu.mat...@flytxt.com
wrote:

Hi,
Thanks Davies .

I added a new class GaussianMixtureModel in clustering.py and the method
predict in it and trying to pass numpy array from this method.I converted it
to DenseVector and its solved now.

Similarly I tried passing a List  of more than one dimension to the function
_py2java , but now the exception is

'list' object has no attribute '_get_object_id'

and when I give a tuple input (Vectors.dense([0.8786,
-0.7855]),Vectors.dense([-0.1863, 0.7799])) exception is like

'numpy.ndarray' object has no attribute '_get_object_id'

Regards,



Meethu Mathew

Engineer

Flytxt

www.flytxt.com | Visit our blog  |  Follow us | Connect on Linkedin



On Friday 09 January 2015 11:37 PM, Davies Liu wrote:

Hey Meethu,

The Java API accepts only Vector, so you should convert the numpy array into
pyspark.mllib.linalg.DenseVector.

BTW, which class are you using? the KMeansModel.predict() accept
numpy.array,
it will do the conversion for you.

Davies

On Fri, Jan 9, 2015 at 4:45 AM, Meethu Mathew meethu.mat...@flytxt.com
wrote:

Hi,
I am trying to send a numpy array as an argument to a function predict() in
a class in spark/python/pyspark/mllib/clustering.py which is passed to the
function callMLlibFunc(name, *args)  in
spark/python/pyspark/mllib/common.py.

Now the value is passed to the function  _py2java(sc, obj) .Here I am
getting an exception

Py4JJavaError: An error occurred while calling
z:org.apache.spark.mllib.api.python.SerDe.loads.
: net.razorvine.pickle.PickleException: expected zero arguments for
construction of ClassDict (for numpy.core.multiarray._reconstruct)
 at
net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
 at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
 at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
 at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
 at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)


Why common._py2java(sc, obj) is not handling numpy array type?

Please help..


--

Regards,

*Meethu Mathew*

*Engineer*

*Flytxt*

www.flytxt.com | Visit our blog http://blog.flytxt.com/ | Follow us
http://www.twitter.com/flytxt | _Connect on Linkedin
http://www.linkedin.com/home?trk=hb_tab_home_top_







Re: Python to Java object conversion of numpy array

2015-01-11 Thread Meethu Mathew

Hi,

This is the code I am running.

mu = (Vectors.dense([0.8786, -0.7855]),Vectors.dense([-0.1863, 0.7799]))

membershipMatrix = callMLlibFunc("findPredict",
rdd.map(_convert_to_vector), mu)


Regards,
Meethu
On Monday 12 January 2015 11:46 AM, Davies Liu wrote:

Could you post a piece of code here?

On Sun, Jan 11, 2015 at 9:28 PM, Meethu Mathew meethu.mat...@flytxt.com wrote:

Hi,
Thanks Davies .

I added a new class GaussianMixtureModel in clustering.py and the method
predict in it and trying to pass numpy array from this method.I converted it
to DenseVector and its solved now.

Similarly I tried passing a List  of more than one dimension to the function
_py2java , but now the exception is

'list' object has no attribute '_get_object_id'

and when I give a tuple input (Vectors.dense([0.8786,
-0.7855]),Vectors.dense([-0.1863, 0.7799])) exception is like

'numpy.ndarray' object has no attribute '_get_object_id'

Regards,



Meethu Mathew

Engineer

Flytxt

www.flytxt.com | Visit our blog  |  Follow us | Connect on Linkedin



On Friday 09 January 2015 11:37 PM, Davies Liu wrote:

Hey Meethu,

The Java API accepts only Vector, so you should convert the numpy array into
pyspark.mllib.linalg.DenseVector.

BTW, which class are you using? the KMeansModel.predict() accept
numpy.array,
it will do the conversion for you.

Davies

On Fri, Jan 9, 2015 at 4:45 AM, Meethu Mathew meethu.mat...@flytxt.com
wrote:

Hi,
I am trying to send a numpy array as an argument to a function predict() in
a class in spark/python/pyspark/mllib/clustering.py which is passed to the
function callMLlibFunc(name, *args)  in
spark/python/pyspark/mllib/common.py.

Now the value is passed to the function  _py2java(sc, obj) .Here I am
getting an exception

Py4JJavaError: An error occurred while calling
z:org.apache.spark.mllib.api.python.SerDe.loads.
: net.razorvine.pickle.PickleException: expected zero arguments for
construction of ClassDict (for numpy.core.multiarray._reconstruct)
 at
net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
 at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
 at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
 at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
 at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)


Why common._py2java(sc, obj) is not handling numpy array type?

Please help..


--

Regards,

*Meethu Mathew*

*Engineer*

*Flytxt*

www.flytxt.com | Visit our blog http://blog.flytxt.com/ | Follow us
http://www.twitter.com/flytxt | _Connect on Linkedin
http://www.linkedin.com/home?trk=hb_tab_home_top_






Re: Python to Java object conversion of numpy array

2015-01-11 Thread Meethu Mathew

Hi,
Thanks Davies .

I added a new class GaussianMixtureModel in clustering.py with a predict
method, and I was trying to pass a numpy array from this method. I converted
it to a DenseVector and it is solved now.
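[Editor's note: for reference, a minimal sketch of that numpy-to-DenseVector conversion; the values are illustrative.]

import numpy as np
from pyspark.mllib.linalg import Vectors

arr = np.array([0.8786, -0.7855])
vec = Vectors.dense(arr)   # DenseVector, which _py2java can serialize to a Java Vector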


Similarly, I tried passing a list of more than one dimension to the
function _py2java, but now the exception is

'list' object has no attribute '_get_object_id'

and when I give a tuple input (Vectors.dense([0.8786, -0.7855]),
Vectors.dense([-0.1863, 0.7799])) the exception is

'numpy.ndarray' object has no attribute '_get_object_id'

Regards,

*Meethu Mathew*

*Engineer*

*Flytxt*

www.flytxt.com | Visit our blog http://blog.flytxt.com/ | Follow us 
http://www.twitter.com/flytxt | _Connect on Linkedin 
http://www.linkedin.com/home?trk=hb_tab_home_top_


On Friday 09 January 2015 11:37 PM, Davies Liu wrote:

Hey Meethu,

The Java API accepts only Vector, so you should convert the numpy array into
pyspark.mllib.linalg.DenseVector.

BTW, which class are you using? the KMeansModel.predict() accept numpy.array,
it will do the conversion for you.

Davies

On Fri, Jan 9, 2015 at 4:45 AM, Meethu Mathew meethu.mat...@flytxt.com wrote:

Hi,
I am trying to send a numpy array as an argument to a function predict() in
a class in spark/python/pyspark/mllib/clustering.py which is passed to the
function callMLlibFunc(name, *args)  in
spark/python/pyspark/mllib/common.py.

Now the value is passed to the function  _py2java(sc, obj) .Here I am
getting an exception

Py4JJavaError: An error occurred while calling
z:org.apache.spark.mllib.api.python.SerDe.loads.
: net.razorvine.pickle.PickleException: expected zero arguments for
construction of ClassDict (for numpy.core.multiarray._reconstruct)
 at
net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
 at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
 at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
 at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
 at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)


Why common._py2java(sc, obj) is not handling numpy array type?

Please help..


--

Regards,

*Meethu Mathew*

*Engineer*

*Flytxt*

www.flytxt.com | Visit our blog http://blog.flytxt.com/ | Follow us
http://www.twitter.com/flytxt | _Connect on Linkedin
http://www.linkedin.com/home?trk=hb_tab_home_top_





Re: Problems concerning implementing machine learning algorithm from scratch based on Spark

2014-12-30 Thread MEETHU MATHEW
Hi,
The GMMSpark.py you mentioned is the old one. The new code has been added to
spark-packages and is available at http://spark-packages.org/package/11 .
Have a look at the new code.
We have used NumPy functions in our code and didn't notice any slowdown
because of this.

Thanks & Regards,
Meethu M
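[Editor's note: as a rough illustration of the NumPy-inside-Spark pattern referred to above, a sketch only; `sc` is an existing SparkContext and the helper name is made up.]

import numpy as np

def partition_mean(iterator):
    # Collect the partition into a NumPy array and do the vectorised work locally.
    rows = np.array(list(iterator))
    return [] if rows.size == 0 else [rows.mean(axis=0)]

rdd = sc.parallelize([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], 2)
partial_means = rdd.mapPartitions(partition_mean).collect()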

 On Tuesday, 30 December 2014 11:50 AM, danqing0703 
danqing0...@berkeley.edu wrote:
   

 Hi all,

I am trying to use some machine learning algorithms that are not included
in the Mllib. Like Mixture Model and LDA(Latent Dirichlet Allocation), and
I am using pyspark and Spark SQL.

My problem is: I have some scripts that implement these algorithms, but I
am not sure which part I shall change to make it fit into Big Data.

  - Like some very simple calculation may take much time if data is too
  big,but also constructing RDD or SQLContext table takes too much time. I am
  really not sure if I shall use map(), reduce() every time I need to make
  calculation.
  - Also, there are some matrix/array level calculation that can not be
  implemented easily merely using map(),reduce(), thus functions of the Numpy
  package shall be used. I am not sure when data is too big, and we simply
  use the numpy functions. Will it take too much time?

I have found some scripts that are not from Mllib and was created by other
developers(credits to Meethu Mathew from Flytxt, thanks for giving me
insights!:))

Many thanks and look forward to getting feedbacks!

Best, Danqing


GMMSpark.py (7K) 
http://apache-spark-developers-list.1001551.n3.nabble.com/attachment/9964/0/GMMSpark.py





   

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-13 Thread Meethu Mathew

Hi Ashutosh,

Please edit the README file. I think the following function call has
changed now.

model = OutlierWithAVFModel.outliers(master: String, input dir: String, percentage: Double)

Regards,

*Meethu Mathew*

*Engineer*

*Flytxt*

_http://www.linkedin.com/home?trk=hb_tab_home_top_

On Friday 14 November 2014 12:01 AM, Ashutosh wrote:

Hi Anant,

Please see the changes.

https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala


I have changed the input format to Vector of String. I think we can also make 
it generic.


Line 59  72 : that counter will not affect in parallelism, Since it only work 
on one datapoint. It  only does the Indexing of the column.


Rest all side effects have been removed.

​

Thanks,

Ashutosh





From: slcclimber [via Apache Spark Developers List] 
ml-node+s1001551n9287...@n3.nabble.com
Sent: Tuesday, November 11, 2014 11:46 PM
To: Ashutosh Trivedi (MT2013030)
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection


Mayur,
Libsvm format sounds good to me. I could work on writing the tests if that 
helps you?
Anant

On Nov 11, 2014 11:06 AM, Ashutosh [via Apache Spark Developers List] [hidden 
email]/user/SendEmail.jtp?type=nodenode=9287i=0 wrote:

Hi Mayur,

Vector data types are implemented using breeze library, it is presented at

.../org/apache/spark/mllib/linalg


Anant,

One restriction I found that a vector can only be of 'Double', so it actually 
restrict the user.

What are you thoughts on LibSVM format?

Thanks for the comments, I was just trying to get away from those increment 
/decrement functions, they look ugly. Points are noted. I'll try to fix them 
soon. Tests are also required for the code.


Regards,

Ashutosh



From: Mayur Rustagi [via Apache Spark Developers List] ml-node+[hidden 
email]http://user/SendEmail.jtp?type=nodenode=9286i=0
Sent: Saturday, November 8, 2014 12:52 PM
To: Ashutosh Trivedi (MT2013030)
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection


We should take a vector instead giving the user flexibility to decide
data source/ type

What do you mean by vector datatype exactly?

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi


On Wed, Nov 5, 2014 at 6:45 AM, slcclimber [hidden 
email]http://user/SendEmail.jtp?type=nodenode=9239i=0 wrote:


Ashutosh,
I still see a few issues.
1. On line 112 you are counting using a counter. Since this will happen in
a RDD the counter will cause issues. Also that is not good functional style
to use a filter function with a side effect.
You could use randomSplit instead. This does not the same thing without the
side effect.
2. Similar shared usage of j in line 102 is going to be an issue as well.
also hash seed does not need to be sequential it could be randomly
generated or hashed on the values.
3. The compute function and trim scores still runs on a comma separeated
RDD. We should take a vector instead giving the user flexibility to decide
data source/ type. what if we want data from hive tables or parquet or JSON
or avro formats. This is a very restrictive format. With vectors the user
has the choice of taking in whatever data format and converting them to
vectors insteda of reading json files creating a csv file and then workig
on that.
4. Similar use of counters in 54 and 65 is an issue.
Basically the shared state counters is a huge issue that does not scale.
Since the processing of RDD's is distributed and the value j lives on the
master.

Anant



On Tue, Nov 4, 2014 at 7:22 AM, Ashutosh [via Apache Spark Developers List]
[hidden email]http://user/SendEmail.jtp?type=nodenode=9239i=1 wrote:


  Anant,

I got rid of those increment/ decrements functions and now code is much
cleaner. Please check. All your comments have been looked after.




https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala


  _Ashu



https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala

   Outlier-Detection-with-AVF-Spark/OutlierWithAVFModel.scala at master ·
codeAshu/Outlier-Detection-with-AVF-Spark · GitHub
  Contribute to Outlier-Detection-with-AVF-Spark development by creating

an

account on GitHub.
  Read more...


https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala


  --
*From:* slcclimber [via Apache Spark Developers List] ml-node+[hidden
email] http://user/SendEmail.jtp?type=nodenode=9083i=0
*Sent:* Friday, October 31, 2014 10:09 AM
*To:* Ashutosh Trivedi (MT2013030)
*Subject:* Re: [MLlib] Contributing Algorithm for Outlier Detection


You should create a jira ticket to go with it as well.
Thanks
On Oct 30, 2014 10:38 PM, Ashutosh [via Apache Spark

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-13 Thread Meethu Mathew


Hi,

I have a doubt regarding the input to your algorithm.

val model = OutlierWithAVFModel.outliers(data: RDD[Vector[String]], percent: Double, sc: SparkContext)

Here our input data is an RDD[Vector[String]]. How can we create this RDD
from a file? sc.textFile will simply give us an RDD of lines; how do we turn
each record into a Vector[String]?

Could you please share a code snippet of this conversion if you have one.

Regards,
Meethu Mathew

On Friday 14 November 2014 10:02 AM, Meethu Mathew wrote:

Hi Ashutosh,

Please edit the README file.I think the following function call is
changed now.

|model = OutlierWithAVFModel.outliers(master:String, input dir:String , 
percentage:Double||)
|

Regards,

*Meethu Mathew*

*Engineer*

*Flytxt*

_http://www.linkedin.com/home?trk=hb_tab_home_top_

On Friday 14 November 2014 12:01 AM, Ashutosh wrote:

Hi Anant,

Please see the changes.

https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala


I have changed the input format to Vector of String. I think we can also make 
it generic.


Line 59  72 : that counter will not affect in parallelism, Since it only work 
on one datapoint. It  only does the Indexing of the column.


Rest all side effects have been removed.

​

Thanks,

Ashutosh





From: slcclimber [via Apache Spark Developers List] 
ml-node+s1001551n9287...@n3.nabble.com
Sent: Tuesday, November 11, 2014 11:46 PM
To: Ashutosh Trivedi (MT2013030)
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection


Mayur,
Libsvm format sounds good to me. I could work on writing the tests if that 
helps you?
Anant

On Nov 11, 2014 11:06 AM, Ashutosh [via Apache Spark Developers List] [hidden 
email]/user/SendEmail.jtp?type=nodenode=9287i=0 wrote:

Hi Mayur,

Vector data types are implemented using breeze library, it is presented at

.../org/apache/spark/mllib/linalg


Anant,

One restriction I found that a vector can only be of 'Double', so it actually 
restrict the user.

What are you thoughts on LibSVM format?

Thanks for the comments, I was just trying to get away from those increment 
/decrement functions, they look ugly. Points are noted. I'll try to fix them 
soon. Tests are also required for the code.


Regards,

Ashutosh



From: Mayur Rustagi [via Apache Spark Developers List] ml-node+[hidden 
email]http://user/SendEmail.jtp?type=nodenode=9286i=0
Sent: Saturday, November 8, 2014 12:52 PM
To: Ashutosh Trivedi (MT2013030)
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection


We should take a vector instead giving the user flexibility to decide
data source/ type

What do you mean by vector datatype exactly?

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi


On Wed, Nov 5, 2014 at 6:45 AM, slcclimber [hidden 
email]http://user/SendEmail.jtp?type=nodenode=9239i=0 wrote:


Ashutosh,
I still see a few issues.
1. On line 112 you are counting using a counter. Since this will happen in
a RDD the counter will cause issues. Also that is not good functional style
to use a filter function with a side effect.
You could use randomSplit instead. This does not the same thing without the
side effect.
2. Similar shared usage of j in line 102 is going to be an issue as well.
also hash seed does not need to be sequential it could be randomly
generated or hashed on the values.
3. The compute function and trim scores still runs on a comma separeated
RDD. We should take a vector instead giving the user flexibility to decide
data source/ type. what if we want data from hive tables or parquet or JSON
or avro formats. This is a very restrictive format. With vectors the user
has the choice of taking in whatever data format and converting them to
vectors insteda of reading json files creating a csv file and then workig
on that.
4. Similar use of counters in 54 and 65 is an issue.
Basically the shared state counters is a huge issue that does not scale.
Since the processing of RDD's is distributed and the value j lives on the
master.

Anant



On Tue, Nov 4, 2014 at 7:22 AM, Ashutosh [via Apache Spark Developers List]
[hidden email]http://user/SendEmail.jtp?type=nodenode=9239i=1 wrote:


   Anant,

I got rid of those increment/ decrements functions and now code is much
cleaner. Please check. All your comments have been looked after.




https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala

   _Ashu



https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala

Outlier-Detection-with-AVF-Spark/OutlierWithAVFModel.scala at master ·
codeAshu/Outlier-Detection-with-AVF-Spark · GitHub
   Contribute to Outlier-Detection-with-AVF-Spark development by creating

an

account on GitHub

Re: Gaussian Mixture Model clustering

2014-09-21 Thread Meethu Mathew

Hi Evan,
Sorry that I forgot to mention it. I set the value of K to 10 for the
benchmark study.


On Friday 19 September 2014 11:24 PM, Evan R. Sparks wrote:
Hey Meethu - what are you setting K to in the benchmarks you show? 
This can greatly affect the runtime.


On Thu, Sep 18, 2014 at 10:38 PM, Meethu Mathew 
meethu.mat...@flytxt.com mailto:meethu.mat...@flytxt.com wrote:


Hi all,
Please find attached the image of benchmark results. The table in
the previous mail got messed up. Thanks.



On Friday 19 September 2014 10:55 AM, Meethu Mathew wrote:

Hi all,

We have come up with an initial distributed implementation of Gaussian
Mixture Model in pyspark where the parameters are estimated using the
Expectation-Maximization algorithm.Our current implementation considers
diagonal covariance matrix for each component.
We did an initial benchmark study on a 2 node Spark standalone cluster
setup where each node config is 8 Cores,8 GB RAM, the spark version used
is 1.0.0. We also evaluated python version of k-means available in spark
on the same datasets.Below are the results from this benchmark study.
The reported stats are average from 10 runs.Tests were done on multiple
datasets with varying number of features and instances.


Dataset                      Gaussian mixture model                        K-means (Python)
Instances     Dimensions     Avg time/iteration  Time for 100 iterations   Avg time/iteration  Time for 100 iterations
0.7 million   13             7 s                 12 min                    13 s                26 min
1.8 million   11             17 s                29 min                    33 s                53 min
10 million    16             1.6 min             2.7 hr                    1.2 min             2 hr


We are interested in contributing this implementation as a patch to
SPARK. Does MLLib accept python implementations? If not, can we
contribute to the pyspark component
I have created a JIRA for the same
https://issues.apache.org/jira/browse/SPARK-3588  .How do I get the
ticket assigned to myself?

Please review and suggest how to take this forward.









--

Regards,

*Meethu Mathew*

*Engineer*

*Flytxt*

www.flytxt.com | Visit our blog http://blog.flytxt.com/ | Follow us 
http://www.twitter.com/flytxt | _Connect on Linkedin 
http://www.linkedin.com/home?trk=hb_tab_home_top_




Gaussian Mixture Model clustering

2014-09-18 Thread Meethu Mathew

Hi all,

We have come up with an initial distributed implementation of Gaussian
Mixture Model in PySpark where the parameters are estimated using the
Expectation-Maximization algorithm. Our current implementation considers a
diagonal covariance matrix for each component.
We did an initial benchmark study on a 2-node Spark standalone cluster setup
where each node has 8 cores and 8 GB RAM; the Spark version used is 1.0.0.
We also evaluated the Python version of k-means available in Spark on the
same datasets. Below are the results from this benchmark study.
The reported stats are averages from 10 runs. Tests were done on multiple
datasets with varying numbers of features and instances.



Dataset                      Gaussian mixture model                        K-means (Python)
Instances     Dimensions     Avg time/iteration  Time for 100 iterations   Avg time/iteration  Time for 100 iterations
0.7 million   13             7 s                 12 min                    13 s                26 min
1.8 million   11             17 s                29 min                    33 s                53 min
10 million    16             1.6 min             2.7 hr                    1.2 min             2 hr


We are interested in contributing this implementation as a patch to Spark.
Does MLlib accept Python implementations? If not, can we contribute to the
PySpark component?
I have created a JIRA for the same:
https://issues.apache.org/jira/browse/SPARK-3588 . How do I get the ticket
assigned to myself?


Please review and suggest how to take this forward.



--

Regards,


*Meethu Mathew*

*Engineer*

*Flytxt*

F: +91 471.2700202

www.flytxt.com | Visit our blog http://blog.flytxt.com/ | Follow us 
http://www.twitter.com/flytxt | _Connect on Linkedin 
http://www.linkedin.com/home?trk=hb_tab_home_top_




Contribution to MLlib

2014-07-09 Thread MEETHU MATHEW
Hi,

I am interested in contributing a clustering algorithm towards MLlib of
Spark. I am focusing on the Gaussian Mixture Model.
But I saw a JIRA at https://spark-project.atlassian.net/browse/SPARK-952
regarding the same. I would like to know whether the Gaussian Mixture Model
is already implemented or not.


Thanks & Regards,
Meethu M

Contributions to MLlib

2014-05-22 Thread MEETHU MATHEW
Hi,


I would like to make some contributions towards MLlib. I have a few concerns
regarding the same.

1. Is there any reason for implementing the algorithms supported by MLlib in
Scala?
2. Will you accept contributions that are done in Python or Java?

Thanks,
Meethu M