Re: [MLlib] Contributing Algorithm for Outlier Detection

Meethu Mathew Thu, 13 Nov 2014 22:13:18 -0800

Hi,


I have a doubt regarding the input to your algorithm.

val model = OutlierWithAVFModel.outliers(data: RDD[Vector[String]], percent: Double, sc: SparkContext)

Here the input data is an RDD[Vector[String]]. How can we create this RDD from a file? sc.textFile simply gives us an RDD[String]; how do we turn it into an RDD[Vector[String]]?



Could you please share a code snippet for this conversion if you have one?
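
A minimal sketch of one possible conversion, assuming a comma-separated input file. The delimiter and the helper name parseLine are assumptions for illustration and are not taken from the OutlierWithAVFModel code; Vector here is Scala's immutable Vector. The example runs on a plain Scala collection so that it works without a cluster; the Spark call is shown only as a comment.

```scala
object LineParsing {
  // Hypothetical helper: split one comma-separated line into a Vector[String].
  // The -1 limit keeps trailing empty fields, so every row keeps its full width.
  def parseLine(line: String): Vector[String] =
    line.split(",", -1).map(_.trim).toVector

  def main(args: Array[String]): Unit = {
    val lines = Seq("red,small,round", "blue,large,square")
    val rows: Seq[Vector[String]] = lines.map(parseLine)
    rows.foreach(println)
    // With a SparkContext, the same function can be mapped over a text file:
    //   val data = sc.textFile(path).map(LineParsing.parseLine)  // RDD[Vector[String]]
  }
}
```

Keeping the parsing in a pure function means it can be unit-tested without Spark and reused unchanged inside the map over sc.textFile.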


Regards,
Meethu Mathew

On Friday 14 November 2014 10:02 AM, Meethu Mathew wrote:

Hi Ashutosh,

Please edit the README file. I think the following function call has
changed now.

model = OutlierWithAVFModel.outliers(master: String, input dir: String, percentage: Double)

Regards,

*Meethu Mathew*

*Engineer*

*Flytxt*


On Friday 14 November 2014 12:01 AM, Ashutosh wrote:

Hi Anant,

Please see the changes.

https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala


I have changed the input format to Vector of String. I think we can also make 
it generic.


Lines 59 and 72: that counter will not affect parallelism, since it only works
on one data point. It only does the indexing of the columns.


All other side effects have been removed.



Thanks,

Ashutosh




________________________________
From: slcclimber [via Apache Spark Developers List]
Sent: Tuesday, November 11, 2014 11:46 PM
To: Ashutosh Trivedi (MT2013030)
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection


Mayur,
Libsvm format sounds good to me. I could work on writing the tests if that 
helps you?
Anant

On Nov 11, 2014 11:06 AM, "Ashutosh [via Apache Spark Developers List]" wrote:

Hi Mayur,

Vector data types are implemented using the Breeze library; the code is located at

.../org/apache/spark/mllib/linalg


Anant,

One restriction I found is that a vector can only hold 'Double' values, so it
actually restricts the user.

What are your thoughts on the LibSVM format?
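
For reference, a small sketch of what a LibSVM-style line looks like when parsed: a label followed by index:value pairs. The object and function names are hypothetical and only illustrate the text format; in MLlib itself, MLUtils.loadLibSVMFile reads such files. Values in this format are numeric, so string attributes would presumably need to be encoded as numbers first.

```scala
object LibSvmLine {
  // Parse one LibSVM-style line: "<label> <index>:<value> <index>:<value> ..."
  // Returns the label and the (index, value) pairs exactly as written in the line.
  def parse(line: String): (Double, List[(Int, Double)]) = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val features = tokens.tail.toList.map { token =>
      val Array(index, value) = token.split(":")
      (index.toInt, value.toDouble)
    }
    (label, features)
  }
}
```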

Thanks for the comments. I was just trying to get away from those
increment/decrement functions; they look ugly. Points are noted. I'll try to
fix them soon. Tests are also required for the code.


Regards,

Ashutosh


________________________________
From: Mayur Rustagi [via Apache Spark Developers List]
Sent: Saturday, November 8, 2014 12:52 PM
To: Ashutosh Trivedi (MT2013030)
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection

We should take a vector instead, giving the user flexibility to decide
the data source/type

What do you mean by vector datatype exactly?

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>


On Wed, Nov 5, 2014 at 6:45 AM, slcclimber wrote:

Ashutosh,
I still see a few issues.
1. On line 112 you are counting using a counter. Since this will happen in
an RDD, the counter will cause issues. Also, it is not good functional style
to use a filter function with a side effect.
You could use randomSplit instead. This does the same thing without the
side effect.
2. Similar shared usage of j on line 102 is going to be an issue as well.
Also, the hash seed does not need to be sequential; it could be randomly
generated or hashed on the values.
3. The compute function and trim scores still run on a comma-separated
RDD. We should take a vector instead, giving the user flexibility to decide
the data source/type. What if we want data from Hive tables, or Parquet, JSON,
or Avro formats? This is a very restrictive format. With vectors, the user
has the choice of taking in whatever data format and converting it to
vectors, instead of reading JSON files, creating a CSV file, and then working
on that.
4. Similar use of counters on lines 54 and 65 is an issue.
Basically, the shared-state counters are a huge issue that does not scale,
since the processing of RDDs is distributed and the value j lives on the
master.
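
A hedged sketch of the side-effect-free style suggested above, using plain Scala collections so it runs without a cluster. The object and function names are hypothetical and are not taken from OutlierWithAVFModel.scala. The idea is to derive positions with zipWithIndex rather than incrementing a variable captured by a closure.

```scala
object IndexWithoutCounter {
  // Pair each column value with its position instead of incrementing a shared var.
  def indexColumns(row: Vector[String]): Vector[(Int, String)] =
    row.zipWithIndex.map { case (value, j) => (j, value) }

  // Keep the first n elements by index rather than counting inside a filter.
  def takeByIndex[A](items: Seq[A], n: Int): Seq[A] =
    items.zipWithIndex.collect { case (item, i) if i < n => item }

  // On an RDD the same shape is available without closure state:
  //   rdd.zipWithIndex()                  // RDD[(T, Long)]
  //   rdd.randomSplit(Array(0.9, 0.1))    // Array[RDD[T]], approximate split
}
```

Because nothing here mutates state outside the function, each partition can be processed independently, which is the property the shared counter breaks.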

Anant



On Tue, Nov 4, 2014 at 7:22 AM, Ashutosh [via Apache Spark Developers List] wrote:

   Anant,

I got rid of those increment/decrement functions and now the code is much
cleaner. Please check. All your comments have been addressed.

https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala

   _Ashu


   ------------------------------
*From:* slcclimber [via Apache Spark Developers List]
*Sent:* Friday, October 31, 2014 10:09 AM
*To:* Ashutosh Trivedi (MT2013030)
*Subject:* Re: [MLlib] Contributing Algorithm for Outlier Detection


You should create a JIRA ticket to go with it as well.
Thanks
On Oct 30, 2014 10:38 PM, "Ashutosh [via Apache Spark Developers List]" wrote:

   Okay. I'll try it and post it soon with a test case. After that I think
we can go ahead with the PR.
   ------------------------------
*From:* slcclimber [via Apache Spark Developers List]
*Sent:* Friday, October 31, 2014 10:03 AM
*To:* Ashutosh Trivedi (MT2013030)
*Subject:* Re: [MLlib] Contributing Algorithm for Outlier Detection


Ashutosh,
A vector would be a good idea; vectors are used very frequently.
Test data is usually stored in the spark/data/mllib folder.
   On Oct 30, 2014 10:31 PM, "Ashutosh [via Apache Spark Developers List]" wrote:

Hi Anant,
Sorry for my late reply. Thank you for taking the time to review it.

I have a few comments on the first issue.

You are correct on the string (CSV) part. But we cannot take input of the
type you mentioned. We calculate the frequencies in our function; otherwise
the user has to do all this computation. I realize that taking an
RDD[Vector] would be general enough for all. What do you say?
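
To make the frequency step concrete, here is a rough sketch that assumes the usual Attribute Value Frequency (AVF) definition: a row's score is the mean of the frequencies of its attribute values, and the lowest-scoring rows are the outlier candidates. It is written against plain Scala collections with hypothetical names and is not the code from the repository.

```scala
object AvfSketch {
  // Count how often each (column index, value) pair occurs across all rows.
  def frequencies(rows: Seq[Vector[String]]): Map[(Int, String), Int] =
    rows
      .flatMap(row => row.zipWithIndex.map { case (value, j) => (j, value) })
      .groupBy(identity)
      .map { case (key, hits) => key -> hits.size }

  // AVF score of a row: the mean frequency of its attribute values.
  // Rows are assumed to be non-empty.
  def scores(rows: Seq[Vector[String]]): Seq[Double] = {
    val freq = frequencies(rows)
    rows.map { row =>
      val total = row.zipWithIndex.map { case (value, j) => freq((j, value)) }.sum
      total.toDouble / row.length
    }
  }
}
```

On an RDD, the counting step would presumably become a flatMap to ((column, value), 1) pairs followed by reduceByKey, which keeps the computation free of shared counters.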

I agree on all the remaining issues. I will correct them soon and post the
changes. I have a question about test cases. Where should I put the data
when providing test scripts? Or should I generate synthetic data for testing
within the scripts? How does this work?

Regards,
Ashutosh

------------------------------
   If you reply to this email, your message will be added to the
discussion below:

http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9034.html



--
View this message in context:
http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9095.html
Sent from the Apache Spark Developers List mailing list archive at
Nabble.com.
