Re: [MLlib] Contributing Algorithm for Outlier Detection

Mayur Rustagi Fri, 07 Nov 2014 23:20:07 -0800

>
> We should take a vector instead, giving the user the flexibility to decide
> the data source/type.


What do you mean by vector datatype exactly?
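
One possible reading of the quoted suggestion, sketched here under assumptions: the scoring step accepts already-parsed rows (one sequence of attribute values per record) rather than comma-separated strings, so that parsing from CSV, JSON, or any other source stays with the caller. The function name `avf_scores`, the plain Python tuples standing in for an RDD of vectors, and the averaging form of the Attribute Value Frequency (AVF) score are illustrative assumptions, not code taken from the linked repository.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def avf_scores(rows: Sequence[Sequence[Hashable]]) -> List[float]:
    """Return one AVF score per row (hypothetical helper).

    The score is the mean, over attributes, of how often the row's value
    occurs in that attribute's column. Lower scores mean rarer values,
    so the lowest-scoring rows are the outlier candidates.
    """
    if not rows:
        return []
    width = len(rows[0])
    if width == 0 or any(len(row) != width for row in rows):
        raise ValueError("rows must be non-empty and all of the same length")

    # One frequency table per attribute (column), built in a single pass.
    tables = [Counter() for _ in range(width)]
    for row in rows:
        for table, value in zip(tables, row):
            table[value] += 1

    # Average the frequencies of each row's own values.
    return [
        sum(table[value] for table, value in zip(tables, row)) / width
        for row in rows
    ]


# The caller decides how records become vectors; here they are literals.
rows = [
    (1.0, 0.0, 2.0),
    (1.0, 0.0, 2.0),
    (1.0, 1.0, 2.0),
    (3.0, 5.0, 7.0),  # every value is unique in its column
]
scores = avf_scores(rows)
most_suspicious = min(range(len(rows)), key=lambda i: scores[i])
```

Because the function only sees sequences of values, the same scoring code would apply whether the records were read from text files or from a structured source.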

Mayur Rustagi


On Wed, Nov 5, 2014 at 6:45 AM, slcclimber <anant.a...@gmail.com> wrote:

> Ashutosh,
> I still see a few issues.
> 1. On line 112 you are counting with a counter. Since this happens inside
> an RDD operation, the counter will cause issues. It is also not good
> functional style to use a filter function with a side effect.
> You could use randomSplit instead. It does the same thing without the
> side effect.
> 2. The similar shared usage of j on line 102 is going to be an issue as
> well. Also, the hash seed does not need to be sequential; it could be
> randomly generated or hashed on the values.
> 3. The compute function and trim scores still run on a comma-separated
> RDD. We should take a vector instead, giving the user the flexibility to
> decide the data source/type. What if we want data from Hive tables, or from
> Parquet, JSON, or Avro formats? This is a very restrictive format. With
> vectors, the user can take in whatever data format they have and convert it
> to vectors, instead of reading JSON files, creating a CSV file, and then
> working on that.
> 4. The similar use of counters on lines 54 and 65 is an issue.
> Basically, the shared-state counters are a huge issue that does not scale,
> since the processing of RDDs is distributed and the value j lives on the
> master.
>
> Anant
>
>
>
> On Tue, Nov 4, 2014 at 7:22 AM, Ashutosh [via Apache Spark Developers List]
> <ml-node+s1001551n9083...@n3.nabble.com> wrote:
>
> >  Anant,
> >
> > I got rid of those increment/decrement functions and now the code is much
> > cleaner. Please check. All your comments have been addressed.
> >
> > https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
> >
> >
> >  _Ashu
> >
> >
> >  ------------------------------
> > *From:* slcclimber [via Apache Spark Developers List]
> > *Sent:* Friday, October 31, 2014 10:09 AM
> > *To:* Ashutosh Trivedi (MT2013030)
> > *Subject:* Re: [MLlib] Contributing Algorithm for Outlier Detection
> >
> >
> > You should create a JIRA ticket to go with it as well.
> > Thanks
> > On Oct 30, 2014 10:38 PM, "Ashutosh [via Apache Spark Developers List]"
> > wrote:
> >
> >> Okay. I'll try it and post it soon with a test case. After that I think
> >> we can go ahead with the PR.
> >> ------------------------------
> >> *From:* slcclimber [via Apache Spark Developers List]
> >> *Sent:* Friday, October 31, 2014 10:03 AM
> >> *To:* Ashutosh Trivedi (MT2013030)
> >> *Subject:* Re: [MLlib] Contributing Algorithm for Outlier Detection
> >>
> >>
> >> Ashutosh,
> >> A vector would be a good idea; vectors are used very frequently.
> >> Test data is usually stored in the spark/data/mllib folder.
> >> On Oct 30, 2014 10:31 PM, "Ashutosh [via Apache Spark Developers List]"
> >> wrote:
> >>
> >>> Hi Anant,
> >>> Sorry for my late reply. Thank you for taking the time to review it.
> >>>
> >>> I have a few comments on the first issue.
> >>>
> >>> You are correct on the string (CSV) part. But we cannot take input of
> >>> the type you mentioned. We calculate frequency in our function; otherwise
> >>> the user has to do all this computation. I realize that taking an
> >>> RDD[Vector] would be general enough for all. What do you say?
> >>>
> >>> I agree on all the remaining issues. I will correct them soon and post it.
> >>> I have a question about test cases. Where should I put data when
> >>> providing test scripts? Or should I generate synthetic data for testing
> >>> within the scripts? How does this work?
> >>>
> >>> Regards,
> >>> Ashutosh
> >>>
> >>> ------------------------------
> >>>  If you reply to this email, your message will be added to the
> >>> discussion below:
> >>>
> >>>
> http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9034.html
> >>>  To unsubscribe from [MLlib] Contributing Algorithm for Outlier
> >>> Detection, click here.
> >>> NAML
> >>> <
> http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml
> >
> >>>
> >>
> >>
> >
> >
>
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9095.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
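
The shared-counter concern raised in points 1, 2, and 4 of the review above can be illustrated without Spark. This is a minimal sketch under stated assumptions: plain Python lists stand in for RDD partitions, and `copy.deepcopy` stands in for the way a closure and its captured variables are copied to each worker. The names `SharedCounter`, `count_evens_with_side_effect`, and `count_evens` are hypothetical and do not come from the reviewed code.

```python
import copy
import operator
from functools import reduce
from typing import List


class SharedCounter:
    """Mutable counter, standing in for a variable such as j captured by a closure."""

    def __init__(self) -> None:
        self.value = 0


def count_evens_with_side_effect(partition: List[int], counter: SharedCounter) -> None:
    # Simulated shipping of the closure: the task mutates its own copy,
    # so the caller's counter never sees the increments.
    worker_copy = copy.deepcopy(counter)
    for x in partition:
        if x % 2 == 0:
            worker_copy.value += 1


def count_evens(partition: List[int]) -> int:
    # Side-effect-free version: the task returns its local result.
    return sum(1 for x in partition if x % 2 == 0)


# Three simulated partitions of one dataset.
partitions = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]

# Anti-pattern: rely on a counter that lives with the caller.
shared = SharedCounter()
for part in partitions:
    count_evens_with_side_effect(part, shared)

# Functional alternative: merge per-partition results with an associative operation.
total = reduce(operator.add, map(count_evens, partitions), 0)

print(shared.value)  # the caller's counter is unchanged
print(total)         # the merged per-partition counts
```

In this simulation the caller's counter stays at 0 while the merged count is 5. In Spark itself the same idea is expressed by returning values from tasks and combining them with an aggregation such as `count()` or `reduce()` on the RDD, rather than mutating a captured variable.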
