Re: Streaming anomaly detection using ARIMA

2015-04-10 Thread Corey Nolet
Sean,

I do agree about the inside-out parallelization, but my curiosity is
mostly about what kind of performance I can expect when piping out to R.
I'm playing with Twitter's new Anomaly Detection library, by the way; it
could be a solution if the calls to R can stand up to the massive dataset
that I have.

I'll report back my findings.

On Thu, Apr 2, 2015 at 3:46 AM, Sean Owen so...@cloudera.com wrote:

 This inside-out parallelization has long been a way people use R
 with MapReduce: run N copies of an R script on the cluster, on
 different subsets of the data, babysat by mappers. You just need R
 installed on the cluster. Hadoop Streaming makes this easy, and things
 like RDD.pipe in Spark make it easier.

 So it may be just that simple, in which case there's not much to say
 about it. I haven't tried this with Spark Streaming, but I imagine it
 would also work. Have you tried it?

 Within a window you would probably take the first x% as training and
 the rest as test. I don't think there's a question of looking across
 windows.

 On Thu, Apr 2, 2015 at 12:31 AM, Corey Nolet cjno...@gmail.com wrote:
  Surprised I haven't gotten any responses about this. Has anyone tried
  using rJava or FastR w/ Spark? I've seen the SparkR project, but that
  goes the other way- what I'd like to do is use R for the model
  calculation and Spark to distribute the load across the cluster.
 
  Also, has anyone used Scalation for ARIMA models?
 
  On Mon, Mar 30, 2015 at 9:30 AM, Corey Nolet cjno...@gmail.com wrote:
 
  Taking out the complexity of the ARIMA models to simplify things- I
  can't seem to find a good way to represent even standard moving
  averages in Spark Streaming. Perhaps it's my ignorance of the
  micro-batched style of the DStreams API.
 
  On Fri, Mar 27, 2015 at 9:13 PM, Corey Nolet cjno...@gmail.com wrote:
 
  I want to use ARIMA for a predictive model so that I can take time
  series data (metrics) and perform light anomaly detection. The time
  series data is going to be bucketed into different time units (several
  minutes within several hours, several hours within several days,
  several days within several years).
 
  I want to do the algorithm in Spark Streaming. I'm used to
  tuple-at-a-time streaming, and I'm having a bit of trouble gaining
  insight into how exactly the windows are managed inside of DStreams.
 
  Let's say I have a simple dataset that is marked by a key/value tuple,
  where the key is the name of the component whose metrics I want to run
  the algorithm against and the value is a metric (a value representing
  a sum for the time bucket). I want to create histograms of the time
  series data for each key in the windows in which they reside, so I can
  use that histogram vector to generate my ARIMA prediction (actually,
  it seems like this doesn't just apply to ARIMA but could apply to any
  sliding average).
 
  I *think* my prediction code may look something like this:
 
  val predictionAverages = dstream
    .groupByKeyAndWindow(Seconds(60*60*24), Seconds(60*60*24))
    .mapValues(applyARIMAFunction)
 
  That is, keep 24 hours' worth of metrics in each window and use that
  for the ARIMA prediction. The part I'm struggling with is how to join
  together the actual values so that I can do my comparison against the
  prediction model.
 
  Let's say dstream contains the actual values. For any time window, I
  should be able to take a previous set of windows and use the model to
  compare against the current values.
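 
  One way this might be lined up (a sketch only; applyARIMAFunction and
  the threshold are placeholders, not working code): recompute the
  windowed prediction every batch so its slide duration matches
  dstream's, since DStream joins require matching slide durations, then
  join by key:
 
  import org.apache.spark.streaming.Seconds
 
  // dstream: DStream[(String, Double)] of (componentName, metricValue)
  // Hypothetical one-step-ahead forecast from the trailing 24h of values.
  def applyARIMAFunction(values: Iterable[Double]): Double = ???
 
  // 24-hour window that slides with the batch interval, so it can be
  // joined against the per-batch dstream.
  val predictions = dstream
    .groupByKeyAndWindow(Seconds(60 * 60 * 24))
    .mapValues(applyARIMAFunction)
 
  val threshold = 3.0 // assumed fixed threshold; tune per metric
  val anomalies = dstream
    .join(predictions) // DStream[(key, (actual, predicted))]
    .filter { case (_, (actual, predicted)) =>
      math.abs(actual - predicted) > threshold
    }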
 
 
 
 



Re: Streaming anomaly detection using ARIMA

2015-04-02 Thread Sean Owen
This inside-out parallelization has long been a way people use R
with MapReduce: run N copies of an R script on the cluster, on
different subsets of the data, babysat by mappers. You just need R
installed on the cluster. Hadoop Streaming makes this easy, and things
like RDD.pipe in Spark make it easier.
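
A minimal sketch of the RDD.pipe route, assuming a hypothetical anomaly.R
script that reads one record per line from stdin and prints one result per
line to stdout, and that R and the script are installed on every worker:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("pipe-to-R"))

// Each partition's lines are streamed through its own copy of the R
// script; the paths here are illustrative.
val scored = sc
  .textFile("hdfs:///metrics/input")
  .pipe("Rscript /opt/scripts/anomaly.R")

scored.saveAsTextFile("hdfs:///metrics/scored")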

So it may be just that simple, in which case there's not much to say
about it. I haven't tried this with Spark Streaming, but I imagine it
would also work. Have you tried it?

Within a window you would probably take the first x% as training and
the rest as test. I don't think there's a question of looking across
windows.
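
A sketch of what that split might look like for one window's worth of
values, assuming they are already ordered by time and an 80/20 split
(both assumptions):

// First 80% of a window's time-ordered values train the model; the
// remaining 20% are held out for testing.
def splitWindow(values: Seq[Double], trainFrac: Double = 0.8): (Seq[Double], Seq[Double]) =
  values.splitAt((values.length * trainFrac).toInt)

val (train, test) = splitWindow(windowValues) // windowValues: Seq[Double]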

On Thu, Apr 2, 2015 at 12:31 AM, Corey Nolet cjno...@gmail.com wrote:
 Surprised I haven't gotten any responses about this. Has anyone tried using
 rJava or FastR w/ Spark? I've seen the SparkR project, but that goes the
 other way- what I'd like to do is use R for the model calculation and Spark
 to distribute the load across the cluster.

 Also, has anyone used Scalation for ARIMA models?

 On Mon, Mar 30, 2015 at 9:30 AM, Corey Nolet cjno...@gmail.com wrote:

 Taking out the complexity of the ARIMA models to simplify things- I can't
 seem to find a good way to represent even standard moving averages in
 Spark Streaming. Perhaps it's my ignorance of the micro-batched style of
 the DStreams API.

 On Fri, Mar 27, 2015 at 9:13 PM, Corey Nolet cjno...@gmail.com wrote:

 I want to use ARIMA for a predictive model so that I can take time series
 data (metrics) and perform light anomaly detection. The time series data
 is going to be bucketed into different time units (several minutes within
 several hours, several hours within several days, several days within
 several years).

 I want to do the algorithm in Spark Streaming. I'm used to tuple-at-a-time
 streaming, and I'm having a bit of trouble gaining insight into how
 exactly the windows are managed inside of DStreams.

 Let's say I have a simple dataset that is marked by a key/value tuple,
 where the key is the name of the component whose metrics I want to run the
 algorithm against and the value is a metric (a value representing a sum
 for the time bucket). I want to create histograms of the time series data
 for each key in the windows in which they reside, so I can use that
 histogram vector to generate my ARIMA prediction (actually, it seems like
 this doesn't just apply to ARIMA but could apply to any sliding average).

 I *think* my prediction code may look something like this:

 val predictionAverages = dstream
   .groupByKeyAndWindow(Seconds(60*60*24), Seconds(60*60*24))
   .mapValues(applyARIMAFunction)

 That is, keep 24 hours' worth of metrics in each window and use that for
 the ARIMA prediction. The part I'm struggling with is how to join together
 the actual values so that I can do my comparison against the prediction
 model.

 Let's say dstream contains the actual values. For any time window, I
 should be able to take a previous set of windows and use the model to
 compare against the current values.








RE: Streaming anomaly detection using ARIMA

2015-04-01 Thread Felix Cheung
I'm curious - I'm not sure if I understand you correctly. With SparkR, the work
is distributed in Spark and computed in R; isn't that what you are looking for?
SparkR was originally built on rJava for the R-to-JVM bridge but has since
moved away from it.
 
rJava has a component called JRI, which allows the JVM to call R.
You could call R with JRI, or through rdd.foreachPartition(pass_data_to_R) or
rdd.pipe.
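
A rough sketch of the foreachPartition route, launching one external Rscript
process per partition and streaming records through it; the script path, the
record format, and the pass_data_to_R-style plumbing are all assumptions:

import scala.sys.process._

// One R process per partition: records go to the script's stdin, one per
// line; R's stdout is echoed back. Paths and formats are illustrative.
rdd.foreachPartition { partition =>
  val io = new ProcessIO(
    in => { partition.foreach(rec => in.write((rec.toString + "\n").getBytes)); in.close() },
    out => scala.io.Source.fromInputStream(out).getLines().foreach(println),
    err => err.close()
  )
  Seq("Rscript", "/opt/scripts/score.R").run(io).exitValue() // wait for R
}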
 
From: cjno...@gmail.com
Date: Wed, 1 Apr 2015 19:31:48 -0400
Subject: Re: Streaming anomaly detection using ARIMA
To: user@spark.apache.org

Surprised I haven't gotten any responses about this. Has anyone tried using
rJava or FastR w/ Spark? I've seen the SparkR project, but that goes the other
way- what I'd like to do is use R for the model calculation and Spark to
distribute the load across the cluster.
Also, has anyone used Scalation for ARIMA models?
On Mon, Mar 30, 2015 at 9:30 AM, Corey Nolet cjno...@gmail.com wrote:
Taking out the complexity of the ARIMA models to simplify things- I can't seem
to find a good way to represent even standard moving averages in Spark
Streaming. Perhaps it's my ignorance of the micro-batched style of the
DStreams API.

On Fri, Mar 27, 2015 at 9:13 PM, Corey Nolet cjno...@gmail.com wrote:
I want to use ARIMA for a predictive model so that I can take time series data
(metrics) and perform light anomaly detection. The time series data is going
to be bucketed into different time units (several minutes within several
hours, several hours within several days, several days within several years).
I want to do the algorithm in Spark Streaming. I'm used to tuple-at-a-time
streaming, and I'm having a bit of trouble gaining insight into how exactly
the windows are managed inside of DStreams.
Let's say I have a simple dataset that is marked by a key/value tuple, where
the key is the name of the component whose metrics I want to run the algorithm
against and the value is a metric (a value representing a sum for the time
bucket). I want to create histograms of the time series data for each key in
the windows in which they reside, so I can use that histogram vector to
generate my ARIMA prediction (actually, it seems like this doesn't just apply
to ARIMA but could apply to any sliding average).
I *think* my prediction code may look something like this:
val predictionAverages = dstream
  .groupByKeyAndWindow(Seconds(60*60*24), Seconds(60*60*24))
  .mapValues(applyARIMAFunction)
That is, keep 24 hours' worth of metrics in each window and use that for the
ARIMA prediction. The part I'm struggling with is how to join together the
actual values so that I can do my comparison against the prediction model.

Let's say dstream contains the actual values. For any time window, I should be
able to take a previous set of windows and use the model to compare against
the current values.


Re: Streaming anomaly detection using ARIMA

2015-04-01 Thread Corey Nolet
Surprised I haven't gotten any responses about this. Has anyone tried using
rJava or FastR w/ Spark? I've seen the SparkR project, but that goes the
other way- what I'd like to do is use R for the model calculation and Spark
to distribute the load across the cluster.

Also, has anyone used Scalation for ARIMA models?

On Mon, Mar 30, 2015 at 9:30 AM, Corey Nolet cjno...@gmail.com wrote:

 Taking out the complexity of the ARIMA models to simplify things- I can't
 seem to find a good way to represent even standard moving averages in
 Spark Streaming. Perhaps it's my ignorance of the micro-batched style of
 the DStreams API.

 On Fri, Mar 27, 2015 at 9:13 PM, Corey Nolet cjno...@gmail.com wrote:

 I want to use ARIMA for a predictive model so that I can take time series
 data (metrics) and perform light anomaly detection. The time series data
 is going to be bucketed into different time units (several minutes within
 several hours, several hours within several days, several days within
 several years).

 I want to do the algorithm in Spark Streaming. I'm used to tuple-at-a-time
 streaming, and I'm having a bit of trouble gaining insight into how
 exactly the windows are managed inside of DStreams.

 Let's say I have a simple dataset that is marked by a key/value tuple,
 where the key is the name of the component whose metrics I want to run the
 algorithm against and the value is a metric (a value representing a sum
 for the time bucket). I want to create histograms of the time series data
 for each key in the windows in which they reside, so I can use that
 histogram vector to generate my ARIMA prediction (actually, it seems like
 this doesn't just apply to ARIMA but could apply to any sliding average).

 I *think* my prediction code may look something like this:

 val predictionAverages = dstream
   .groupByKeyAndWindow(Seconds(60*60*24), Seconds(60*60*24))
   .mapValues(applyARIMAFunction)

 That is, keep 24 hours' worth of metrics in each window and use that for
 the ARIMA prediction. The part I'm struggling with is how to join together
 the actual values so that I can do my comparison against the prediction
 model.

 Let's say dstream contains the actual values. For any time window, I
 should be able to take a previous set of windows and use the model to
 compare against the current values.






Re: Streaming anomaly detection using ARIMA

2015-03-30 Thread Corey Nolet
Taking out the complexity of the ARIMA models to simplify things- I can't
seem to find a good way to represent even standard moving averages in Spark
Streaming. Perhaps it's my ignorance of the micro-batched style of the
DStreams API.
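
A minimal sketch of what a sliding average might look like over a keyed
DStream, tracking a (count, sum) pair per key; the window and slide durations
are illustrative, and the inverse-function form of reduceByKeyAndWindow needs
checkpointing enabled:

import org.apache.spark.streaming.Seconds

// metrics: DStream[(String, Double)] of (componentName, value)
val averages = metrics
  .mapValues(v => (1L, v)) // (count, sum) so the mean can be recovered
  .reduceByKeyAndWindow(
    { case ((c1, s1), (c2, s2)) => (c1 + c2, s1 + s2) }, // add entering batches
    { case ((c1, s1), (c2, s2)) => (c1 - c2, s1 - s2) }, // subtract leaving batches
    Seconds(300), // 5-minute window
    Seconds(60)   // sliding every minute
  )
  .mapValues { case (count, sum) => sum / count }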

On Fri, Mar 27, 2015 at 9:13 PM, Corey Nolet cjno...@gmail.com wrote:

 I want to use ARIMA for a predictive model so that I can take time series
 data (metrics) and perform light anomaly detection. The time series data
 is going to be bucketed into different time units (several minutes within
 several hours, several hours within several days, several days within
 several years).

 I want to do the algorithm in Spark Streaming. I'm used to tuple-at-a-time
 streaming, and I'm having a bit of trouble gaining insight into how
 exactly the windows are managed inside of DStreams.

 Let's say I have a simple dataset that is marked by a key/value tuple,
 where the key is the name of the component whose metrics I want to run the
 algorithm against and the value is a metric (a value representing a sum
 for the time bucket). I want to create histograms of the time series data
 for each key in the windows in which they reside, so I can use that
 histogram vector to generate my ARIMA prediction (actually, it seems like
 this doesn't just apply to ARIMA but could apply to any sliding average).

 I *think* my prediction code may look something like this:

 val predictionAverages = dstream
   .groupByKeyAndWindow(Seconds(60*60*24), Seconds(60*60*24))
   .mapValues(applyARIMAFunction)

 That is, keep 24 hours' worth of metrics in each window and use that for
 the ARIMA prediction. The part I'm struggling with is how to join together
 the actual values so that I can do my comparison against the prediction
 model.

 Let's say dstream contains the actual values. For any time window, I
 should be able to take a previous set of windows and use the model to
 compare against the current values.