Sean,

I do agree about the "inside out" parallelization, but my curiosity is
mostly about what kind of performance I can expect when piping out to R.
I'm playing with Twitter's new Anomaly Detection library, by the way; it
could be a solution if the calls to R can stand up to the massive dataset
that I have.
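
For concreteness, here is roughly the shape of the pipe-out I'm picturing.
detect_anomalies.R is a hypothetical wrapper around the AnomalyDetection
package that reads lines from stdin and prints one result per line, and the
paths are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

object PipeMetricsToR {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pipe-metrics-to-R"))

    // One line per observation, e.g. "componentName,timestamp,value"
    val metrics = sc.textFile("hdfs:///metrics/input")        // placeholder path

    // pipe() forks one external process per partition and streams the
    // partition's lines through its stdin/stdout.
    val results = metrics.pipe("Rscript detect_anomalies.R")  // hypothetical R wrapper

    results.saveAsTextFile("hdfs:///metrics/anomalies")       // placeholder path
    sc.stop()
  }
}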

I'll report back my findings.

On Thu, Apr 2, 2015 at 3:46 AM, Sean Owen <so...@cloudera.com> wrote:

> This "inside out" parallelization has been a way people have used R
> with MapReduce for a long time. Run N copies of an R script on the
> cluster, on different subsets of the data, babysat by Mappers. You
> just need R installed on the cluster. Hadoop Streaming makes this easy
> and things like RDD.pipe in Spark make it easier.
>
> So it may be just that simple and so there's not much to say about it.
> I haven't tried this with Spark Streaming but imagine it would also
> work. Have you tried this?
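>
> For what it's worth, a minimal sketch of the Spark Streaming variant of the
> same idea, piping each micro-batch's RDD through a hypothetical R wrapper
> script (metricLines is an assumed DStream[String] of raw metric lines):
>
> val scored = metricLines.transform(rdd => rdd.pipe("Rscript detect_anomalies.R"))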
>
> Within a window you would probably take the first x% as training and
> the rest as test. I don't think there's a question of looking across
> windows.
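>
> As a rough sketch of that split, assuming the window has already been
> grouped by key into (timestamp, value) pairs via groupByKeyAndWindow
> (windowed is an assumed DStream[(String, Iterable[(Long, Double)])]):
>
> val trainTest = windowed.mapValues { points =>
>   val ordered = points.toSeq.sortBy(_._1)          // sort by timestamp
>   val cut = (ordered.size * 0.8).toInt             // first 80% as training
>   (ordered.take(cut), ordered.drop(cut))           // (training, test)
> }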
>
> On Thu, Apr 2, 2015 at 12:31 AM, Corey Nolet <cjno...@gmail.com> wrote:
> > Surprised I haven't gotten any responses about this. Has anyone tried
> > using rJava or FastR w/ Spark? I've seen the SparkR project, but that
> > goes the other way: what I'd like to do is use R for model calculation
> > and Spark to distribute the load across the cluster.
> >
> > Also, has anyone used Scalation for ARIMA models?
> >
> > On Mon, Mar 30, 2015 at 9:30 AM, Corey Nolet <cjno...@gmail.com> wrote:
> >>
> >> Taking out the complexity of the ARIMA models to simplify things: I
> >> can't seem to find a good way to represent even standard moving
> >> averages in Spark Streaming. Perhaps it's my ignorance of the
> >> micro-batched style of the DStreams API.
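> >>
> >> One possible sketch of a windowed moving average, assuming metrics is a
> >> DStream[(String, Double)] of (component, value) pairs (the window and
> >> slide durations are arbitrary):
> >>
> >> import org.apache.spark.streaming.Minutes
> >>
> >> val movingAvg = metrics
> >>   .mapValues(v => (v, 1L))                      // carry (sum, count)
> >>   .reduceByKeyAndWindow(
> >>     (a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2),
> >>     Minutes(60), Minutes(5))                    // 60-minute window, 5-minute slide
> >>   .mapValues { case (sum, count) => sum / count }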
> >>
> >> On Fri, Mar 27, 2015 at 9:13 PM, Corey Nolet <cjno...@gmail.com> wrote:
> >>>
> >>> I want to use ARIMA for a predictive model so that I can take time
> >>> series data (metrics) and perform light anomaly detection. The time
> >>> series data is going to be bucketed into different time units (several
> >>> minutes within several hours, several hours within several days,
> >>> several days within several years).
> >>>
> >>> I want to do the algorithm in Spark Streaming. I'm used to "tuple at a
> >>> time" streaming, and I'm having a bit of trouble gaining insight into
> >>> how exactly the windows are managed inside of DStreams.
> >>>
> >>> Let's say I have a simple dataset of key/value tuples, where the key is
> >>> the name of the component whose metrics I want to run the algorithm
> >>> against and the value is a metric (a value representing a sum for the
> >>> time bucket). I want to create histograms of the time series data for
> >>> each key in the windows in which they reside, so I can use that
> >>> histogram vector to generate my ARIMA prediction (actually, it seems
> >>> like this doesn't just apply to ARIMA but could apply to any sliding
> >>> average).
> >>>
> >>> I *think* my prediction code may look something like this:
> >>>
> >>> val predictionAverages = dstream
> >>>   .groupByKeyAndWindow(Seconds(60*60*24), Seconds(60*60*24))
> >>>   .mapValues(applyARIMAFunction)
> >>>
> >>> That is, keep 24 hours' worth of metrics in each window and use that
> >>> for the ARIMA prediction. The part I'm struggling with is how to join
> >>> together the actual values so that I can do my comparison against the
> >>> prediction model.
> >>>
> >>> Let's say dstream contains the actual values. For any given time
> >>> window, I should be able to take a previous set of windows and use the
> >>> model to compare against the current values.
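> >>>
> >>> A rough sketch of that comparison, assuming applyARIMAFunction yields a
> >>> predicted Double per key and the actuals are summed over the same
> >>> 24-hour window:
> >>>
> >>> val actuals = dstream
> >>>   .reduceByKeyAndWindow((a: Double, b: Double) => a + b, Seconds(60*60*24))
> >>> val residuals = predictionAverages.join(actuals)
> >>>   .mapValues { case (predicted, actual) => actual - predicted }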
> >>>
> >>>
> >>
> >
>
