Surprised I haven't gotten any responses about this. Has anyone tried using
rJava or FastR w/ Spark? I've seen the SparkR project but thta goes the
other way- what I'd like to do is use R for model calculation and Spark to
distribute the load across the cluster.

Also, has anyone used Scalation for ARIMA models?

On Mon, Mar 30, 2015 at 9:30 AM, Corey Nolet <cjno...@gmail.com> wrote:

> Taking out the complexity of the ARIMA models to simplify things- I can't
> seem to find a good way to represent even standard moving averages in spark
> streaming. Perhaps it's my ignorance with the micro-batched style of the
> DStreams API.
>
> On Fri, Mar 27, 2015 at 9:13 PM, Corey Nolet <cjno...@gmail.com> wrote:
>
>> I want to use ARIMA for a predictive model so that I can take time series
>> data (metrics) and perform a light anomaly detection. The time series data
>> is going to be bucketed to different time units (several minutes within
>> several hours, several hours within several days, several days within
>> several years.
>>
>> I want to do the algorithm in Spark Streaming. I'm used to "tuple at a
>> time" streaming and I'm having a tad bit of trouble gaining insight into
>> how exactly the windows are managed inside of DStreams.
>>
>> Let's say I have a simple dataset that is marked by a key/value tuple
>> where the key is the name of the component who's metrics I want to run the
>> algorithm against and the value is a metric (a value representing a sum for
>> the time bucket. I want to create histograms of the time series data for
>> each key in the windows in which they reside so I can use that histogram
>> vector to generate my ARIMA prediction (actually, it seems like this
>> doesn't just apply to ARIMA but could apply to any sliding average).
>>
>> I *think* my prediction code may look something like this:
>>
>> val predictionAverages = dstream
>>   .groupByKeyAndWindow(60*60*24, 60*60*24)
>>   .mapValues(applyARIMAFunction)
>>
>> That is, keep 24 hours worth of metrics in each window and use that for
>> the ARIMA prediction. The part I'm struggling with is how to join together
>> the actual values so that i can do my comparison against the prediction
>> model.
>>
>> Let's say dstream contains the actual values. For any time  window, I
>> should be able to take a previous set of windows and use model to compare
>> against the current values.
>>
>>
>>
>

Reply via email to