Hi dev, if there is suitable interest I would like to discuss the
thread below further and explore any further opportunities for
streaming analytics within Flume.
One particular use case I am considering right now is the ability to
capture a continuous (varying-velocity) stream of user-generated
events from different sources and overlay this stream with a
plugin-style dashboard. This dashboard would allow a user to
create/manage an underlying stream and apply various "probes" (so to
speak).
I understand the Storm project is a strong contender at this point in
time for offloading computational tasks for near-real-time feedback.
However, I do not believe that at this stage the Storm project has the
intuitive, dashboard-like pluggable probe system (call it what you
like) that I am interested in.
I would be interested in working with the Flume contributors to
develop such an add-on project.
All thoughts are welcome.
Yours in Flume,
-Steve
Hi Steve,
I can understand the idea of having data processed inside Flume by
streaming it to another Flume agent. But what I am wondering is: do we
really need to re-engineer something inside Flume? The core Flume dev
team may have better ideas on this, but currently Storm is a strong
candidate for streaming data processing.
Flume does have an open JIRA on this integration:
FLUME-1286<https://issues.apache.org/jira/browse/FLUME-1286>
It would be interesting to draw up performance comparisons if the
data-processing logic were added to Flume. We do currently see people
doing a little pre-processing of their data (they have their own
custom channel types where they modify the data before sinking it).
On Fri, Feb 8, 2013 at 8:52 AM, Steve Yates <[email protected]> wrote:
Thanks for your feedback, Mike. I have been thinking about this a little
more, and using Mahout as an example, I was considering the concept of
developing an enriched 'sink', so to speak, which would accept
input streams/messages from a Flume channel and forward them on to a
specific 'service', e.g. a Mahout service, which would subsequently
deliver the results to the configured sink. So yes, it would behave as an
intercept->filter->process->sink for applicable data items.
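To make the proposed flow concrete, here is a minimal, self-contained sketch of the intercept->filter->process->sink shape described above. This is not Flume API code; the class and method names are illustrative assumptions, standing in for whatever enriched-sink component would eventually host this logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch: intercept each event, keep only applicable ones,
// process them (e.g. via an external service), then hand them to a sink.
public class PipelineSketch {

    static List<String> run(List<String> events,
                            Predicate<String> filter,
                            Function<String, String> process) {
        List<String> sink = new ArrayList<>();
        for (String event : events) {       // intercept each incoming event
            if (filter.test(event)) {       // filter applicable data items
                sink.add(process.apply(event)); // process, then sink result
            }
        }
        return sink;
    }

    public static void main(String[] args) {
        List<String> events = List.of("click:1", "noise", "click:2");
        List<String> sunk = run(events,
                e -> e.startsWith("click:"),   // keep only click events
                String::toUpperCase);          // stand-in for real processing
        System.out.println(sunk); // prints [CLICK:1, CLICK:2]
    }
}
```

In a real integration the `process` step would be the call out to the external service (Mahout, in this example), and the returned results would flow to the configured Flume sink.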
I apologise if that is still vague. It would be great to receive further
feedback from the user group.
-Steve
Mike Percy <[email protected]> wrote:
Hi Steven,
Thanks for chiming in! Please see my responses inline:
On Thu, Feb 7, 2013 at 3:04 PM, Steven Yates <[email protected]> wrote:
The only missing link within the Flume architecture I see in this
conversation is the actual channels and brokers themselves, which
orchestrate this lovely undertaking of data collection.
Can you define what you mean by channels and brokers in this context,
since in Flume parlance a channel is a synonym for a queueing event buffer?
Also, can you elaborate more on what you mean by orchestration? I think I
know where you're going but I don't want to put words in your mouth.
One opportunity I do see (and I may be wrong) is for the data to be
offloaded into a system such as Apache Mahout before being sent to the
sink. Perhaps the concept of a ChannelAdapter of sorts, e.g. a Mahout
adapter? Just thinking out loud, and it may well be out of the question.
Why not a Mahout sink? Since Mahout often wants sequence files in a
particular format to begin its MapReduce processing (e.g. its k-Means
clustering implementation), Flume is already a good fit with its HDFS sink
and EventSerializers allowing for writing a plugin to format your data
however it needs to go in. In fact that works today if you have a batch
(even 5-minute batch) use case. With today's functionality, you could use
Oozie to coordinate kicking off the Mahout M/R job periodically, as new
data becomes available and the files are rolled.
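As a rough illustration of the batch path described above, a Flume 1.x agent config might look something like the following sketch. The agent name, host/port, HDFS path, and the `com.example.MahoutVectorSerializer` class are illustrative assumptions; the custom `EventSerializer` plugin formatting data for Mahout would have to be written separately.

```properties
# Hedged sketch of a Flume 1.x agent: Avro source -> memory channel ->
# HDFS sink, with a hypothetical custom EventSerializer for Mahout input.
agent.sources = avroSrc
agent.channels = memCh
agent.sinks = hdfsSink

agent.sources.avroSrc.type = avro
agent.sources.avroSrc.bind = 0.0.0.0
agent.sources.avroSrc.port = 4141
agent.sources.avroSrc.channels = memCh

agent.channels.memCh.type = memory
agent.channels.memCh.capacity = 10000

agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.channel = memCh
agent.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
agent.sinks.hdfsSink.hdfs.fileType = DataStream
agent.sinks.hdfsSink.hdfs.rollInterval = 300
# plug in a custom EventSerializer via its Builder class (hypothetical)
agent.sinks.hdfsSink.serializer = com.example.MahoutVectorSerializer$Builder
```

With five-minute rolls (`rollInterval = 300`), an Oozie coordinator could then kick off the Mahout M/R job against each newly rolled file.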
Perhaps even more interestingly, I can see a use case where you might want
to use Mahout to do streaming / realtime updates driven by Flume in the
form of an interceptor or a Mahout sink. If online machine learning (e.g.
stochastic gradient descent or something else online) was what you were
thinking, I wonder if there are any folks on this list who might have an
interest in helping to work on putting such a thing together.
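For a flavor of what an online update means here, below is a tiny, self-contained sketch of per-event stochastic gradient descent for a one-feature linear model; this is the kind of single-observation update a streaming component could apply as events arrive. The learning rate and data are illustrative, and this is not Mahout or Flume code.

```java
// Minimal online SGD sketch: the model is updated one (x, y) event at a
// time, mirroring how a streaming interceptor or sink might drive learning.
public class OnlineSgd {
    double w = 0.0, b = 0.0;  // model parameters for y ~ w*x + b
    final double lr;          // learning rate

    OnlineSgd(double lr) { this.lr = lr; }

    // one gradient step on a single observation (squared-error loss)
    void update(double x, double y) {
        double err = (w * x + b) - y; // prediction error
        w -= lr * err * x;            // gradient step for w
        b -= lr * err;                // gradient step for b
    }

    double predict(double x) { return w * x + b; }

    public static void main(String[] args) {
        OnlineSgd model = new OnlineSgd(0.05);
        // simulate a stream of events following y = 2x + 1
        for (int i = 0; i < 2000; i++) {
            double x = (i % 10) / 10.0;
            model.update(x, 2 * x + 1);
        }
        System.out.println(model.predict(0.5)); // converges near 2.0
    }
}
```

The point of the sketch is that each event updates the model immediately, with no batch job in between, which is what distinguishes this path from the Oozie-coordinated MapReduce approach.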
In any case, I'd like to hear more about specific use cases for streaming
analytics. :)
Regards,
Mike
--
Nitin Pawar