Eric,
Storm is not a batch processing system. It is meant for continuous streams of
data that are never done. You could use it for batch processing, like what
Flink does, but it is not really designed for that.
Q1: Each spout task knows how many tasks there are for that spout and which
one in that list it is. You get that information from the TopologyContext
passed into the spout's open method, so it can split the input accordingly.

Q2: That is up to you. You could pass the file (or list of files) into the
spout when you create it; just be careful not to open a handle to the file
until the open call, because the spout will be serialized and deserialized in
another process, and file handles don't survive serialization.
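For example, something like this covers both Q1 and Q2. It is a rough,
untested sketch: FileSliceSpout and the slicing math are made up for
illustration, and the backtype.storm package names are from the older
releases (newer ones use org.apache.storm instead).

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Map;

public class FileSliceSpout extends BaseRichSpout {
    private final String path;               // serializable, travels with the spout
    private transient RandomAccessFile file; // opened in open(), never in the constructor
    private long position;
    private long end;
    private SpoutOutputCollector collector;

    public FileSliceSpout(String path) {
        this.path = path;                     // Q2: the file is handed in at build time
    }

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        // Q1: which slice am I? TopologyContext tells each task its index
        // and how many tasks this component has in total.
        int taskIndex = context.getThisTaskIndex();
        int numTasks = context.getComponentTasks(context.getThisComponentId()).size();
        try {
            file = new RandomAccessFile(path, "r");
            long sliceSize = file.length() / numTasks;
            position = taskIndex * sliceSize;
            end = (taskIndex == numTasks - 1) ? file.length() : position + sliceSize;
            file.seek(position);
            // A real implementation would also skip ahead to the next record
            // boundary so two tasks don't both emit the record that straddles a split.
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void nextTuple() {
        try {
            if (file.getFilePointer() >= end) {
                return;                       // this task's slice is exhausted
            }
            String line = file.readLine();
            if (line != null) {
                // use the byte offset as the message id so ack/fail can identify it
                collector.emit(new Values(line), file.getFilePointer());
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }
}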
Q3: You can use the acking process to know. When acking is enabled, each tuple
is tracked as it is processed by your topology. When it is fully processed, the
ack method in your spout is called with the message id you emitted it with, so
the spout can keep a set of outstanding ids and notice when that set drains.
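As a rough illustration (again untested; the pending set and the
sliceExhausted flag are bookkeeping you maintain yourself, not part of the
Storm API), you could add something like this to the spout sketched above:

    // plus: import java.util.HashSet; import java.util.Set;
    // nextTuple, ack and fail all run on the same spout thread,
    // so a plain HashSet is fine here.
    private transient Set<Object> pending;   // ids emitted but not yet acked
    private boolean sliceExhausted = false;

    // In open():      pending = new HashSet<>();
    // In nextTuple(): pending.add(msgId) right after each emit, and set
    //                 sliceExhausted = true once the slice has been fully read.

    @Override
    public void ack(Object msgId) {
        pending.remove(msgId);
        if (sliceExhausted && pending.isEmpty()) {
            // Every tuple this task emitted has been fully processed downstream.
            // Signal that however you like: write a marker file, update a metric,
            // or notify whatever external process is watching the job.
        }
    }

    @Override
    public void fail(Object msgId) {
        // A tuple that failed or timed out comes back here; you would re-emit it,
        // e.g. by seeking back to the byte offset you used as the message id.
    }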
- Bobby

On Saturday, October 31, 2015 10:38 AM, Eric Frazer
<[email protected]> wrote:
I've seen a ton of examples for Storm so far (I'm a noob), but what I don't
understand is how the spouts do parallelism. Suppose I want to process a giant
file in Storm, and each source has to read and process 64 MB of the input
file. I can't envision a topology like this yet (because I'm ignorant).

Q1: How does each spout know which part of the giant input file to read?
Q2: How does each spout get told which file to read?
Q3: How do I know when the input file is completely processed? In the final
bolts' emit logic, can they all communicate to one final bolt and tell it
which piece of the source they've processed, and the final bolt checks off all
the done messages and when done, does - ? How can it signal the topology owner
it's done?

Is there an online forum that is easier to use than this email list server
thing, where I can ask and browse questions? This email list server is so
early 1990s, it's shocking...
All the online examples I've read about Storm have spouts that produce
essentially random information forever. Those examples are near-useless to me.
Processing a giant file, or processing data from a live generator of actual
data, would make much better examples. I hope I find some decent ones this
weekend.
Thanks!