Eric,
Storm is not a batch processing system. It is meant for continuous streams of
data that are never done. You could use it for batch processing, like Flink
does, but it is not really designed for that.
Q1: Each spout instance knows how many spout tasks there are in total and which
one in that list it is. You get that information from the TopologyContext
passed into the open method of the spout, so each instance can split the input
accordingly.
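Something like this, as a rough sketch (the org.apache.storm imports below are
for 1.x and later; older releases use the backtype.storm packages, and the
slice math in the comment is just an illustration):

import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;

public class SliceAwareSpout extends BaseRichSpout {
    private int taskIndex;
    private int totalTasks;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        // 0-based index of this task among the tasks running this spout component
        taskIndex = context.getThisTaskIndex();
        // total number of tasks running this spout component
        totalTasks = context.getComponentTasks(context.getThisComponentId()).size();
        // e.g. task i of n could own bytes [i * chunkSize, (i + 1) * chunkSize) of the file
    }

    @Override
    public void nextTuple() {
        // read from this task's slice of the input and emit it
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }
}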
Q2: That is up to you. You could pass the list of files into the spout when you
create it; just be careful not to open a handle to the file until the open call,
because the spout will be serialized and deserialized in another process, and
file handles don't like serialization.
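For example (again just a sketch; the path, component name, and parallelism
below are made up): keep only serializable fields, like the path string, in the
constructor, and create the reader in open:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;

public class FileSpout extends BaseRichSpout {
    private final String path;                 // serializable, set when the topology is built
    private transient BufferedReader reader;   // not serializable, so create it in open()

    public FileSpout(String path) {
        this.path = path;
    }

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        try {
            reader = new BufferedReader(new FileReader(path));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void nextTuple() {
        // read a line from reader and emit it
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }
}

// when building the topology:
// builder.setSpout("file-reader", new FileSpout("/data/giant-input.txt"), 10);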
Q3: You can use the acking process to know. When acking is enabled, each tuple
is tracked as it is processed by your topology. When it is fully done, the ack
method in your spout is called.
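A rough sketch of that (readNextLineOrNull is a made-up helper standing in for
your real file reading); the important parts are the message id passed to emit
and the ack/fail callbacks:

import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class AckingFileSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private long emitted = 0;
    private long acked = 0;
    private boolean doneReading = false;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        String line = readNextLineOrNull();   // made-up helper; returns null at EOF
        if (line != null) {
            emitted++;
            // the second argument is the message id handed back to ack()/fail()
            collector.emit(new Values(line), emitted);
        } else {
            doneReading = true;
        }
    }

    @Override
    public void ack(Object msgId) {
        acked++;
        if (doneReading && acked == emitted) {
            // every emitted tuple has been fully processed by the topology
        }
    }

    @Override
    public void fail(Object msgId) {
        // the tuple tree for msgId failed or timed out; re-emit or record it
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }

    private String readNextLineOrNull() {
        return null;   // stand-in for real file reading
    }
}

- Bobby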


On Saturday, October 31, 2015 10:38 AM, Eric Frazer <[email protected]> wrote:

I've seen a ton of examples for Storm so far (I'm a noob)... but what I don't
understand is how the spouts do parallelism. Suppose I want to process a giant
file in Storm, where each source has to read and process 64MB of the input
file. I can't envision a topology like this yet (because I'm ignorant).

Q1: How does each spout know which part of the giant input file to read?
Q2: How does each spout get told which file to read?
Q3: How do I know when the input file is completely processed? In the final
bolts' emit logic, can they all communicate with one final bolt and tell it
which piece of the source they've processed, so the final bolt checks off all
the done messages and, when everything is done, does... what? How can it signal
the topology owner that it's done?

Is there an online forum that is easier to use than this email list server
thing, where I can ask and browse questions? This email list server is so
early 1990's, it's shocking...

All the online examples I've read about Storm have spouts that produce
essentially random information forever. They are near-useless examples to me.
Processing a giant file, or processing data from a live generator of actual
data, would be much better examples. I hope I find some decent ones this
weekend.

Thanks!
