Hello Storm devs, I work for one of the big four experiments at CERN. We are currently thinking about "the future of our real time data processing". I tweeted about it <https://twitter.com/betatim/status/611503830161862657> and @ptgoetz replied saying "post to dev/usr". This is that post.
Below some brief introduction to what LHCb is, how we process data, and some constraints/context. I tried to think of all the relevant info and invariably will have forgotten some... We are currently surveying the landscape of real time stream processing. The question we are trying to answer is: "What toolkit could we use as starting point for the LHCb experiment's real time data processing if we started from scratch today?" Storm looks attractive, however I know nobody who has real world experience with it. So my first question would be: is it worth investigating Storm further for this kind of use-case? LHCb is one of the four big experiments at CERN < http://home.web.cern.ch/about>, the home of the Large Hadron Collider (you might have heard of us when we discovered the Higgs boson). A brief outline of what LHCb does when processing data: The LHCb trigger reduces/filters the stream of messages (or events) from the experimental hardware. It runs on a large farm of about 27000 physical cores with about 1-2GB of RAM per core. Its task is to decide which messages are worth recording to disk and which to reject. Most messages/events are rejected. Each message contains low level features (which pixels in the silicon sensor were hit, how much energy was deposited in read out channel X, etc). Part of the decision process is to construct higher level features from these. Usually the more high level a feature the more expensive it is to compute and the more discriminative it is. We routinely use machine learning algorithms like random forests/gradient boosted decision trees/NNs as part of the decision making. In the current setup each message is delivered to exactly one computing node, which processes it in full. If the message is accepted it is sent onwards for storing to disk. One avenue we are investigating is if it is feasible to instead pass a message around several "specialised" nodes that each do part of the processing. This would be attractive as it means you could easily integrate GPUs into the flow, or spin up more instances for a certain part of the processing if it is taking too long. Such a scheme would need extremely cheap serialisation. A good (longer) introduction written up for a NIPS workshop [pdf]: <https://www.dropbox.com/s/fcq3v28oxdvp6iw/RealTimeAnalysisLHC.pdf?dl=0> Some boundary conditions/context: - input rate of ~40million messages per second - 100kB per message, without higher level features - ~O(60) cores per box (extrapolating the tick-tock of Intel's roadmap to 2020) - ~O(2)GB RAM per core (very hard to tell where this will be in 2020) - network design is throughput driven, bound to the connectivity requirements. Events are distributed for filtering with a one-to-one communication, usually implemented with a "cheap" CLOS-topology like networking. The topology doesn't matter that much as long as it is non-blocking, that is, all nodes can send to the next stage of the pipeline guaranteeing a certain required throughput. - every messages has to be processed, no duplication (at the output of the processing chain) of messages - most messages are rejected, the output rate is (much) lower than the input rate - fault tolerance, when the LHC is delivering collisions to the experiment we have to be ready, otherwise valuable beam time is wasted, there is no pausing or going back - need to be able to re-run the processing pipeline on a single machine after the fact for debugging If this isn't the right place for discussing this or some other medium is more efficient, let me know. Did I miss some obvious constraint/bit of context? Thanks a lot for your time, you'd think LHCb has a lot of experts on this topic but I suspect there is even more (real world, modern) expertise out there :-) T ps. I CC'ed some of the other guys from LHCb who are interested in this.
