I would like to know whether load shedding is available in other real-time systems like Storm, Spark, or Kafka. If not, why has it not been implemented in those systems? Regards
On Tuesday, 11 October 2011 23:18:39 UTC+5:30, Leo Neumeyer wrote:
>
> I think that both projects share similar goals. I haven't done a detailed study of Storm so I can't give you a very detailed comparison. However, it is great to see more of these systems so we can keep learning from each other.
>
> Let me summarize the approach we are taking in S4-piper, which is the next release that will be done in the Apache Incubator. (Here for now: https://github.com/leoneu/s4-piper)
>
> The high-level goal is unchanged: distributed stream processing and ease of use. This implies hiding the distributed nature of the application from the user.
>
> We basically embrace symmetry, strict OO design, no multithreading required, the Prototype design pattern, extensive use of the Java platform, static typing, no string literals, and no large XML configuration files.
>
> - Symmetry: all nodes in a cluster are identical. This makes management, configuration, and deployment easier, with fewer parts that can break.
>
> - OO design: not much I can say here except that processing elements (PEs) are objects, streams are objects, and events are objects. The core API is very small and works in a coordinated way to build the platform.
>
> - In most cases, application developers don't need to write any multithreaded code. By default, PE threads are synchronized.
>
> - Prototype design pattern: once a PE is configured, we instantiate a PE object that we call a PE prototype. The PE prototype has the PE type and the configuration. PE instances are cloned from the prototype and keyed using a (key_finder, key_value) tuple. The key_finder is a function that extracts a value from the event, and that value is the key for the PE instance. This guarantees that we have a unique instance of a PE for a given key. This can be used in a single node or in a cluster with multiple nodes. For the app developer, this means that the app is a graph of prototype PEs connected by streams.
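A minimal sketch of the prototype-and-key_finder idea described above. All class and method names here are hypothetical illustrations, not the actual S4-piper API: a configured prototype clones one PE instance per distinct key value that the key finder extracts from each event.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the Prototype pattern described above;
// the real S4-piper classes differ.
public class PrototypeSketch {

    // A minimal event: a bag of string fields.
    public static class Event {
        final Map<String, String> fields = new HashMap<>();
        public Event with(String k, String v) { fields.put(k, v); return this; }
        public String get(String k) { return fields.get(k); }
    }

    // A minimal PE that counts the events routed to its key.
    public static class CounterPE {
        int count = 0;
        void onEvent(Event e) { count++; }
    }

    // The PE prototype: holds the configuration (here just the key finder)
    // and lazily clones one PE instance per key value.
    public static class PEPrototype {
        final Function<Event, String> keyFinder;
        final Map<String, CounterPE> instances = new HashMap<>();
        PEPrototype(Function<Event, String> keyFinder) { this.keyFinder = keyFinder; }

        void receive(Event e) {
            String key = keyFinder.apply(e);
            // Guarantees a unique PE instance for a given key.
            instances.computeIfAbsent(key, k -> new CounterPE()).onEvent(e);
        }
    }

    // Push a few events through and report the per-key counts.
    public static Map<String, Integer> run() {
        PEPrototype proto = new PEPrototype(e -> e.get("user"));
        proto.receive(new Event().with("user", "alice"));
        proto.receive(new Event().with("user", "bob"));
        proto.receive(new Event().with("user", "alice"));
        Map<String, Integer> counts = new HashMap<>();
        proto.instances.forEach((k, pe) -> counts.put(k, pe.count));
        return counts; // {alice=2, bob=1}
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Note that nothing in the dispatch path uses string literals to locate objects: the key finder is a typed function reference, which is what makes the graph refactorable and checkable in an IDE.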
> The key logic resides in the stream object. Building an app is as easy as drawing a graph. We have a basic API, but people will be able to build tools on top of it (GUIs, DSLs, etc.). We are planning to provide an API on top of Guice so we can take advantage of Guice's dependency injection under the hood.
>
> - Using the Java platform: we feel that having a consistent platform that is familiar to many developers helps make things easier. While dealing with some complexity and lesser-known frameworks under the hood (generics, Guice, OSGi for dynamic app loading, ...), we try to leave the external API as pure and simple as possible. We also want to take advantage of new language features coming in Java 7, 8, ... We also have the option to easily add a Scala API.
>
> - Static typing: I always thought that dynamic languages were much nicer, but I can see the trade-offs. In a large system with multiple programmers and modules, having static typing with a good OO design makes understanding and maintaining the code much easier. To accomplish this we use Guice to replace XML config files, which are a nightmare if you have lots of PEs to configure. I'd rather use a Java API to build the graph. We eliminate the use of string literals to find objects; this is cleaner, more efficient, easy to refactor, and we can easily find things in Eclipse. For remote objects, we send POJOs around and pretty much hide all the details. The stream object is smart enough to pass events to the comm layer when the target PE instance is in another JVM. The event magically arrives in the stream's queue at the destination.
>
> - In terms of pluggability, S4 can use any comm protocol and serialization scheme. You just need to implement it in the comm layer. Right now we have a simple UDP implementation and a TCP implementation using Netty. We tested ZMQ but saw no performance advantage and no need for it since everything is Java in our case.
> However, I may be missing something. Netty gives you the flexibility to implement alternate protocols.
>
> - In terms of losing events, it is not a matter of losing or not losing; you often need to plan how to handle peak traffic in a real-time system, otherwise it is probably over-engineered. Our plan is to provide an API for load shedding so app developers can decide what to do when the queues are getting full. Some may want to throw away events; others may want to switch to a less computationally expensive algorithm. I prefer to think in a probabilistic way, that is, event loss or load shedding is part of the algorithm and is evaluated in terms of a performance metric, as is done in a communications system.
>
> - I also experimented with how to run batch tasks in S4. Unlike streaming, in batch mode we can afford infinite delays and prevent event loss. This is important for offline tasks and for debugging. For this we need to support blocking queues in the stream, so that when a queue is full it blocks instead of losing events. In a cluster, this means that blocking needs to propagate upstream across nodes. The downside is that you may end up with deadlocks if the graph has cycles, but that is a whole other story. In the model example I push events at a high data rate into S4 to train a model. The exact same model is also used at run time. This brings consistency for training and execution because it uses the same code base and the same framework. It doesn't have all the functionality of Hadoop but can be used for some tasks.
>
> Hope this helps for now.
>
> -leo
>
> On Oct 11, 2011, at 6:54 AM, OG wrote:
>
> You may find this useful:
> http://blog.sematext.com/2011/09/26/event-stream-processor-matrix/
>
> Otis
> --
> Sematext is hiring! http://sematext.com/about/jobs.html
>
> On Oct 10, 5:29 am, yasith tharindu <[email protected]> wrote:
> > Anybody have an idea about the $Subject?
> >
> > --
> > Thanks..
> > Regards...
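The streaming-vs-batch distinction in the quoted reply can be sketched with a bounded queue. The planned S4 load-shedding API is not public, so the names and policies below are made up for illustration: in streaming mode a full queue sheds (drops) events so the pipeline never stalls; in batch mode the producer blocks until space frees up, trading delay for zero loss.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of the load-shedding choice described above.
public class SheddingSketch {
    public enum Policy { DROP_WHEN_FULL, BLOCK_WHEN_FULL }

    // Feed `events` events into a queue of size `capacity` with no
    // consumer running, and return how many were accepted.
    public static int offerAll(int capacity, int events, Policy policy)
            throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
        int accepted = 0;
        for (int i = 0; i < events; i++) {
            if (policy == Policy.DROP_WHEN_FULL) {
                // Streaming: shed load, never stall the pipeline.
                if (queue.offer(i)) accepted++;
            } else {
                // Batch: wait for space instead of losing the event.
                // (We drain one element here to keep the sketch
                // single-threaded; a real consumer would do this.)
                if (queue.remainingCapacity() == 0) queue.take();
                queue.put(i);
                accepted++;
            }
        }
        return accepted;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(offerAll(4, 10, Policy.DROP_WHEN_FULL));  // 4
        System.out.println(offerAll(4, 10, Policy.BLOCK_WHEN_FULL)); // 10
    }
}
```

The email's other suggested policy, switching to a cheaper algorithm under load, would fit the same hook: instead of dropping on a failed `offer`, the PE could route the event to a lower-cost code path.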
