Brian,

In the short run, what I would like to see is a simple server that will
accept a query in JSON form and send the resulting rows to another
networked server. The JSON form should allow for networked sources.
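As a rough illustration (nothing in this thread specifies the actual format), a JSON query form that "allows for networked sources" might nest a server reference and a sub-query inside the `from` clause, with evaluation recursing into the sub-query. All field names and the `REMOTE_SERVERS` stand-in below are hypothetical:

```python
import json

# Stand-in for RPC stubs to remote query servers; in practice this would
# be a networked protobuf RPC call, not an in-process lookup.
REMOTE_SERVERS = {}

def execute_query(query):
    """Evaluate a (hypothetical) JSON query, recursing into networked sources.

    A source is either inline rows:  {"local": [...rows...]}
    or a networked sub-query:        {"server": "host:port", "query": {...}}
    """
    source = query["from"]
    if "server" in source:
        # Sub-query operator: ship the nested query to the named server
        # and treat the rows it returns as this query's input.
        stub = REMOTE_SERVERS[source["server"]]
        rows = stub(json.dumps(source["query"]))
    else:
        rows = source["local"]
    # Project the selected columns from each row.
    return [{c: row[c] for c in query["select"]} for row in rows]

# Wire up a fake remote server that itself evaluates JSON queries.
REMOTE_SERVERS["drillbit-2:9047"] = lambda q: execute_query(json.loads(q))

query = {
    "select": ["name"],
    "from": {
        "server": "drillbit-2:9047",
        "query": {
            "select": ["name", "qty"],
            "from": {"local": [{"name": "a", "qty": 1},
                               {"name": "b", "qty": 2}]},
        },
    },
}
print(execute_query(query))  # [{'name': 'a'}, {'name': 'b'}]
```

The point of the nesting is that the sub-query operator does not need to know how the remote side produces its rows; it only forwards JSON and consumes rows.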
Associated with this could be an operator that sends a JSON sub-query to a
remote server. All of this can be accomplished pretty straightforwardly
using protobuf RPC. See http://code.google.com/p/protobuf-rpc-pro/ for a
good library for that, and see https://github.com/tdunning/mapr-spout for
an example of using it with ZooKeeper.

On Fri, Feb 1, 2013 at 10:19 AM, Brian O'Neill <[email protected]> wrote:

> Excellent points Ted. (again)
> You've got me thinking.
>
> I haven't delved into the algorithm enough to understand how widely the
> communication patterns vary. I'm sure you're right, but it'd be really
> cool if we could find a static topology construct that can accommodate
> the different communication patterns.
> (even if we have to sacrifice locality for now)
>
> If I get some time, I may have a look.
>
> thanks,
> -brian
>
> ---
> Brian O'Neill
> Lead Architect, Software Development
> Health Market Science
> The Science of Better Results
> 2700 Horizon Drive • King of Prussia, PA • 19406
> M: 215.588.6024 • @boneill42 <http://www.twitter.com/boneill42> •
> healthmarketscience.com
>
>
> On 1/31/13 6:33 PM, "Ted Dunning" <[email protected]> wrote:
>
> >I hear you. Deployment complexity is an evil thing.
> >
> >And your comment about being willing to trade some performance for
> >flexibility is also interesting.
> >
> >A big mismatch here, however, is that every query is going to cause
> >different desired communication patterns. One way to handle that is to
> >build a new topology for every query. That isn't going to fly due to
> >long topology deployment times. Essentially, Nimbus becomes the
> >out-of-band communication mechanism.
> >
> >The other option would be to use Storm to move query components around.
> >The communication patterns are much simpler in this case, but bolts
> >suddenly need the ability to communicate with arbitrary other bolts to
> >implement the data flow. This makes Storm handle the out-of-band
> >communication and leaves us with the implementation of the data
> >transform outside of Storm. Since the out-of-band comms are tiny, this
> >is perverse and doesn't use Storm for what it should be doing.
> >
> >So I really think that the takeaway here is that we need to be able to
> >pop up workers very quickly and easily. That is the lesson learned from
> >Storm here, and it really needs to happen. This also impacts features
> >like elasticity (where Drill might soak up excess capacity in a
> >cluster, but not hurt batch performance).
> >
> >
> >On Thu, Jan 31, 2013 at 12:43 PM, Brian O'Neill
> ><[email protected]> wrote:
> >
> >> Great points. Thanks Ted.
> >>
> >> I'm not sure if it is possible, but if there were a Storm topology
> >> deployment option, I think there might be appetite for that, since it
> >> would reduce the operations/admin complexity significantly for
> >> consumers that already have Storm deployed. (IMHO) I would be willing
> >> to sacrifice some performance to maintain only one set of distributed
> >> processing infrastructure.
> >>
> >> With respect to locality information, I think Storm will eventually
> >> need to add out-of-band information to optimize the tuple routing.
> >> We developed the storm-cassandra bolt, and I'm eager to get to the
> >> point where we can supply ring/token information to Storm so it can
> >> route the tuples to the nodes that contain the data.
> >>
> >> (Maybe it gets carried around in the tuple and leveraged by the
> >> underlying infrastructure -- much like Nathan did with the
> >> transaction id for Trident?)
> >>
> >> But I fully appreciate your points. (especially regarding
> >> java-centricity, serialization, Kryo, etc.)
> >>
> >> -brian
> >>
> >> --
> >> Brian O'Neill
> >> Lead Architect, Software Development
> >> Health Market Science
> >> The Science of Better Results
> >> 2700 Horizon Drive • King of Prussia, PA • 19406
> >> M: 215.588.6024 • @boneill42 • healthmarketscience.com
> >>
> >> On Jan 30, 2013, at 3:16 PM, Ted Dunning wrote:
> >>
> >> > On Wed, Jan 30, 2013 at 11:53 AM, Brian O'Neill
> >> > <[email protected]> wrote:
> >> >
> >> >> ...
> >> >> How do we intend to distribute the execution engine across a set
> >> >> of machines?
> >> >>
> >> >
> >> > There are a variety of thoughts. These include:
> >> >
> >> > - a custom-built execution controller similar to Storm's Nimbus
> >> >
> >> > - use Storm's Nimbus
> >> >
> >> > - use the custom-built controller via YARN. Or Mesos. Or the MapR
> >> > warden
> >> >
> >> > - start them by hand.
> >> >
> >> > Obviously the last option will be the one that is used in initial
> >> > testing.
> >> >
> >> >> Any thought to deploying the engine as a Storm topology?
> >> >>
> >> >
> >> > Using Storm probably limits the performance that we can get.
> >> > Storm's performance is creditable but not super awesomely
> >> > impressive.
> >> >
> >> > Some of the performance issues with Storm include:
> >> >
> >> > - limited to Java. This may or may not make a difference in the end
> >> > in terms of performance, but we definitely want flexibility here.
> >> > Java can be awesomely fast (see LMAX and the Disruptor), but C++
> >> > may be more predictable.
> >> > We definitely *don't* want to decide for all time right now which
> >> > option we take, and we definitely *do* want to have the C++ option
> >> > in our hip pocket later, regardless of how we build execution
> >> > engines now. Part of Storm's limitations here has to do with the
> >> > use of Kryo instead of a portable serializer like protobufs.
> >> >
> >> > - the designs I have seen or heard batted around tend to deal with
> >> > batches of records represented in an ephemeral column-oriented
> >> > design. It will also be important for records to be kept in
> >> > unmaterialized, virtual form to minimize copying, especially when
> >> > flattening arrays and such. Storm allows tuples to be kept in
> >> > memory when bolts are on the same machine, but insists on
> >> > serializing and deserializing them at the frontier. To control
> >> > this, we would have to nest serializations, which seems a bit like
> >> > incipient insanity.
> >> >
> >> > Other issues include:
> >> >
> >> > - Drill execution engines will need access to a considerable
> >> > amount of out-of-band information, such as schemas and statistics.
> >> > How do we manage that in a restrictive paradigm like Storm?
> >> >
> >> > - Storm hides location from bolts. Drill needs to make decisions
> >> > based on the location of execution engines and data.
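For concreteness, the "simple server" proposed at the top of this thread -- accept a JSON query over the network, stream the resulting rows to the caller -- could be wired up roughly as below. This is only a stand-in sketch: it uses newline-delimited JSON over a plain TCP socket where Ted suggests protobuf-rpc-pro, and the canned rows and all names are hypothetical:

```python
import json
import socket
import socketserver
import threading

class QueryServer(socketserver.StreamRequestHandler):
    """Accepts one newline-delimited JSON query and streams rows back."""

    def handle(self):
        query = json.loads(self.rfile.readline())  # consume the query
        # A real server would evaluate the query; here we emit canned
        # rows because the wire protocol is the point of the sketch.
        for row in [{"id": 1}, {"id": 2}]:
            self.wfile.write((json.dumps(row) + "\n").encode())

def send_subquery(host, port, query):
    """Client side of the sub-query operator: ship a JSON query, collect rows."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall((json.dumps(query) + "\n").encode())
        sock.shutdown(socket.SHUT_WR)  # signal end of query
        return [json.loads(line) for line in sock.makefile()]

# Demo: start the server on an ephemeral port and query it.
server = socketserver.TCPServer(("127.0.0.1", 0), QueryServer)
threading.Thread(target=server.serve_forever, daemon=True).start()
rows = send_subquery(*server.server_address, {"select": ["*"], "from": "t"})
server.shutdown()
print(rows)  # [{'id': 1}, {'id': 2}]
```

Swapping the line-delimited JSON transport for a protobuf RPC layer would change only `send_subquery` and `handle`; the query-shipping structure stays the same.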
