Brian,

In the short run, what I would like to see is a simple server that will
accept a query in JSON form and send the resulting rows to another
networked server. The JSON form should allow for networked sources.
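As a rough illustration (nothing in this thread specifies the actual format), a JSON query form that "allows for networked sources" might nest a server reference and a sub-query inside the `from` clause, with evaluation recursing into the sub-query. All field names and the `REMOTE_SERVERS` stand-in below are hypothetical:

```python
import json

# Stand-in for RPC stubs to remote query servers; in practice this would
# be a networked protobuf RPC call, not an in-process lookup.
REMOTE_SERVERS = {}

def execute_query(query):
    """Evaluate a (hypothetical) JSON query, recursing into networked sources.

    A source is either inline rows:  {"local": [...rows...]}
    or a networked sub-query:        {"server": "host:port", "query": {...}}
    """
    source = query["from"]
    if "server" in source:
        # Sub-query operator: ship the nested query to the named server
        # and treat the rows it returns as this query's input.
        stub = REMOTE_SERVERS[source["server"]]
        rows = stub(json.dumps(source["query"]))
    else:
        rows = source["local"]
    # Project the selected columns from each row.
    return [{c: row[c] for c in query["select"]} for row in rows]

# Wire up a fake remote server that itself evaluates JSON queries.
REMOTE_SERVERS["drillbit-2:9047"] = lambda q: execute_query(json.loads(q))

query = {
    "select": ["name"],
    "from": {
        "server": "drillbit-2:9047",
        "query": {
            "select": ["name", "qty"],
            "from": {"local": [{"name": "a", "qty": 1},
                               {"name": "b", "qty": 2}]},
        },
    },
}
print(execute_query(query))  # [{'name': 'a'}, {'name': 'b'}]
```

The point of the nesting is that the sub-query operator does not need to know how the remote side produces its rows; it only forwards JSON and consumes rows.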
Associated with this could be an operator that sends a JSON sub-query to a
remote server. All of this can be accomplished pretty straightforwardly
using protobuf RPC. See http://code.google.com/p/protobuf-rpc-pro/ for a
good library for that, and see https://github.com/tdunning/mapr-spout for
an example of using it with ZooKeeper.

On Fri, Feb 1, 2013 at 10:19 AM, Brian O'Neill <[email protected]> wrote:

> Excellent points Ted. (again)
> You've got me thinking.
>
> I haven't delved into the algorithm enough to understand how widely the
> communication patterns vary. I'm sure you're right, but it'd be really
> cool if we could find a static topology construct that can accommodate
> the different communication patterns.
> (even if we have to sacrifice locality for now)
>
> If I get some time, I may have a look.
>
> thanks,
> -brian
>
> ---
> Brian O'Neill
> Lead Architect, Software Development
> Health Market Science
> The Science of Better Results
> 2700 Horizon Drive • King of Prussia, PA • 19406
> M: 215.588.6024 • @boneill42 <http://www.twitter.com/boneill42> •
> healthmarketscience.com
>
>
> On 1/31/13 6:33 PM, "Ted Dunning" <[email protected]> wrote:
>
> >I hear you. Deployment complexity is an evil thing.
> >
> >And your comment about being willing to trade some performance for
> >flexibility is also interesting.
> >
> >A big mismatch here, however, is that every query is going to cause
> >different desired communication patterns. One way to handle that is to
> >build a new topology for every query. That isn't going to fly due to
> >long topology deployment times. Essentially, Nimbus becomes the
> >out-of-band communication mechanism.
> >
> >The other option would be to use Storm to move query components around.
> >The communication patterns are much simpler in this case, but bolts
> >suddenly need the ability to communicate with arbitrary other bolts to
> >implement the data flow. This makes Storm handle the out-of-band
> >communication and leaves us with the implementation of the data
> >transform outside of Storm. Since the out-of-band comms are tiny, this
> >is perverse and doesn't use Storm for what it should be doing.
> >
> >So I really think that the takeaway here is that we need to be able to
> >pop up workers very quickly and easily. That is the lesson learned from
> >Storm here, and it really needs to happen. This also impacts features
> >like elasticity (where Drill might soak up excess capacity in a
> >cluster, but not hurt batch performance).
> >
> >
> >On Thu, Jan 31, 2013 at 12:43 PM, Brian O'Neill
> ><[email protected]> wrote:
> >
> >> Great points. Thanks Ted.
> >>
> >> I'm not sure if it is possible, but if there were a Storm topology
> >> deployment option, I think there might be appetite for that, since it
> >> would reduce the operations/admin complexity significantly for
> >> consumers that already have Storm deployed. (IMHO) I would be willing
> >> to sacrifice some performance to maintain only one set of distributed
> >> processing infrastructure.
> >>
> >> With respect to locality information, I think Storm will eventually
> >> need to add out-of-band information to optimize the tuple routing.
> >> We developed the storm-cassandra bolt, and I'm eager to get to the
> >> point where we can supply ring/token information to Storm so it can
> >> route the tuples to the nodes that contain the data.
> >>
> >> (Maybe it gets carried around in the tuple and leveraged by the
> >> underlying infrastructure -- much like Nathan did with the
> >> transaction id for Trident?)
> >>
> >> But I fully appreciate your points. (especially regarding
> >> java-centricity, serialization, Kryo, etc.)
> >>
> >> -brian
> >>
> >> --
> >> Brian O'Neill
> >> Lead Architect, Software Development
> >> Health Market Science
> >> The Science of Better Results
> >> 2700 Horizon Drive • King of Prussia, PA • 19406
> >> M: 215.588.6024 • @boneill42 • healthmarketscience.com
> >>
> >> On Jan 30, 2013, at 3:16 PM, Ted Dunning wrote:
> >>
> >> > On Wed, Jan 30, 2013 at 11:53 AM, Brian O'Neill
> >> > <[email protected]> wrote:
> >> >
> >> >> ...
> >> >> How do we intend to distribute the execution engine across a set
> >> >> of machines?
> >> >>
> >> >
> >> > There are a variety of thoughts. These include:
> >> >
> >> > - a custom-built execution controller similar to Storm's Nimbus
> >> >
> >> > - use Storm's Nimbus
> >> >
> >> > - use the custom-built controller via YARN. Or Mesos. Or the MapR
> >> > warden
> >> >
> >> > - start them by hand.
> >> >
> >> > Obviously the last option will be the one that is used in initial
> >> > testing.
> >> >
> >> >> Any thought to deploying the engine as a Storm topology?
> >> >>
> >> >
> >> > Using Storm probably limits the performance that we can get.
> >> > Storm's performance is creditable but not super awesomely
> >> > impressive.
> >> >
> >> > Some of the performance issues with Storm include:
> >> >
> >> > - limited to Java. This may or may not make a difference in the end
> >> > in terms of performance, but we definitely want flexibility here.
> >> > Java can be awesomely fast (see LMAX and the Disruptor), but C++
> >> > may be more predictable.
> >> > We definitely *don't* want to decide for all time right now which
> >> > option we take, and we definitely *do* want to have the C++ option
> >> > in our hip pocket later, regardless of how we build execution
> >> > engines now. Part of Storm's limitations here has to do with the
> >> > use of Kryo instead of a portable serializer like protobufs.
> >> >
> >> > - the designs I have seen or heard batted around tend to deal with
> >> > batches of records represented in an ephemeral column-oriented
> >> > design. It will also be important for records to be kept in
> >> > unmaterialized, virtual form to minimize copying, especially when
> >> > flattening arrays and such. Storm allows tuples to be kept in
> >> > memory when bolts are on the same machine, but insists on
> >> > serializing and deserializing them at the frontier. To control
> >> > this, we would have to nest serializations, which seems a bit like
> >> > incipient insanity.
> >> >
> >> > Other issues include:
> >> >
> >> > - Drill execution engines will need access to a considerable
> >> > amount of out-of-band information, such as schemas and statistics.
> >> > How do we manage that in a restrictive paradigm like Storm?
> >> >
> >> > - Storm hides location from bolts. Drill needs to make decisions
> >> > based on the location of execution engines and data.
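For concreteness, the "simple server" proposed at the top of this thread -- accept a JSON query over the network, stream the resulting rows to the caller -- could be wired up roughly as below. This is only a stand-in sketch: it uses newline-delimited JSON over a plain TCP socket where Ted suggests protobuf-rpc-pro, and the canned rows and all names are hypothetical:

```python
import json
import socket
import socketserver
import threading

class QueryServer(socketserver.StreamRequestHandler):
    """Accepts one newline-delimited JSON query and streams rows back."""

    def handle(self):
        query = json.loads(self.rfile.readline())  # consume the query
        # A real server would evaluate the query; here we emit canned
        # rows because the wire protocol is the point of the sketch.
        for row in [{"id": 1}, {"id": 2}]:
            self.wfile.write((json.dumps(row) + "\n").encode())

def send_subquery(host, port, query):
    """Client side of the sub-query operator: ship a JSON query, collect rows."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall((json.dumps(query) + "\n").encode())
        sock.shutdown(socket.SHUT_WR)  # signal end of query
        return [json.loads(line) for line in sock.makefile()]

# Demo: start the server on an ephemeral port and query it.
server = socketserver.TCPServer(("127.0.0.1", 0), QueryServer)
threading.Thread(target=server.serve_forever, daemon=True).start()
rows = send_subquery(*server.server_address, {"select": ["*"], "from": "t"})
server.shutdown()
print(rows)  # [{'id': 1}, {'id': 2}]
```

Swapping the line-delimited JSON transport for a protobuf RPC layer would change only `send_subquery` and `handle`; the query-shipping structure stays the same.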
