Paul, I see that you mentioned building a SQL++ engine in your email. There
has been some work done in that direction by folks at UC San Diego; here is
a link to the paper:

https://arxiv.org/pdf/1405.3631.pdf
The SQL++ Query Language: Configurable, Unifying and Semi-structured
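For flavor, here is the kind of semi-structured query that SQL++
generalizes, run through Drill's JDBC driver. This is a minimal sketch,
not a definitive recipe: it assumes a Drillbit reachable on localhost, and
/tmp/people.json is a hypothetical sample file with nested records.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class NestedQueryExample {
        public static void main(String[] args) throws Exception {
            // Connect directly to a local Drillbit (assumed running).
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:drill:drillbit=localhost");
                 Statement stmt = conn.createStatement()) {
                // Dot-path navigation into nested JSON: the style of
                // semi-structured access that SQL++ formalizes.
                ResultSet rs = stmt.executeQuery(
                    "SELECT t.name, t.address.city "
                        + "FROM dfs.`/tmp/people.json` t");
                while (rs.next()) {
                    System.out.println(
                        rs.getString(1) + ", " + rs.getString(2));
                }
            }
        }
    }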
Regards,
Khurram

On Mon, Sep 10, 2018 at 12:37 PM, Timothy Farkas <[email protected]> wrote:
> It's an interesting idea, and I think the main inhibitor that prevents
> this from happening is that the popular big data projects are stuck on
> services. Specifically, if you need distributed coordination, you run a
> separate ZooKeeper cluster. If you need a batch compute engine, you run
> a separate Spark cluster. If you need a streaming engine, you deploy a
> separate Flink or Apex pipeline. If you want to reuse and combine all
> these services to make a new engine, you find yourself maintaining
> several different clusters of machines, which just isn't practical.
>
> IMO the paradigm needs to shift from services to libraries. If you need
> distributed coordination, import the ZooKeeper library and start the
> ZooKeeper client, which will run ZooKeeper threads and turn your
> application process into a member of the ZooKeeper quorum. If you need
> compute, import the compute engine library and start the compute engine
> client, and your application node will also turn into a worker node.
> When you start a library, it will discover the other nodes in your
> application to form a cohesive cluster. I think this shift has already
> begun. Calcite is a library, not a query planning service. etcd allows
> you to run an etcd instance in your application's process using a
> simple function call. Arrow is also a library, not a service. And
> Apache Ignite is a compute engine that allows you to run the Ignite
> compute engine in your application's process: https://ignite.apache.org/
>
> If we shift to thinking of libraries instead of services, then it
> becomes trivial to build new engines, since a new engine would just be
> a library that depends on other libraries. Also, you no longer manage
> several services; you only manage the service that you built.
>
> From the little I have read about Ray, it seems Ray is also moving in
> the library direction.
>
> Tim
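To make Tim's point concrete: Apache Ignite really can be embedded this
way. Below is a minimal sketch, assuming the ignite-core dependency is on
the classpath and that peers are discoverable with the default (multicast)
discovery; it illustrates the embedded style rather than a production
setup.

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;

    public class EmbeddedComputeNode {
        public static void main(String[] args) {
            // Starting Ignite in-process turns this JVM into a cluster
            // node; no separate compute service is deployed.
            try (Ignite ignite = Ignition.start()) {
                // Run a closure on every node in the cluster: compute
                // invoked as a library call rather than a job submitted
                // to an external service.
                ignite.compute().broadcast(
                    () -> System.out.println("Hello from an embedded node"));
            }
        }
    }

The same process could also embed a ZooKeeper server or an etcd instance,
which is exactly the "application process joins the quorum" model Tim
describes.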
> On Sun, Sep 9, 2018 at 10:21 PM Paul Rogers <[email protected]> wrote:
> > Hi All,
> >
> > Been reading up on distributed DB papers of late, including those
> > passed along by this group. Got me thinking about Arina's question
> > about where Drill might go in the long term.
> >
> > One thing I've noticed is that there are now quite a few distributed
> > compute frameworks, many of which support SQL in some form. A partial
> > list would include Drill, Presto, Impala, Hive LLAP, Spark SQL (sort
> > of), Dremio, Alibaba MaxCompute, Microsoft's Dryad, Scope and StreamS,
> > Google's Dremel and BigQuery and F1, and the batch version of Flink --
> > and those are just the ones off the top of my head. It seems every big
> > Internet shop has created one (Google, Facebook, Alibaba, Microsoft,
> > etc.).
> >
> > There is probably some lesson in here for Drill. Being a distributed
> > compute engine seems to have become a commodity at this late stage of
> > Big Data. But it is still extremely hard to build a distributed
> > compute engine that scales, especially for a small project like Drill.
> >
> > What unique value does Drill bring compared to the others? Certainly
> > being open source. Being in Java helps. Supporting xDBC is handy.
> > Being able to scan any type of data is great (but we tell people that
> > when they get serious, they should use only Parquet).
> >
> > As the team thinks about Arina's question about where Drill goes
> > next, one wonders if there is some way to share the load. Rather than
> > every project building its own DAG optimizer and execution engine,
> > its own distribution framework, its own scanners, its own
> > implementation of data types and of SQL functions, and so on, is
> > there a way to combine efforts?
> >
> > Ray [1] out of UC Berkeley is early days, but it promises to be
> > exactly the highly scalable, low-latency engine that Drill tries to
> > be. Calcite is the universal SQL parser and optimizer. Arrow wants to
> > be the database toolkit, including data format, network protocol,
> > etc. YARN, Mesos, Kubernetes and others want to manage the cluster
> > load. Ranger and Sentry want to do data security. There are now
> > countless storage systems: HDFS (classic, erasure coding, Ozone), S3,
> > ADLS, Ceph, MapR, Alluxio, Druid, Kudu, and countless key-value
> > stores. HMS is the metastore we all love to hate; it cries out for a
> > newer, more scalable design -- but one shared by all engines.
> >
> > Then, on the compute side, SQL is just one (important) model. Spark
> > and old-school MapReduce can handle general data transform problems.
> > Ray targets ML. Apex and Flink target streaming. Then there are
> > graph-query engines and more. Might these all be seen as different
> > compute models that run on top of a common scheduler, data, security,
> > storage, and distribution framework? Each has unique needs around
> > shuffle, planning, duration, etc., but the underlying mechanisms are
> > really all quite similar.
> >
> > Might Drill, in some far-future form, be the project that combines
> > these tools to create a SQL compute engine? Rather than a little
> > project like Drill trying to do it all (and Drill is now talking
> > about taking on the metastore challenge), perhaps Drill might evolve
> > to get out of the scheduling, distribution, data, networking,
> > metadata, and scan businesses to focus on the surprising complexity
> > of translating SQL into query plans that run on a common, shared
> > substrate.
> >
> > Of course, for that to happen, there would need to actually BE a
> > common substrate: probably in the form of a variety of projects that
> > tackle the different challenges listed above. The HDFS client,
> > Calcite, Arrow, Kubernetes, and Ray projects are hints in that
> > direction (though all have obvious gaps). Might such a rearranging of
> > technologies be a next act for Apache as the existing big data
> > projects live out their lifecycles and the world shifts to the cloud
> > for commodity Hadoop?
> >
> > Think how cool it would be to grab a compute framework off the shelf
> > and, say, build a SQL++ engine for complex data. Or meld ML and SQL
> > to create some new hybrid. Or run a query against multiple data
> > sources, with standard row/column security and common metadata,
> > without the need to write all this stuff yourself.
> >
> > Not sure how we get there, but it is clear that Apache owning and
> > maintaining many different versions of the same technology just
> > dilutes our very limited volunteer efforts.
> >
> > Thoughts? Anyone know of open source efforts in this general
> > direction?
> >
> > Thanks,
> > - Paul
> >
> > [1] https://rise.cs.berkeley.edu/projects/ray/
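On Paul's point that Calcite is already the universal SQL parser and
optimizer: it is consumed purely as a library. A minimal sketch of parsing
a query in-process follows; the query text is illustrative, and real use
would additionally wire in a schema for validation and a planner for
optimization.

    import org.apache.calcite.sql.SqlNode;
    import org.apache.calcite.sql.parser.SqlParser;

    public class ParseInProcess {
        public static void main(String[] args) throws Exception {
            // Calcite as a library: parse SQL into an AST inside the
            // application's own process, with no planner service involved.
            SqlParser parser = SqlParser.create(
                "SELECT name, age FROM people WHERE age > 21");
            SqlNode ast = parser.parseQuery();
            System.out.println(ast);
        }
    }

Drill, Hive, Flink, and Phoenix all already embed Calcite this way, which
is perhaps the strongest existing proof of the shared-substrate idea.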
