Paul, I see that you mentioned building a SQL++ engine in your email. There
has been some work done in that direction by folks at UC San Diego; here is
a link to the paper:

https://arxiv.org/pdf/1405.3631.pdf
The SQL++ Query Language: Configurable, Unifying and Semi-structured
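For flavor, here is the kind of semi-structured query that SQL++
generalizes, run through Drill's JDBC driver. This is a minimal sketch,
not a definitive recipe: it assumes a Drillbit reachable on localhost, and
/tmp/people.json is a hypothetical sample file with nested records.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class NestedQueryExample {
        public static void main(String[] args) throws Exception {
            // Connect directly to a local Drillbit (assumed running).
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:drill:drillbit=localhost");
                 Statement stmt = conn.createStatement()) {
                // Dot-path navigation into nested JSON: the style of
                // semi-structured access that SQL++ formalizes.
                ResultSet rs = stmt.executeQuery(
                    "SELECT t.name, t.address.city "
                        + "FROM dfs.`/tmp/people.json` t");
                while (rs.next()) {
                    System.out.println(
                        rs.getString(1) + ", " + rs.getString(2));
                }
            }
        }
    }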
Regards,
Khurram

On Mon, Sep 10, 2018 at 12:37 PM, Timothy Farkas <[email protected]> wrote:
> It's an interesting idea, and I think the main inhibitor that prevents
> this from happening is that the popular big data projects are stuck on
> services. Specifically, if you need distributed coordination, you run a
> separate ZooKeeper cluster. If you need a batch compute engine, you run
> a separate Spark cluster. If you need a streaming engine, you deploy a
> separate Flink or Apex pipeline. If you want to reuse and combine all
> these services to make a new engine, you find yourself maintaining
> several different clusters of machines, which just isn't practical.
>
> IMO the paradigm needs to shift from services to libraries. If you need
> distributed coordination, import the ZooKeeper library and start the
> ZooKeeper client, which will run ZooKeeper threads and turn your
> application process into a member of the ZooKeeper quorum. If you need
> compute, import the compute engine library and start the compute engine
> client, and your application node will also turn into a worker node.
> When you start a library, it will discover the other nodes in your
> application to form a cohesive cluster. I think this shift has already
> begun. Calcite is a library, not a query planning service. etcd allows
> you to run an etcd instance in your application's process using a
> simple function call. Arrow is also a library, not a service. And
> Apache Ignite is a compute engine that allows you to run the Ignite
> compute engine in your application's process: https://ignite.apache.org/
>
> If we shift to thinking of libraries instead of services, then it
> becomes trivial to build new engines, since a new engine would just be
> a library that depends on other libraries. Also, you no longer manage
> several services; you only manage the service that you built.
>
> From the little I have read about Ray, it seems Ray is also moving in
> the library direction.
>
> Tim
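To make Tim's point concrete: Apache Ignite really can be embedded this
way. Below is a minimal sketch, assuming the ignite-core dependency is on
the classpath and that peers are discoverable with the default (multicast)
discovery; it illustrates the embedded style rather than a production
setup.

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;

    public class EmbeddedComputeNode {
        public static void main(String[] args) {
            // Starting Ignite in-process turns this JVM into a cluster
            // node; no separate compute service is deployed.
            try (Ignite ignite = Ignition.start()) {
                // Run a closure on every node in the cluster: compute
                // invoked as a library call rather than a job submitted
                // to an external service.
                ignite.compute().broadcast(
                    () -> System.out.println("Hello from an embedded node"));
            }
        }
    }

The same process could also embed a ZooKeeper server or an etcd instance,
which is exactly the "application process joins the quorum" model Tim
describes.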
> On Sun, Sep 9, 2018 at 10:21 PM Paul Rogers <[email protected]> wrote:
> > Hi All,
> >
> > Been reading up on distributed DB papers of late, including those
> > passed along by this group. Got me thinking about Arina's question
> > about where Drill might go in the long term.
> >
> > One thing I've noticed is that there are now quite a few distributed
> > compute frameworks, many of which support SQL in some form. A partial
> > list would include Drill, Presto, Impala, Hive LLAP, Spark SQL (sort
> > of), Dremio, Alibaba MaxCompute, Microsoft's Dryad, Scope and StreamS,
> > Google's Dremel and BigQuery and F1, and the batch version of Flink --
> > and those are just the ones off the top of my head. It seems every big
> > Internet shop has created one (Google, Facebook, Alibaba, Microsoft,
> > etc.).
> >
> > There is probably some lesson in here for Drill. Being a distributed
> > compute engine seems to have become a commodity at this late stage of
> > Big Data. But it is still extremely hard to build a distributed
> > compute engine that scales, especially for a small project like Drill.
> >
> > What unique value does Drill bring compared to the others? Certainly
> > being open source. Being in Java helps. Supporting xDBC is handy.
> > Being able to scan any type of data is great (but we tell people that
> > when they get serious, they should use only Parquet).
> >
> > As the team thinks about Arina's question about where Drill goes
> > next, one wonders if there is some way to share the load. Rather than
> > every project building its own DAG optimizer and execution engine,
> > its own distribution framework, its own scanners, its own
> > implementation of data types and of SQL functions, and so on, is
> > there a way to combine efforts?
> >
> > Ray [1] out of UC Berkeley is early days, but it promises to be
> > exactly the highly scalable, low-latency engine that Drill tries to
> > be. Calcite is the universal SQL parser and optimizer. Arrow wants to
> > be the database toolkit, including data format, network protocol,
> > etc. YARN, Mesos, Kubernetes and others want to manage the cluster
> > load. Ranger and Sentry want to do data security. There are now
> > countless storage systems: HDFS (classic, erasure coding, Ozone), S3,
> > ADLS, Ceph, MapR, Alluxio, Druid, Kudu, and countless key-value
> > stores. HMS is the metastore we all love to hate; it cries out for a
> > newer, more scalable design -- but one shared by all engines.
> >
> > Then, on the compute side, SQL is just one (important) model. Spark
> > and old-school MapReduce can handle general data transform problems.
> > Ray targets ML. Apex and Flink target streaming. Then there are
> > graph-query engines and more. Might these all be seen as different
> > compute models that run on top of a common scheduler, data, security,
> > storage, and distribution framework? Each has unique needs around
> > shuffle, planning, duration, etc., but the underlying mechanisms are
> > really all quite similar.
> >
> > Might Drill, in some far-future form, be the project that combines
> > these tools to create a SQL compute engine? Rather than a little
> > project like Drill trying to do it all (and Drill is now talking
> > about taking on the metastore challenge), perhaps Drill might evolve
> > to get out of the scheduling, distribution, data, networking,
> > metadata, and scan businesses to focus on the surprising complexity
> > of translating SQL into query plans that run on a common, shared
> > substrate.
> >
> > Of course, for that to happen, there would need to actually BE a
> > common substrate: probably in the form of a variety of projects that
> > tackle the different challenges listed above. The HDFS client,
> > Calcite, Arrow, Kubernetes, and Ray projects are hints in that
> > direction (though all have obvious gaps). Might such a rearranging of
> > technologies be a next act for Apache as the existing big data
> > projects live out their lifecycles and the world shifts to the cloud
> > for commodity Hadoop?
> >
> > Think how cool it would be to grab a compute framework off the shelf
> > and, say, build a SQL++ engine for complex data. Or meld ML and SQL
> > to create some new hybrid. Or run a query against multiple data
> > sources, with standard row/column security and common metadata,
> > without the need to write all this stuff yourself.
> >
> > Not sure how we get there, but it is clear that Apache owning and
> > maintaining many different versions of the same technology just
> > dilutes our very limited volunteer efforts.
> >
> > Thoughts? Anyone know of open source efforts in this general
> > direction?
> >
> > Thanks,
> > - Paul
> >
> > [1] https://rise.cs.berkeley.edu/projects/ray/
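On Paul's point that Calcite is already the universal SQL parser and
optimizer: it is consumed purely as a library. A minimal sketch of parsing
a query in-process follows; the query text is illustrative, and real use
would additionally wire in a schema for validation and a planner for
optimization.

    import org.apache.calcite.sql.SqlNode;
    import org.apache.calcite.sql.parser.SqlParser;

    public class ParseInProcess {
        public static void main(String[] args) throws Exception {
            // Calcite as a library: parse SQL into an AST inside the
            // application's own process, with no planner service involved.
            SqlParser parser = SqlParser.create(
                "SELECT name, age FROM people WHERE age > 21");
            SqlNode ast = parser.parseQuery();
            System.out.println(ast);
        }
    }

Drill, Hive, Flink, and Phoenix all already embed Calcite this way, which
is perhaps the strongest existing proof of the shared-substrate idea.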
