Hi All,

Been reading up on distributed DB papers of late, including those passed along 
by this group. Got me thinking about Arina's question about where Drill might 
go in the long term.

One thing I've noticed is that there are now quite a few distributed compute 
frameworks, many of which support SQL in some form. A partial list would 
include Drill, Presto, Impala, Hive LLAP, Spark SQL (sort of), Dremio, Alibaba 
MaxCompute, Microsoft's Dryad, Scope and StreamS, Google's Dremel, BigQuery 
and F1, and the batch version of Flink -- and those are just the ones off the 
top of my head. It seems every big Internet shop has created one (Google, 
Facebook, Alibaba, Microsoft, etc.).

There is probably a lesson in here for Drill. Being a distributed compute 
engine seems to have become a commodity at this late stage of Big Data. But it 
is still extremely hard to build a distributed compute engine that scales, 
especially for a small project like Drill.

What unique value does Drill bring compared to the others? Certainly being open 
source. Being in Java helps. Supporting xDBC is handy. Being able to scan any 
type of data is great (but we tell people that when they get serious, they 
should use only Parquet).

As the team thinks about Arina's question about where Drill goes next, one 
wonders whether there is some way to share the load. Rather than every project 
building its own DAG optimizer and execution engine, its own distribution 
framework, its own scanners, its own implementations of data types and SQL 
functions, and so on, is there a way to combine efforts?

Ray [1] out of UC Berkeley is in its early days, but it promises to be exactly 
the highly scalable, low-latency engine that Drill tries to be. Calcite is the 
universal SQL parser and optimizer. Arrow wants to be the database toolkit, 
including data format, network protocol, etc. YARN, Mesos, Kubernetes and 
others want to manage the cluster load. Ranger and Sentry want to do data 
security. There are now countless storage systems: HDFS (classic, erasure 
coding, Ozone), S3, ADLS, Ceph, MapR, Alluxio, Druid, Kudu and any number of 
key-value stores. HMS is the metastore we all love to hate; it cries out for a 
newer, more scalable design -- but one shared by all engines.
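
Arrow is the piece I find most suggestive here. Just to make the idea 
concrete, below is a rough sketch, as I understand the Arrow Java API (the 
column name and values are invented for illustration), of building a small 
columnar batch -- the kind of thing that, in principle, any engine built on 
Arrow could hand to another without copying or translating:

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;

public class ArrowBatchSketch {
  public static void main(String[] args) {
    // The allocator tracks the off-heap memory behind the vectors.
    try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         IntVector orderIds = new IntVector("order_id", allocator)) {
      orderIds.allocateNew(3);
      orderIds.set(0, 101);
      orderIds.set(1, 102);
      orderIds.set(2, 103);
      orderIds.setValueCount(3);

      // A VectorSchemaRoot is the batch that would cross operator or
      // engine boundaries (e.g. via Arrow's IPC format).
      VectorSchemaRoot batch = VectorSchemaRoot.of(orderIds);
      System.out.println(batch.contentToTSVString());
    }
  }
}

Today each engine carries its own equivalent of this (Drill's own value 
vectors included), which is exactly the sort of duplication I mean.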

Then, on the compute side, SQL is just one (important) model. Spark and 
old-school MapReduce handle general data-transformation problems. Ray targets 
ML. Apex and Flink target streaming. And there are graph-query engines and 
more. Might these all be seen as different compute models that run on top of a 
common scheduler, data, security, storage, and distribution framework? Each has 
unique needs around shuffle, planning, duration, etc. But the underlying 
mechanisms are really all quite similar.

Might Drill, in some far-future form, be the project that combines these tools 
to create a SQL compute engine? Rather than a little project like Drill trying 
to do it all (and Drill is now talking about taking on the metastore 
challenge), perhaps Drill could evolve to get out of the scheduling, 
distribution, data, networking, metadata and scan businesses, and focus on the 
surprising complexity of translating SQL into query plans that run on a 
common, shared substrate.
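
For the "translating SQL into query plans" part, Calcite already hints at what 
that shared substrate could look like. Here's a rough sketch of going from SQL 
text to a relational plan with Calcite's planner API (the root schema is empty 
and the query is a throwaway VALUES expression, purely to show the shape of 
the API; a real engine would register its storage plugins as Calcite schemas):

import org.apache.calcite.plan.RelOptUtil;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.Planner;

public class CalcitePlanSketch {
  public static void main(String[] args) throws Exception {
    // Empty root schema; a real engine would hang its storage plugins
    // (HDFS, S3, Kudu, ...) off of this.
    SchemaPlus rootSchema = Frameworks.createRootSchema(true);
    FrameworkConfig config = Frameworks.newConfigBuilder()
        .defaultSchema(rootSchema)
        .build();
    Planner planner = Frameworks.getPlanner(config);

    // Parse, validate, then convert to a relational (logical) plan.
    SqlNode parsed = planner.parse("VALUES (1 + 2)");
    SqlNode validated = planner.validate(parsed);
    RelNode plan = planner.rel(validated).project();

    System.out.println(RelOptUtil.toString(plan));
  }
}

Everything downstream of that plan -- the scheduling, distribution, shuffles 
and scans -- is the machinery that could live in the shared substrate rather 
than in each engine.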

Of course, for that to happen, there would need to actually BE a common 
substrate: probably in the form of a variety of projects that tackle the 
different challenges listed above. The HDFS client, Calcite, Arrow, Kubernetes 
and Ray projects are hints in that direction (though all have obvious gaps). 
Might such a rearrangement of technologies be a next act for Apache as the 
existing big data projects live out their lifecycles and the world shifts to 
the cloud for commodity Hadoop?

Think how cool it would be to grab a compute framework off the shelf and, say, 
build a SQL++ engine for complex data. Or to meld ML and SQL to create some 
new hybrid. Or to run a query against multiple data sources, with standard 
row/column security and common metadata, without having to write all this 
stuff yourself.

Not sure how we get there, but it is clear that Apache owning and maintaining 
many different versions of the same technology just dilutes our very limited 
volunteer efforts.

Thoughts? Anyone know of open source efforts in this general direction?

Thanks,
- Paul

[1] https://rise.cs.berkeley.edu/projects/ray/



