Hi All,

The Google F1 query engine [1] has many features similar to Drill and suggests many ideas that we might adopt. I believe F1 has been mentioned here before; the paper is worth a read if you've not yet done so. Like Drill, F1 derives from the original Dremel work [2] done at Google. Just as Drill 1.0 was inspired by Dremel, perhaps Drill 2.0 can be inspired by F1 Query. Some highlights below.

Areas where F1 Query is similar to Drill:

* "F1 Query is a stand-alone, federated query processing platform that executes SQL queries against data stored in different file-based formats as well as different storage systems at Google (e.g., Bigtable, Spanner, Google Spreadsheets, etc.)."

* "The data processing and analysis use cases in large organizations like Google exhibit diverse requirements in data sizes, latency, data sources and sinks, freshness, and the need for custom business logic. ... F1 Query, [is] an SQL query engine that is unique not because of its focus on doing one thing well, but instead because it aims to cover all corners of the requirements space for enterprise data processing and analysis."

* "F1 Query decouples database storage from query processing, and as a result, it can serve as an engine for all data in the datacenter"

* The entire planner framework is similar to Drill's (though not based on Calcite), and execution uses a Volcano-like iterator similar to Drill's RecordBatch iterator.

Of course, F1 operates at a scale just a bit larger than Drill: "F1 Query is highly decentralized and replicated over multiple datacenters, using hundreds to thousands of machines at each datacenter."

Areas where Drill can learn from F1 Query:

* "F1 Query runs as a stand-alone, federated query processing platform to execute declarative queries against data stored in different file-based formats as well as different remote storage systems ... enabling use cases large and small, with simple or highly customized business logic, and across whichever data sources the data resides in" Similar to Drill, but the business logic support appears better.

* "In F1 Query, short queries are executed on a single node, while larger queries are executed in a low-overhead distributed execution mode with no checkpointing and limited reliability guarantees. The largest queries are run in a reliable batch-oriented execution mode that uses the MapReduce framework" (Drill has the middle bits; the single-node and delegate-to-MR (Spark?) modes are interesting extensions.)

* "F1 Query abstracts away the details of each storage type. It makes all data appear as if it is stored in relational tables (with rich structured data types in the form of Protocol Buffers)" (Subtle point: F1 supports any data source, but requires that each provide a schema.)

* "F1 Query is extensible in various ways: it supports custom data sources as well as user defined scalar functions (UDFs), aggregation functions (UDAs), and table-valued functions (TVFs). User defined functions can use any type of data as input and output, including Protocol Buffers. Clients may express user-defined logic in SQL syntax, providing them with a simple way of abstracting common concepts from their queries and making them more readable and maintainable. They may also use ... scripts to define additional functions for ad-hoc queries and analysis [including] compiled and managed languages like ... Java". F1 has an interesting twist on the UDF function registry.

* "Querying protocol buffers presents many of the same challenges as semi-structured data formats like ... JSON ... Where JSON is entirely dynamically typed and often stored in human readable format, protocol buffers are statically typed and typically stored in a compact binary format, enabling much more efficient decoding. ... The exact structure and types of all protos referenced in a query are known at query planning time" That is, F1 does not waste time with ill-defined, loosey-goosey JSON. However, the user can define an up-front schema when reading CSV (and presumably JSON).

That should be enough to inspire you to read the full paper.

Thanks,
- Paul

[1] https://research.google/pubs/pub47224/
[2] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf
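
P.S. For anyone not familiar with the "Volcano-like iterator" mentioned above, here is a minimal sketch of the idea in Python. It is only an illustration of the open/next/close operator pattern: the class names (Scan, Filter, Project) and the run() helper are made up for this example and are not taken from F1 or from Drill's code. It also passes one row at a time, whereas a RecordBatch-style iterator would pass batches.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Row = Dict[str, object]


class Scan:
    """Leaf operator: reads rows from an in-memory source (hypothetical stand-in for storage)."""

    def __init__(self, rows: Iterable[Row]) -> None:
        self._rows = rows
        self._it: Optional[Iterator[Row]] = None

    def open(self) -> None:
        self._it = iter(self._rows)

    def next(self) -> Optional[Row]:
        # None signals end of stream.
        return next(self._it, None)

    def close(self) -> None:
        self._it = None


class Filter:
    """Pulls rows from its child and passes on only those matching the predicate."""

    def __init__(self, child, predicate: Callable[[Row], bool]) -> None:
        self._child = child
        self._predicate = predicate

    def open(self) -> None:
        self._child.open()

    def next(self) -> Optional[Row]:
        while True:
            row = self._child.next()
            if row is None or self._predicate(row):
                return row

    def close(self) -> None:
        self._child.close()


class Project:
    """Keeps only the named columns of each row pulled from its child."""

    def __init__(self, child, columns: List[str]) -> None:
        self._child = child
        self._columns = columns

    def open(self) -> None:
        self._child.open()

    def next(self) -> Optional[Row]:
        row = self._child.next()
        if row is None:
            return None
        return {c: row[c] for c in self._columns}

    def close(self) -> None:
        self._child.close()


def run(root) -> List[Row]:
    """Drives the operator tree from the root: open, pull until exhausted, close."""
    root.open()
    try:
        out: List[Row] = []
        row = root.next()
        while row is not None:
            out.append(row)
            row = root.next()
        return out
    finally:
        root.close()
```

A plan is built by nesting operators, e.g. run(Project(Filter(Scan(rows), lambda r: r["size"] > 100), ["id"])). Each call to next() on the root pulls exactly as much data from the children as it needs, which is the property that makes the model easy to compose.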
