Google F1 Query

Paul Rogers Fri, 24 Jan 2020 15:48:24 -0800

Hi All,

The Google F1 query engine [1] has many features similar to Drill and suggests 
many ideas that we might adopt. I believe F1 has been mentioned here before; 
the paper is worth a read if you've not yet done so. Like Drill, F1 derives 
from the original Dremel work [2] done at Google. Just as Drill 1.0 was 
inspired by Dremel, perhaps Drill 2.0 can be inspired by F1 Query. Some 
highlights below.


Areas where F1 Query is similar to Drill:

* "F1 Query is a stand-alone, federated query processing platform that executes 
SQL queries against data stored in different file-based formats as well as 
different storage systems at Google (e.g., Bigtable, Spanner, Google 
Spreadsheets, etc.)."

* "The data processing and analysis use cases in large organizations like 
Google exhibit diverse requirements in data sizes, latency, data sources and 
sinks, freshness, and the need for custom business logic. ... F1 Query, [is] an 
SQL query engine that is unique not because of its focus on doing one thing 
well, but instead because it aims to cover all corners of the requirements 
space for enterprise data processing and analysis."

* "F1 Query decouples database storage from query processing, and as a result, 
it can serve as an engine for all data in the datacenter."

* The entire planner framework is similar to Drill (though not based on 
Calcite) and execution uses a Volcano-like iterator similar to Drill's 
RecordBatch iterator.
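
For anyone not familiar with the Volcano model mentioned in the last bullet, here is a minimal sketch in Python. It is neither Drill's RecordBatch API nor anything from the F1 paper: the class names (Operator, Scan, Filter), the run() helper and the batch-as-list-of-dicts representation are all made up for illustration. The idea is only that each operator pulls batches from its child via open/next/close.

```python
from typing import Callable, List, Optional

Batch = List[dict]  # hypothetical batch representation: a list of row dicts


class Operator:
    """Volcano-style operator: the parent pulls batches via next()."""

    def open(self) -> None:
        pass

    def next(self) -> Optional[Batch]:
        raise NotImplementedError

    def close(self) -> None:
        pass


class Scan(Operator):
    """Leaf operator that hands out in-memory rows in fixed-size batches."""

    def __init__(self, rows: List[dict], batch_size: int):
        self.rows = rows
        self.batch_size = batch_size
        self.pos = 0

    def open(self) -> None:
        self.pos = 0

    def next(self) -> Optional[Batch]:
        if self.pos >= len(self.rows):
            return None  # end of data
        batch = self.rows[self.pos:self.pos + self.batch_size]
        self.pos += self.batch_size
        return batch


class Filter(Operator):
    """Pulls batches from its child and keeps rows matching the predicate."""

    def __init__(self, child: Operator, predicate: Callable[[dict], bool]):
        self.child = child
        self.predicate = predicate

    def open(self) -> None:
        self.child.open()

    def next(self) -> Optional[Batch]:
        while True:
            batch = self.child.next()
            if batch is None:
                return None
            kept = [row for row in batch if self.predicate(row)]
            if kept:  # skip batches that were filtered out entirely
                return kept

    def close(self) -> None:
        self.child.close()


def run(root: Operator) -> List[dict]:
    """Drive the operator tree from the root until it is exhausted."""
    root.open()
    out: List[dict] = []
    try:
        while (batch := root.next()) is not None:
            out.extend(batch)
    finally:
        root.close()
    return out
```

A plan is then just a tree of operators, e.g. run(Filter(Scan(rows, 3), lambda r: r["n"] % 2 == 0)), with the root pulling data up from the leaves one batch at a time.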

Of course, F1 operates at a scale just a bit larger than Drill: "F1 Query is 
highly decentralized and replicated over multiple datacenters, using hundreds 
to thousands of machines at each datacenter."

Areas where Drill can learn from F1 Query:

* "F1 Query runs as a stand-alone, federated query processing platform to 
execute declarative queries against data stored in different file-based formats 
as well as different remote storage systems ... enabling use cases large and 
small, with simple or highly customized business logic, and across whichever 
data sources the data resides in." This is similar to Drill, but F1's support 
for business logic appears better.

* "In F1 Query, short queries are executed on a single node, while larger 
queries are executed in a low-overhead distributed execution mode with no 
checkpointing and limited reliability guarantees. The largest queries are run 
in a reliable batch-oriented execution mode that uses the MapReduce framework." 
(Drill has the middle mode; the single-node and delegate-to-MR (Spark?) modes 
are interesting extensions.)

* "F1 Query abstracts away the details of each storage type. It makes all data 
appear as if it is stored in relational tables (with rich structured data types 
in the form of Protocol Buffers)." (Subtle point: F1 supports any data source, 
but requires that each provide a schema.)

* "F1 Query is extensible in various ways: it supports custom data sources as 
well as user defined scalar functions (UDFs), aggregation functions (UDAs), and 
table-valued functions (TVFs). User defined functions can use any type of data 
as input and output, including Protocol Buffers. Clients may express 
user-defined logic in SQL syntax, providing them with a simple way of 
abstracting common concepts from their queries and making them more readable 
and maintainable. They may also use ... scripts to define additional functions 
for ad-hoc queries and analysis [including] compiled and managed languages like 
... Java". F1 has an interesting twist on the UDF function registry.

* "Querying protocol buffers presents many of the same challenges as 
semi-structured data formats like ... JSON ... Where JSON is entirely 
dynamically typed and often stored in human readable format, protocol buffers 
are statically typed and typically stored in a compact binary format, enabling 
much more efficient decoding. ... The exact structure and types of all protos 
referenced in a query are known at query planning time." That is, F1 does not 
waste time with ill-defined, loosey-goosey JSON. However, the user can define 
an up-front schema when reading CSV (and presumably JSON).
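
To make the "schema known at planning time" point in the last bullet concrete, here is a small Python sketch of reading CSV against a schema declared up front. This is an illustration under my own assumptions, not how F1 or Drill does it: the function names (plan_csv_scan, read_csv) and the schema format (a list of column-name / converter pairs) are hypothetical. The point is that column resolution happens once, before any rows are read, so a bad schema fails at "plan" time rather than partway through the scan.

```python
import csv
import io


def plan_csv_scan(schema, header):
    """'Plan time': resolve each declared column to a position in the file.

    schema is a list of (column_name, converter) pairs (hypothetical format).
    Raises ValueError if a declared column is absent from the header.
    """
    missing = [name for name, _ in schema if name not in header]
    if missing:
        raise ValueError(f"columns not found in CSV header: {missing}")
    return [(header.index(name), name, conv) for name, conv in schema]


def read_csv(text, schema):
    """'Run time': yield typed rows using the pre-resolved column plan."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    plan = plan_csv_scan(schema, header)
    for row in reader:
        # Each value is converted with the type declared in the schema.
        yield {name: conv(row[i]) for i, name, conv in plan}
```

With a schema such as [("id", int), ("price", float)], every row comes back with typed values, and a schema that names a column the file does not have is rejected before the first data row is touched.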

That should be enough to inspire you to read the full paper.

Thanks,
- Paul

[1] https://research.google/pubs/pub47224/

[2] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf
