Definitely going to read the paper.  Thanks for sending!

> On Jan 24, 2020, at 6:47 PM, Paul Rogers <[email protected]> wrote:
> 
> Hi All,
> 
> The Google F1 query engine [1] has many features similar to Drill and 
> suggests many ideas that we might adopt. I believe F1 has been mentioned here 
> before; the paper is worth a read if you've not yet done so. Like Drill, F1 
> derives from the original Dremel work [2] done at Google. Just as Drill 1.0 
> was inspired by Dremel, perhaps Drill 2.0 can be inspired by F1 Query. Some 
> highlights below.
> 
> Areas where F1 Query is similar to Drill:
> 
> * "F1 Query is a stand-alone, federated query processing platform that 
> executes SQL queries against data stored in different file-based formats as 
> well as different storage systems at Google (e.g., Bigtable, Spanner, Google 
> Spreadsheets, etc.)."
> 
> * "The data processing and analysis use cases in large organizations like 
> Google exhibit diverse requirements in data sizes, latency, data sources and 
> sinks, freshness, and the need for custom business logic. ... F1 Query, [is] 
> an SQL query engine that is unique not because of its focus on doing one 
> thing well, but instead because it aims to cover all corners of the 
> requirements space for enterprise data processing and analysis."
> 
> * "F1 Query decouples database storage from query processing, and as a 
> result, it can serve as an engine for all data in the datacenter"
> 
> * The entire planner framework is similar to Drill (though not based on 
> Calcite) and execution uses a Volcano-like iterator similar to Drill's 
> RecordBatch iterator.
> 
> Of course, F1 operates at a scale just a bit larger than Drill: "F1 Query is 
> highly decentralized and replicated over multiple datacenters, using hundreds 
> to thousands of machines at each datacenter."
> 
> Areas where Drill can learn from F1 Query:
> 
> * "F1 Query runs as a stand-alone, federated query processing platform to 
> execute declarative queries against data stored in different file-based 
> formats as well as different remote storage systems ... enabling use cases 
> large and small, with simple or highly customized business logic, and across 
> whichever data sources the data resides in." This is similar to Drill, but 
> F1's business logic support appears better.
> 
> * "In F1 Query, short queries are executed on a single node, while larger 
> queries are executed in a low-overhead distributed execution mode with no 
> checkpointing and limited reliability guarantees. The largest queries are run 
> in a reliable batch-oriented execution mode that uses the MapReduce 
> framework." (Drill has the middle bits; the single-node and delegate-to-MR 
> (Spark?) modes are interesting extensions.)
> 
> * "F1 Query abstracts away the details of each storage type. It makes all 
> data appear as if it is stored in relational tables (with rich structured 
> data types in the form of Protocol Buffers)." (Subtle point: F1 supports any 
> data source, but requires that each provide a schema.)
> 
> * "F1 Query is extensible in various ways: it supports custom data sources as 
> well as user defined scalar functions (UDFs), aggregation functions (UDAs), 
> and table-valued functions (TVFs). User defined functions can use any type of 
> data as input and output, including Protocol Buffers. Clients may express 
> user-defined logic in SQL syntax, providing them with a simple way of 
> abstracting common concepts from their queries and making them more readable 
> and maintainable. They may also use ... scripts to define additional 
> functions for ad-hoc queries and analysis [including] compiled and managed 
> languages like ... Java". F1 has an interesting twist on the UDF function 
> registry.
> 
> * "Querying protocol buffers presents many of the same challenges as 
> semi-structured data formats like ... JSON ... Where JSON is entirely 
> dynamically typed and often stored in human readable format, protocol buffers 
> are statically typed and typically stored in a compact binary format, 
> enabling much more efficient decoding. ... The exact structure and types of 
> all protos referenced in a query are known at query planning time." That is, 
> F1 does not waste time with ill-defined, loosey-goosey JSON. However, the 
> user can define an up-front schema when reading CSV (and presumably JSON).
> 
> That should be enough to inspire you to read the full paper.
> 
> Thanks,
> - Paul
> 
> [1] https://research.google/pubs/pub47224/
> [2] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf
