Definitely going to read the paper. Thanks for sending!
> On Jan 24, 2020, at 6:47 PM, Paul Rogers <[email protected]> wrote:
>
> Hi All,
>
> The Google F1 query engine [1] has many features similar to Drill and
> suggests many ideas that we might adopt. I believe F1 has been mentioned here
> before; the paper is worth a read if you've not yet done so. Like Drill, F1
> derives from the original Dremel work [2] done at Google. Just as Drill 1.0
> was inspired by Dremel, perhaps Drill 2.0 can be inspired by F1 Query. Some
> highlights below.
>
> Areas where F1 Query is similar to Drill:
>
> * "F1 Query is a stand-alone, federated query processing platform that
> executes SQL queries against data stored in different file-based formats as
> well as different storage systems at Google (e.g., Bigtable, Spanner, Google
> Spreadsheets, etc.)."
>
> * "The data processing and analysis use cases in large organizations like
> Google exhibit diverse requirements in data sizes, latency, data sources and
> sinks, freshness, and the need for custom business logic. ... F1 Query, [is]
> an SQL query engine that is unique not because of its focus on doing one
> thing well, but instead because it aims to cover all corners of the
> requirements space for enterprise data processing and analysis."
>
> * "F1 Query decouples database storage from query processing, and as a
> result, it can serve as an engine for all data in the datacenter"
>
> * The entire planner framework is similar to Drill (though not based on
> Calcite) and execution uses a Volcano-like iterator similar to Drill's
> RecordBatch iterator.
>
> Of course, F1 operates at a scale just a bit larger than Drill: "F1 Query is
> highly decentralized and replicated over multiple datacenters, using hundreds
> to thousands of machines at each datacenter."
>
> Areas where Drill can learn from F1 Query:
>
> * "F1 Query runs as a stand-alone, federated query processing platform to
> execute declarative queries against data stored in different file-based
> formats as well as different remote storage systems ... enabling use cases
> large and small, with simple or highly customized business logic, and across
> whichever data sources the data resides in". Similar to Drill, but the
> business logic support appears better.
>
> * "In F1 Query, short queries are executed on a single node, while larger
> queries are executed in a low-overhead distributed execution mode with no
> checkpointing and limited reliability guarantees. The largest queries are run
> in a reliable batch-oriented execution mode that uses the MapReduce
> framework". (Drill has the middle bits; the single-node and delegate-to-MR
> (Spark?) modes are interesting extensions.)
>
> * "F1 Query abstracts away the details of each storage type. It makes all
> data appear as if it is stored in relational tables (with rich structured
> data types in the form of Protocol Buffers)". (Subtle point: F1 supports any
> data source, but requires that each provide a schema.)
>
> * "F1 Query is extensible in various ways: it supports custom data sources as
> well as user defined scalar functions (UDFs), aggregation functions (UDAs),
> and table-valued functions (TVFs). User defined functions can use any type of
> data as input and output, including Protocol Buffers. Clients may express
> user-defined logic in SQL syntax, providing them with a simple way of
> abstracting common concepts from their queries and making them more readable
> and maintainable. They may also use ... scripts to define additional
> functions for ad-hoc queries and analysis [including] compiled and managed
> languages like ... Java". F1 has an interesting twist on the UDF function
> registry.
>
> * "Querying protocol buffers presents many of the same challenges as
> semi-structured data formats like ... JSON ... Where JSON is entirely
> dynamically typed and often stored in human readable format, protocol buffers
> are statically typed and typically stored in a compact binary format,
> enabling much more efficient decoding. ... The exact structure and types of
> all protos referenced in a query are known at query planning time". That is,
> F1 does not waste time with ill-defined, loosey-goosey JSON. However, the user
> can define an up-front schema when reading CSV (and presumably JSON).
>
> That should be enough to inspire you to read the full paper.
>
> Thanks,
> - Paul
>
> [1] https://research.google/pubs/pub47224/
> [2] https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf