On Sep 18, 2012, at 4:09 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Excellent.
> 
> I think that the key next steps are:
> 
> 1) codify some API's (Jason and Julian are on this)
> 
> 2) build out the "hello world" app using the available parser and Optiq and
> a trivial data model (#1 is not a blocker for this)
> 
> 3) evaluate where we are implications are relative to nested data and
> columnar formats

It would be really useful to establish a list of queries. Minimal, but 
demonstrate the key features (e.g. "COUNT(a.b) WITHIN RECORD", the ability to 
aggregate over nested collections.

> Once we have that, we should be able to open up the parallelism of
> development massively.
> 
> Some open questions/tasks that many people may be able to determine include:
> 
> - is it possible to change Optiq operators and optimization rules without a
> recompile?

Yes. (From Optiq's perspective, anyway, it is very straightforward. Optiq 
starts from scratch each time a query is prepared, and takes whatever metadata 
sources, operators and rules it is given.)

> - what has to happen to make Optiq handle nested data?

I am still mulling the answer to that question. The list of queries I mentioned 
above will help me understand what exactly is "nested data" and the operations 
on it.

I have been using http://developers.google.com/bigquery/docs/query-reference as 
my main resource for BigQuery/DrQL. Let me know if there is a more 
authoritative source.

For the record (I'll only make this whine once, I promise) the BigQuery 
language has some usability flaws, particularly inconsistencies with SQL. Some 
of these may have been deliberate (for instance, to try to make the language 
more concise for the user to type), or may have been made out of ignorance for 
SQL.

* The paper has an expression like "COUNT(x > 5)". SQL's COUNT operator counts 
not-NULL values, not boolean TRUE values, so in SQL this expression would not 
have the desired effect

* My jaw dropped when I read "Unlike many other SQL-based systems, BigQuery 
uses the comma syntax to indicate table unions, not joins". I see how commas 
are a convenient abbreviation for UNION, but that's like redefining "if" in a 
programming language.

* "Note that the HAVING clause can only refer to fields defined in your SELECT 
clause (if the field has an alias, you must use it; if it doesn't, use the 
aggregate field name instead)." This kind of design decision makes the language 
difficult to generate for.

* Apparently string literals can be enclosed in either single or double quotes. 
This may be for the convenience of the user, but is at odds with SQL, which 
allows only single quotes. (Many SQL dialects use double quotes to quote 
identifiers that contain mixed case or spaces.)

* BigQuery allows table names to be prefixed with an optional "projectname:".

* The syntax for accessing nested collections at depth 2 or more. If "a" is a 
table alias, and b is a collection-valued field, then "a.b" is the collection 
of sub-records, and "a.b.c" is the union of the collections of the "c" field of 
all records in the "a.b" collection. In 'a.b.c', the first '.' is the 
conventional operator that accesses a field of a record. But the second '.' is 
a strange beast that operates on a set of records.

Most of these I offer as evidence that DrQL is not a superset of SQL. I suppose 
we can vive la difference, implement separate parsers/validators for the two 
languages, including implementing DrQL "features" we think are ill-advised.

But I was looking to DrQL to find out how to query nested collections, and I 
didn't find much depth. How, for instance, to convert a "dept" relation with 
nested "emp" records into a "flat" relation? Or given an 
"shipment.order.lineitem" nested relation, sum up lineitem.discount only in 
orders that have "order.prepaid=true"? Are nested collections ordered?

SQL's support for nested collections is powerful & consistent. (But not very 
concise, or intuitive for the non-expert.) It has operators like UNNEST that 
convert a nested collection into a relation (note that collections and 
relations are different beasts), and operators such as CARDINALITY to aggregate 
them. (My friend John Sichi wrote a great overview here: 
http://farrago.sourceforge.net/design/CollectionTypes.html.)

To be clear, I am not advocating adding SQL's collection constructs to Drill's 
query language. But I am drawing inspiration from them when designing how Optiq 
would represent collections, because a query planner needs semantics to be 
very, very precise. I can see how most of DrQL's constructs would map onto 
SQL's constructs.

I'll have a better answer to the "what has to happen to make Optiq handle 
nested data" question in a day or two.

Julian

Reply via email to