Julian, I'm not sure if this will help, but there are several example queries against a nested dataset here (it's the m-lab dataset): https://developers.google.com/bigquery/docs/dataset-mlab. There are also a few examples in the Experiments section of the Dremel paper, but I'm sure you've already seen those.
Cheers, Ben On 9/19/12 1:33 PM, "Julian Hyde" <julianh...@gmail.com> wrote: >On Sep 18, 2012, at 4:09 PM, Ted Dunning <ted.dunn...@gmail.com> wrote: > >> Excellent. >> >> I think that the key next steps are: >> >> 1) codify some API's (Jason and Julian are on this) >> >> 2) build out the "hello world" app using the available parser and Optiq >>and >> a trivial data model (#1 is not a blocker for this) >> >> 3) evaluate where we are implications are relative to nested data and >> columnar formats > >It would be really useful to establish a list of queries. Minimal, but >demonstrate the key features (e.g. "COUNT(a.b) WITHIN RECORD", the >ability to aggregate over nested collections. > >> Once we have that, we should be able to open up the parallelism of >> development massively. >> >> Some open questions/tasks that many people may be able to determine >>include: >> >> - is it possible to change Optiq operators and optimization rules >>without a >> recompile? > >Yes. (From Optiq's perspective, anyway, it is very straightforward. Optiq >starts from scratch each time a query is prepared, and takes whatever >metadata sources, operators and rules it is given.) > >> - what has to happen to make Optiq handle nested data? > >I am still mulling the answer to that question. The list of queries I >mentioned above will help me understand what exactly is "nested data" and >the operations on it. > >I have been using >http://developers.google.com/bigquery/docs/query-reference as my main >resource for BigQuery/DrQL. Let me know if there is a more authoritative >source. > >For the record (I'll only make this whine once, I promise) the BigQuery >language has some usability flaws, particularly inconsistencies with SQL. >Some of these may have been deliberate (for instance, to try to make the >language more concise for the user to type), or may have been made out of >ignorance for SQL. > >* The paper has an expression like "COUNT(x > 5)". SQL's COUNT operator >counts not-NULL values, not boolean TRUE values, so in SQL this >expression would not have the desired effect > >* My jaw dropped when I read "Unlike many other SQL-based systems, >BigQuery uses the comma syntax to indicate table unions, not joins". I >see how commas are a convenient abbreviation for UNION, but that's like >redefining "if" in a programming language. > >* "Note that the HAVING clause can only refer to fields defined in your >SELECT clause (if the field has an alias, you must use it; if it doesn't, >use the aggregate field name instead)." This kind of design decision >makes the language difficult to generate for. > >* Apparently string literals can be enclosed in either single or double >quotes. This may be for the convenience of the user, but is at odds with >SQL, which allows only single quotes. (Many SQL dialects use double >quotes to quote identifiers that contain mixed case or spaces.) > >* BigQuery allows table names to be prefixed with an optional >"projectname:". > >* The syntax for accessing nested collections at depth 2 or more. If "a" >is a table alias, and b is a collection-valued field, then "a.b" is the >collection of sub-records, and "a.b.c" is the union of the collections of >the "c" field of all records in the "a.b" collection. In 'a.b.c', the >first '.' is the conventional operator that accesses a field of a record. >But the second '.' is a strange beast that operates on a set of records. > >Most of these I offer as evidence that DrQL is not a superset of SQL. I >suppose we can vive la difference, implement separate parsers/validators >for the two languages, including implementing DrQL "features" we think >are ill-advised. > >But I was looking to DrQL to find out how to query nested collections, >and I didn't find much depth. How, for instance, to convert a "dept" >relation with nested "emp" records into a "flat" relation? Or given an >"shipment.order.lineitem" nested relation, sum up lineitem.discount only >in orders that have "order.prepaid=true"? Are nested collections ordered? > >SQL's support for nested collections is powerful & consistent. (But not >very concise, or intuitive for the non-expert.) It has operators like >UNNEST that convert a nested collection into a relation (note that >collections and relations are different beasts), and operators such as >CARDINALITY to aggregate them. (My friend John Sichi wrote a great >overview here: >http://farrago.sourceforge.net/design/CollectionTypes.html.) > >To be clear, I am not advocating adding SQL's collection constructs to >Drill's query language. But I am drawing inspiration from them when >designing how Optiq would represent collections, because a query planner >needs semantics to be very, very precise. I can see how most of DrQL's >constructs would map onto SQL's constructs. > >I'll have a better answer to the "what has to happen to make Optiq handle >nested data" question in a day or two. > >Julian