At 4:01 AM +0100 7/21/07, Matt S Trout wrote:
This is the most coherent message to show the lines along which we were
thinking (hdp is confound on #dbix-class).

Those of you who don't have context for this, poke the list archives.

Those of you who do, please bear in mind when responding what anything
involving the creation of assloads of objects will be laughed at and/or
ignored since the intention is SQL::Abstract-level performance.
<snippage>

Since its already nearly 2am here, I'll just put out a few of my ideas and leave the rest for later.

1. I'm assuming your comment about objects means that you don't want to have, eg, an object for each scalar value or entity name or expression tree node or whatever. Sure, that's fine, and should be good for speed, and perhaps brevity, but taking your examples as a point of departure, we'll probably want to add at least one more element to each of the many array refs defining expression nodes so to provide meta-data about the node, such as a replacement for the meta-data that the name of the class an object is blessed into provides.

2. As far as I recall from a discussion we had on IRC, the new AST we are defining here is supposed to work not just with SQL databases but also databases accessed via some other language, such as LDAP, or my Muldis D. Partly for this reason, and partly just because SQL DBMSs differ from themselves enough that talking to them is like multiple languages, I believe that our AST should not conceptually be limited by some SQL lowest common denominator, and it should not simply try to mirror the structure of a simple select query.

3. Don't go lowest common denominator. If our AST is good, glue code that talks to a less capable back-end should be able to break down what the AST says into smaller chunks that the back-end understands, and feed them appropriately so that the back-end still does the right thing, and appropriately gather the results and return them as if the back-end was able to do that natively.

4. As should be the nature of ASTs, the focus of ours should be to accurately representing the *semantics* or meaning of what the user wants. The AST should provide the means to explicitly say what the desired behaviour is for particular constructs, any time there is a reasonable chance that either different backends have different defaults in that regard, or users are likely to have different default expectations. Of course, our AST can have various default behaviours defined for it such that users don't have to be explicit about some details if their desires match the defaults, but we still need to specify it in the design docs of our AST itself, and not just leave a lot of things to be back end implementation defined.

5. Our AST should be strongly typed from end to end, which assists in semantics. Any piece of data that it carries should know whether it is text or a number or whatever. That way, if we have '0124', we know how round-tripping it through a database would retain the leading zero or not. Matters of case-sensitivity need to be defined and not left to back-end defaults. That's not to say that we can't have generic types, as per Perl scalars, but these should be defined over stronger types, such that eg every value is of a certain stronger type, but a particular variable is allowed to hold values of any of several types.

6. We need to define our own full set of system-defined types and operators, which users of our AST invoke, and which back-ends gluing our AST then convert into or emulate native equivalents. Moreover, I recommend that our names for all such things are spelled with just letters, eg use 'equal' and 'not_equal' rather than '==' or '!=' etc.

7. It is essential to have the distinct concept of a logical boolean data type, and values, and operators. This is the result type of equality tests or and|or etc.

8. The most important distinct simple data types are: boolean, integer, bit string (blob), character string (text); then other numerics, then temporal types, then whatever such as spatial types if we want them.

9. The AST should support the concept of having collection-typed values, so that eg we can have table field values that are themselves eg tables, rows, arrays, etc. Nevermind whether the back-end DBMS can do this, some can, some can't, where they don't, we can fake it by splitting tables behind the scenes. If we have native support like this, it should be easy to, say, formulate a query over eg a one-to-many table relationship that returns both a parent record and its child records, in a single result set, without duplication; the result set eg has one row per parent record, and one field of that row is table-typed and contains the child records.

10. Tables should always contain, and queries should always return, no duplicate rows, if not always than at least by default. Users should have to explicitly say if they want duplicates, and if not then every row will be distinct. This is what most people want anyway, and doing it by default will significantly reduce bugs in user code that crop up due to duplicates being present.

11. Any operator that is conceptually N-ary should simply be defined to take N similar arguments, that argument being an array ref or something. Similarly, 'and' and 'or' should be ordinary N-ary boolean operators, as are string concatenation, and numerical addition and multiplication. For that matter, relational union, intersection, and natural join are all N-ary as well. All the examples are commutative, save concatenation, and all are associative. Put another way, any N-ary operator is a "reduce operator", iterating over a list to produce one result.

12. Our AST should be setup to only allow column names in a rowset to be distinct. If eg 2 tables are joined that have common column names, then if those 2 columns represent the same data and are redundant following the join, then eliminate one, or else if they don't rep the same data, rename one (SQL has 'as' for a reason).

13. Columns should be referred to by name only, not by any ordinal position. When specifying a relational union, the column names of both operands need to be the same, and columns will match up on common names.

14. Relational joins should all be natural joins, such that given 2 rowset/table operands, the join should simply match them up on columns of the same name (if necessary, columns of the operands can be renamed first to either be the same or different as needed). Doing it this way lets an N-table join be commutative, and the result won't have any duplicate columns. Its also easier to specify since, aside from possible column renaming of the operands, you don't need to specify join conditions to do a join. And if the db schema is well designed in the first place, you often won't have to rename columns either when joining them, or not often.

15. There should be distinct entity name spaces for system-defined types and operators, and user-defined ones. Eg, have either a tag or a name prefix on eg every operator call to specify. So then, our AST can specify invocation of stored procedures or functions et al the same way it specifies using other operators. Eg, 'sys.Int.add' versus 'user.bar_schema.foo_proc'. Doing it this way, there's no concern about reserved words.

16. The AST should treat a query as an arbitrary depth self-similar expression tree, where both scalar and relational operators can be called in any place. In SQL terms, the AST should embrace derived tables or subqueries or whatever. Don't leave these out just because some backends don't have them; we can fake it there if we have to.

17. The AST should just use such as the various simpler relational algebra or calculus operators rather than monolithic 'select'. For example, each of these is done using a separate operator, calls to which can be chained: selecting a subset of columns, filtering rows, joining rowsets, unioning rowsets, attaching new columns, grouping rows, summarizing rows, sorting rows, etc.

So that's probably a good start. I can suggest specific alterations to the example syntax if that is useful or people can't get what I'm saying above without such examples.

Meanwhile, I highly recommend looking at http://search.cpan.org/dist/Muldis-DB/lib/Muldis/DB/Language/Core.pod , which is the currently defined list of core system-defined operators (and data types) of Muldis D. This may give you some good ideas for what specific operators you want to have built-in to our AST definition. Note that if you don't understand some of my terminology, or want some context, you may want to read http://search.cpan.org/dist/Muldis-DB/lib/Muldis/DB/Language.pod first, at least the NOTES ON TERMINOLOGY section. For example, I say 'relation value' rather than 'rowset' and 'relation variable' rather than 'table'.

P.S.  And now it's 3am here.

-- Darren Duncan

_______________________________________________
List: http://lists.rawmode.org/cgi-bin/mailman/listinfo/dbix-class
Wiki: http://dbix-class.shadowcatsystems.co.uk/
IRC: irc.perl.org#dbix-class
SVN: http://dev.catalyst.perl.org/repos/bast/trunk/DBIx-Class/
Searchable Archive: http://www.mail-archive.com/dbix-class@lists.rawmode.org/

Reply via email to