Hi Hanu,

The problem with views as-is, even with casts, is that the casting comes too late to resolve the issues I highlighted in earlier messages. Ted's cast push-down idea causes the conversion to happen at read time so that we can, say, cast a string to an int, or cast a null to the proper type.
Today, if we use a cast, such as SELECT cast(a AS INT) FROM myTable, then we get a DAG that has three parts (to keep things simple):

* Scan the data, using types inferred from the data itself
* In a Filter operator, convert the type of the data to INT
* In Screen, return the result to the user

If the type is ambiguous in the file, then the first step above fails; the data never gets far enough for the Filter to kick in and apply the cast. Also, if a file contains a run of nulls, the scanner will choose Nullable Int, then fail when it finds, say, a string.

The key point is that cast push-down means the query will not fail due to dicey files: the cast resolves the ambiguity. If we push the cast down, then it is the SCAN operator that resolves the conflict and does the cast, avoiding the failures we've been discussing.

I like the idea you seem to be proposing: cascading views. Have a table view that cleans up each table. Then, these can be combined in higher-order views for specialized purposes.

The beauty of the cast push-down idea is that no metadata is needed other than the query. If the user wants metadata, they use existing views (which contain the casts and cause the cast push-down). This seems like such a simple, elegant solution that we could try it out quickly (if we get past the planner issues Aman mentioned).

In fact, the new scan operator code (done as part of the batch sizing work) already has a prototype mechanism for type hints. If a type hint is provided to the scanner, it uses the hint; otherwise it infers the type. We'd just hook up the cast push-down data to that prototype and we could try out the result quickly. (The new scan operator is still in my private branch, in case anyone goes looking for it...)

Some of your discussion talks about automatically inferring the schema. I really don't think we need to do that. The hint (cast push-down) is sufficient to resolve ambiguities in the existing scan-time schema inference.
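To make the failure mode and the fix concrete, here is a toy sketch in plain Python (invented names; nothing to do with Drill's actual scan operator code): without a hint, the scanner guesses Nullable Int for a leading run of nulls and dies on the first string; with a pushed-down cast, it converts each value as it reads.

```python
def scan(values, type_hint=None):
    """Toy scanner: with no hint it guesses Int for a leading run of
    nulls (as scan-time inference does), then fails on a schema change;
    with a pushed-down hint it casts each value as it reads."""
    col_type = type_hint if type_hint is not None else int  # all nulls => guess Int
    out = []
    for v in values:
        if v is None:
            out.append(None)
        elif type_hint is not None:
            out.append(col_type(v))  # the cast resolves the ambiguity at scan time
        elif isinstance(v, col_type):
            out.append(v)
        else:
            raise TypeError(f"schema change: got {type(v).__name__}, "
                            f"expected {col_type.__name__}")
    return out

dicey = [None, None, "3", "4"]  # a run of nulls, then strings

try:
    scan(dicey)  # inference chose Int for the nulls; the string kills the query
except TypeError as e:
    print("no push-down:", e)

print("with push-down:", scan(dicey, type_hint=int))  # [None, None, 3, 4]
```

Again, this is only an illustration of the idea; the real work is wiring the planner's cast information into the scan operator's type-hint prototype.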
The syntax trick would be to find a way to provide hints just for those columns that are issues. If I have a table with columns a, b, ... z, but only b is a problem, I don't want to have to do:

SELECT a, CAST(b AS INT), c, ... z FROM myTable

It would be great if we could just do:

SELECT *, CAST(b AS INT) FROM myTable

I realize the above has issues; the key idea is: provide casts only for the problem fields without spelling out all the fields. If we really want to get fancy, we can do UDF push-down for the complex cases you mentioned. Maybe:

SELECT *, CAST(b AS INT), parseCode(c) ...

We are diving into design here; maybe you can file a JIRA and we can shift the detailed design discussion to that JIRA. Salim already has one related to schema change errors, which was why the "Death" article caught my eye.

Thanks,
- Paul

On Friday, April 6, 2018, 4:59:40 PM PDT, Hanumath Rao Maduri <hanu....@gmail.com> wrote:

Hello,

Thanks to Ted & Paul for clarifying my questions. Sorry for not being clear in my previous post. When I said "create view" I was thinking of simple views where we currently use cast expressions to cast columns to types. In this case the planner can use that information to force the scans to use it as the schema. If the query fails, then it fails at the scan and not after the scanner has inferred the schema. I know that views can get complicated with joins and expressions. For schema hinting through views, I assume they should be created on single tables with the corresponding columns one wants to project from the table.

Regarding the same question, today we had a discussion with Aman. Here a view can be considered a "view" of the table with a schema in place. We can adjust the syntax to allow specifying a schema, something like this:

create schema[optional] view(/virtual table) v1 as (a: int, b: int) select a, b from t1

with some other rules as to conversion of scalar to complex types.
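To make the intent concrete, here is a toy Python sketch (invented names, not Drill code) of how a view's declared schema could act as a per-column hint map pushed to the scan: only the problem column carries a declared type, and every other column keeps the scanner's own inferred type.

```python
# Toy model: the view's declared schema is a per-column hint map that the
# scan consults before falling back to its own inference, so the user never
# has to spell out every column a..z just to fix one of them.

def resolve_types(inferred_schema, view_schema):
    """Merge scanner-inferred types with the view's declared types;
    a declared type (i.e. a pushed-down cast) always wins."""
    return {col: view_schema.get(col, typ)
            for col, typ in inferred_schema.items()}

inferred = {"a": str, "b": str, "c": float}  # what the scanner guessed
view_schema = {"b": int}                     # view declared only: b int

resolved = resolve_types(inferred, view_schema)
# b is now int; a and c keep their inferred types
print(resolved["b"] is int, resolved["a"] is str)  # prints: True True
```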
Then queries on this view (below) should enable the scanner to use this type information and use it to convert the data into the appropriate types:

select * from v1

For the case where the schema is not known to the user, maybe use something like this:

create schema[optional] view(/virtual table) v1 as select a, b from t1 infer schema

This view, when used to query the table, should trigger the logic of inferring and consolidating the schema and attaching the inferred schema to the view. In the future when we use the same view, we should use the inferred schema. This view can be either a local view pertaining to the session or a global view so that other queries across sessions can use it.

By default we can apply certain rules, such as converting simple scalar values to other scalar values (like int to double, etc.). But we should also be able to give the customer the option to enable rules such as scalar int to array[int] when creating the view itself.

Thanks,
-Hanu

On Fri, Apr 6, 2018 at 3:10 PM, Paul Rogers <par0...@yahoo.com.invalid> wrote:

> Ted, this is why your participation in Drill is such a gift: cast
> push-down is an elegant, simple solution that even works in views.
> Beautiful.
>
> Thanks,
> - Paul
>
> On Friday, April 6, 2018, 11:35:37 AM PDT, Ted Dunning <
> ted.dunn...@gmail.com> wrote:
>
> On Thu, Apr 5, 2018 at 9:43 PM, Paul Rogers <par0...@yahoo.com.invalid>
> wrote:
>
> > Great discussion. Really appreciate the insight from the Drill users!
> >
> > To Ted's points: the simplest possible solution is to allow a table
> > function to express types. Just making stuff up:
> >
> > SELECT a FROM schema(myTable, (a: INT))
>
> Why not just allow cast to be pushed down to the reader? Why invent new
> language features?
>
> > Or, really ugly, a session option:
> >
> > ALTER SESSION SET schema.myTable="a: INT"
>
> These are a big problem.
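Hanu's default-rules idea above (widen int to double by default; promote scalar to array[int] only when the view opts in) could be sketched like this — a purely illustrative Python sketch under my own assumptions, not a spec:

```python
# Sketch of schema consolidation across files: widen scalars by default
# (int + double -> double); promote a scalar to an array only when the
# view was created with that rule enabled; otherwise report a conflict.

WIDEN = {(int, float): float, (float, int): float}  # default widening rules

def consolidate(t1, t2, allow_scalar_to_array=False):
    """Pick one column type given two types seen across files."""
    if t1 == t2:
        return t1
    if (t1, t2) in WIDEN:
        return WIDEN[(t1, t2)]
    if allow_scalar_to_array and list in (t1, t2):
        return list  # opt-in rule: scalar int vs. array[int] -> array[int]
    raise TypeError(f"cannot consolidate {t1.__name__} and {t2.__name__}")

print(consolidate(int, float).__name__)                             # float
print(consolidate(int, list, allow_scalar_to_array=True).__name__)  # list
```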