Re: "Death of Schema-on-Read"

Paul Rogers Fri, 06 Apr 2018 18:53:13 -0700

Hi Hanu,

The problem with views as is, even with casts, is that the casting comes too 
late to resolve he issues I highlighted in earlier messages. Ted's cast 
push-down idea causes the conversion to happen at read time so that we can, 
say, cast a string to an int, or cast a null to the proper type.

Today, if we use a cast, such as SELECT cast(a AS INT) FROM myTable then we get 
a DAG that has tree parts (to keep things simple):

* Scan the data, using types inferred from the data itself
* In a Filter operator, convert the type of data to INT
* In Screen, return the result to the user

If the type is ambiguous in the file, then the first step above fails; data 
never gets far enough for the Filter to kick in and apply the cast. Also, if a 
file contains a run of nulls, the scanner will choose Nullable Int, then fail 
when it finds, say, a string.

The key point is that the cast push-down means that the query will not fail due 
to dicey files: the cast resolves the ambiguity. If we push the cast down, then 
it is the SCAN operator that resolves the conflict and does the cast; avoiding 
the failures we've been discussing.

I like the idea you seem to be proposing: cascading views. Have a table view 
that cleans up each table. Then, these can be combined in higher-order views 
for specialized purposes.

The beauty of the cast push-down idea is that no metadata is needed other than 
the query. If the user wants metadata, they use existing views (that contain 
the casts and cause the cast push-down.)

This seems like such a simple, elegant solution that we could try it out 
quickly (if we get past the planner issues Aman mentioned.) In fact, the new 
scan operator code (done as part of the batch sizing work) already has a 
prototype mechanism for type hints. If the type hint is provided to the 
scanner, it uses them, otherwise it infers the type. We'd just hook up the cast 
push down data to that prototype and we could try out the result quickly. (The 
new scan operator is still in my private branch, in case anyone goes looking 
for it...)

Some of your discussion talks about automatically inferring the schema. I 
really don't think we need to do that. The hint (cast push-down) is sufficient 
to resolve ambiguities in the existing scan-time schema inference.

The syntax trick would be to find a way to provide hints just for those columns 
that are issues. If I have a table with columns a, b, ... z, but only b is a 
problem, I don't want to have to do:

SELECT a, CAST(b AS INT), c, ... z FROM myTable

Would be great if we could just do:

SELECT *, CAST(b AS INT) FROM myTable

I realize the above has issues; the key idea is: provide casts only for the 
problem fields without spelling out all fields.

If we really want to get fancy, we can do UDF push down for the complex cases 
you mentioned. Maybe:

SELECT *, CAST(b AS INT), parseCode(c) ...

We are diving into design here; maybe you can file a JIRA and we can shift 
detailed design discussion to that JIRA. Salim already has one related to 
schema change errors, which was why the "Death" article caught my eye.

Thanks,
- Paul

    On Friday, April 6, 2018, 4:59:40 PM PDT, Hanumath Rao Maduri 
<hanu....@gmail.com> wrote:  

 Hello,

Thanks for Ted & Paul for clarifying my questions.
Sorry for not being clear in my previous post, When I said create view I
was under the impression for simple views where we use cast expressions
currently to cast them to types. In this case planner can use this
information to force the scans to use this as the schema.

If the query fails then it fails at the scan and not after inferring the
schema by the scanner.

I know that views can get complicated with joins and expressions. For
schema hinting through views I assume they should be created on single
tables with corresponding columns one wants to project from the table.

Regarding the same question, today we had a discussion with Aman. Here view
can be considered as a "view" of the table with schema in place.

We can change some syntax to suite it for specifying schema. something like
this.

create schema[optional] view(/virtual table ) v1 as (a: int, b : int)
select a, b from t1 with some other rules as to conversion of scalar to
complex types.

Then the queries when used on this view (below) should enable the scanner
to use this type information and then use it to convert the data into the
appropriate types.
select * from v1

For the possibility of schema information not being known by the user, may
be use something like this.

create schema[optional] view(/virtual table) v1 as select a, b from t1
infer schema.

This view when used to query the table should trigger the logic of
inferring and consolidating the schema and attaching that inferred schema
to the view. In future when we use the same view, we should be using the
inferred schema. This view either can be local view pertaining to the
session or a global view so that other queries across sessions can use them.

By default we can apply certain rules such as converting simple scalar
values to other scalar values (like int to double etc). But we should be
also able to give option to the customer to enable rules such as scalar int
to array[int] when creating the view itself.

Thanks,
-Hanu

On Fri, Apr 6, 2018 at 3:10 PM, Paul Rogers <par0...@yahoo.com.invalid>
wrote:

> Ted, this is why your participation in Drill is such a gift: cast
> push-down is an elegant, simple solution that even works in views.
> Beautiful.
>
> Thanks,
> - Paul
>
>
>
>    On Friday, April 6, 2018, 11:35:37 AM PDT, Ted Dunning <
> ted.dunn...@gmail.com> wrote:
>
>  On Thu, Apr 5, 2018 at 9:43 PM, Paul Rogers <par0...@yahoo.com.invalid>
> wrote:
>
> > Great discussion. Really appreciate the insight from the Drill users!
> >
> > To Ted's points: the simplest possible solution is to allow a table
> > function to express types. Just making stuff up:
> >
> > SELECT a FROM schema(myTable, (a: INT))
> >
>
> Why not just allow cast to be pushed down to the reader?
>
> Why invent new language features?
>
> Or, really ugly, a session option:
> >
> > ALTER SESSION SET schema.myTable="a: INT"
> >
>
> These are a big problem.
>
>

Re: "Death of Schema-on-Read"

Reply via email to