I feel it's probably premature to call it the "death of schema-on-read" based on one application case. For a product I have been working on recently, one use case is an IoT-related application where data is sent from a variety of small devices (sensors, cameras, etc.). It would be a hard requirement to pre-define a schema upfront for each device before writing data into the system. Further, the value of the data decreases significantly over time; data from the last hours/days is far more important than data from weeks/months ago. It's unimaginable to wait weeks for a data cleaning/preparation job to run before users can query such data. In other words, for applications that require flexibility and time-sensitivity, schema-on-read provides a huge benefit compared with the traditional ETL-then-query approach.
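To make that concrete, with schema-on-read the raw device output can be queried the moment it lands, without defining any schema or running an ETL job first. Something roughly like the following works against raw JSON in Drill (the path and field names here are made up purely for illustration):

  -- Query raw sensor JSON directly; no schema is declared anywhere.
  SELECT t.device_id, t.reading.temperature AS temp_c, t.`timestamp`
  FROM dfs.`/data/iot/2018-04-03/readings.json` t
  WHERE t.reading.temperature > 30;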
Drill's schema-on-read is actually trying to solve a rather hard problem, in that we deal not only with relational types but also with nested types. In that sense, Drill is walking in uncharted territory where not many others are doing similar things. Dealing with undocumented/unstructured data is a big challenge. Although Drill's solution is not perfect, IMHO it is still a big step toward solving that problem. With that said, I agree with the points people raised earlier. In addition to schema-on-read, Drill has to do a better job of handling the traditional cases where the schema is known beforehand, by introducing a metastore/catalog, or by allowing users to declare a schema upfront (I probably would not call Drill "schema-forbidden"). The restart strategy also seems interesting as a way to handle failures caused by a missing schema or a schema change.

On Tue, Apr 3, 2018 at 10:01 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> Well, the restart strategy still works for your examples. And you only pay
> once. From them you look at the cached type information and used an upper
> bound data type as you read the data. Since it works to read the values in
> the right order, it is obviously possible to push down typing information
> even into the json reader.
>
> On Tue, Apr 3, 2018, 21:42 Paul Rogers <par0...@yahoo.com.invalid> wrote:
>
> > Subtle point. I can provide schema with Parquet, as you note. (Actually,
> > for Parquet, Drill is schema-required: I can't not provide a schema due
> > to the nature of Parquet...)
> >
> > But, I can't provide a schema for JSON, CSV, etc. The point is, Drill
> > forbids the user from providing a schema; only the file format itself can
> > provide the schema (or not, in the case of JSON). This is the very heart
> > of the problem.
> >
> > The root cause of our schema change exception is that vectors are,
> > indeed, strongly typed. But, file columns are not. Here is my favorite:
> >
> > {x: 10} {x: 10.1}
> >
> > Blam! Query fails because the vector is chosen as BigInt, then we
> > discover it really should have been Float8. (If the answer is: go back
> > and rebuild the vector with the new type, consider the case that 100K
> > records separate the two above so that the first batch is long gone by
> > the time we see the offending record. If only I could tell Drill to use
> > Float8 (or Decimal) up front...
> >
> > Views won't help here because the failure occurs before a view can kick
> > in. However, presumably, I could write a view to handle a different
> > classic case:
> >
> > myDir /
> >  |- File 1: {a: 10, b: "foo"}
> >  |- File 2: {a: 20}
> >
> > With query: SELECT a, b FROM myDir
> >
> > For File 2, Drill will guess that b is a Nullable Int, but it is really
> > VarChar. I think I could write clever SQL that says:
> >
> > If b is of type Nullable Int, return NULL cast to nullable VarChar, else
> > return b
> >
> > The irony is that I must write procedural code to declare a static
> > attribute of the data. Yet SQL is otherwise declarative: I state what I
> > want, not how to implement it.
> >
> > Life would be so much easier if I could just say, "trust me, when you
> > read column b, it is a VarChar."
> >
> > Thanks,
> > - Paul
> >
> > On Tuesday, April 3, 2018, 10:53:27 AM PDT, Ted Dunning <
> > ted.dunn...@gmail.com> wrote:
> >
> > I don't see why you say that Drill is schema-forbidden.
> >
> > The Parquet reader, for instance, makes strong use of the implied schema
> > to facilitate reading of typed data.
> > Likewise, the vectorized internal format is strongly typed and, as such,
> > uses schema information.
> >
> > Views are another way to communicate schema information.
> >
> > It is true that you can't, say, view comments on fields from the command
> > line. But I don't understand saying "schema-forbidden".
> >
> > On Tue, Apr 3, 2018 at 10:01 AM, Paul Rogers <par0...@yahoo.com.invalid>
> > wrote:
> >
> > > Here is another way to think about it. Today, Drill is
> > > "schema-forbidden": even if I know the schema, I can't communicate that
> > > to Drill; Drill must figure it out on its own, making the same mistakes
> > > every time on ambiguous schemas.
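For reference, a rough sketch of the "clever SQL" view Paul describes above for the missing-column case might look something like the one below. This is untested; it assumes Drill's typeof() function reports the guessed type ('INT', or 'NULL' for all-null values) for column b in File 2, and it can still hit the same batch-by-batch schema-change problem discussed in this thread, so treat it as an illustration of the workaround rather than a fix:

  -- Coerce column b to VarChar whether Drill guessed it as Nullable Int
  -- (File 2) or actually read it as VarChar (File 1). myDir is the
  -- directory from Paul's example; dfs.tmp is just a writable workspace.
  CREATE VIEW dfs.tmp.myDirView AS
  SELECT a,
         CASE WHEN typeof(b) IN ('INT', 'NULL') THEN CAST(NULL AS VARCHAR)
              ELSE CAST(b AS VARCHAR)
         END AS b
  FROM dfs.`myDir`;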