Here is another way to think about it. Today, Drill is "schema-forbidden": even 
if I know the schema, I can't communicate that to Drill; Drill must figure it 
out on its own, making the same mistakes every time on ambiguous schemas.

Contrast this with Hive, which is "schema-required": I must tell Hive the 
schema even in cases where Hive could easily figure it out on its own. 

Perhaps Drill can occupy a middle ground: "schema-optional". Drill will figure 
out the schema as best it can, but will accept suggestions (hints), which the 
user can provide when that is the most efficient path to getting work done.
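To make the idea concrete, here is a toy Python sketch of type inference with an optional hint. Everything here is illustrative: the `infer_type` function, the hint dictionary, and the type names are hypothetical, not Drill's actual API. It shows the core ambiguity (all-null JSON gives no evidence of type) and how a user-supplied hint resolves it:

```python
import json

def infer_type(values, hints=None, column=None):
    """Return a type name for a column, preferring evidence in the data.

    Falls back to a user-supplied hint only when the data alone is
    ambiguous (e.g. every sampled value is null).
    """
    non_null = [v for v in values if v is not None]
    if non_null:
        # Schema discovered from the data itself; no hint needed.
        return type(non_null[0]).__name__
    if hints and column in hints:
        # Ambiguity resolved by the user's hint ("schema-optional").
        return hints[column]
    # No data, no hint: the engine must guess ("schema-forbidden").
    return "unknown"

# In JSON, if we see nothing but nulls, what type is the null?
rows = [json.loads(s) for s in ('{"a": null}', '{"a": null}')]
values = [r["a"] for r in rows]
print(infer_type(values))                         # unknown
print(infer_type(values, {"a": "VARCHAR"}, "a"))  # VARCHAR
```

The point of the sketch is that the hint is consulted only on ambiguity; when the data speaks for itself, the hint is ignored, so exploration still works with no schema at all.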

Once hints are supported, a system such as Ted's can be built on top: rerun the 
query, using a bit of machine learning to infer the schema. Or get the schema 
from Hive. Or from Comcast's Avro files. Or whatever.

The point is, if the user knows the schema, and is willing to resolve the 
ambiguities for us, what value do we provide by refusing to accept those hints?

On the other hand, since the schema is optional, Drill can continue to be used 
for Parth's schema-exploration use case. 

Still, after doing a bit of exploration, the user needs to move on to getting 
work done based on that exploration. This seems to be the case at Comcast: 
they've moved past exploration into production. But Drill has limited means to 
use the results of exploration to resolve schema ambiguities on future queries. 
(Views are a partial answer, but have gaps.)

Ted makes a good point: Drill works most of the time already. The suggestion is 
that users might prefer that Drill work not just most of the time, but all of 
the time, so they can reliably get their work done with no surprises, even with 
less-than-perfect schemas. If providing a few schema hints is the cost of that 
reliability, shouldn't the user be in a position to choose to make that 
tradeoff?

Thanks,
- Paul


    On Tuesday, April 3, 2018, 2:32:05 AM PDT, Parth Chandra 
<par...@apache.org> wrote:  
 This, of course, begs the question [1], doesn't it?

If you have the schema, then you have either a) spent time designing and
documenting your data (both the schema and dictionary containing the
semantics) or b) spent time "finding, interpreting, and cleaning data" to
discover the data schema and dictionary.

Data that has "no documentation beyond attribute names, which may be
inscrutable, vacuous, or even misleading" will continue to be so even after
you specify the schema.

Asking users to design their schemas when they have already accumulated
data that is unclean and undocumented is asking them to do the work that
they use your software for in the first place.

The goal of schema on read is to facilitate the task of interpreting the
data that already exists, is mutating, and is undocumented (or documented
badly).


[1] https://en.wikipedia.org/wiki/Begging_the_question


On Mon, Apr 2, 2018 at 11:16 AM, Paul Rogers <par0...@yahoo.com.invalid>
wrote:

> ...is the name of a provocative blog post [1].
> Quote: "Once found, diverse data sets are very hard to integrate, since
> the data typically contains no documentation on the semantics of its
> attributes. ... The rule of thumb is that data scientists spend 70% of
> their time finding, interpreting, and cleaning data, and only 30% actually
> analyzing it. Schema on read offers no help in these tasks, because data
> gives up none of its secrets until actually read, and even when read has no
> documentation beyond attribute names, which may be inscrutable, vacuous, or
> even misleading."
> This quote relates to a discussion Salim & I have been having: that Drill
> struggles to extract a usable schema directly from anything but the
> cleanest of data sets, leading to unwanted and unexpected schema change
> exceptions due to inherent ambiguities in how to interpret the data. (E.g.
> in JSON, if we see nothing but nulls, what type is the null?)
> A possible answer is further down in the post: "At Comcast, for instance,
> Kafka topics are associated with Apache Avro schemas that include
> non-trivial documentation on every attribute and use common subschemas to
> capture commonly used data... 'Schema on read' using Avro files thus
> includes rich documentation and common structures and naming conventions."
> Food for thought.
> Thanks,
> - Paul
> [1] https://www.oreilly.com/ideas/data-governance-and-the-death-of-schema-on-read?imm_mid=0fc3c6&cmp=em-data-na-na-newsltr_20180328
>
