Hi Hanu,
Sorry, I tossed in a new topic late in the discussion. We started by noting 
that views don't always work to resolve low-level schema conflicts for the 
reasons we discussed, so we need something else. That led to the schema hint 
discussion.

The additional point I raised was that views are still very useful for other 
tasks: computing derived columns (extended_price = price * quantity), 
filtering out data (rejecting certain kinds of unwanted records), and so on.
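
For example, a single view can cover both (just a sketch; the workspace and 
file path here are made up):

CREATE VIEW dfs.tmp.orders_v AS
SELECT cust_id,
       price * quantity AS extended_price  -- computed column
FROM dfs.`/data/orders.json`
WHERE quantity > 0;                        -- reject unwanted records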

If we need both hints and views (they serve distinct purposes), we'd want to 
ask how a user could combine them into a single file-based schema file so 
that query users see only a simplified version of the table (with the hints 
and views applied).
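
Just to make that concrete, such a file might hold both kinds of information 
(pure speculation; none of this syntax exists today, and "_table_" is an 
invented placeholder for the underlying files):

{
  "columns": [ {"name": "a", "type": "INT"} ],
  "view": "SELECT a, price * quantity AS extended_price FROM _table_ WHERE quantity > 0"
}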

Since I tossed in new ideas, here is one more. We once saw a wild-and-crazy 
form of JSON with embedded metadata. Assume a list of customers:

{name: {type: "string", value: "Fred"}, age: {type: "int", value: 40}}

In such a case, it would be great to be able to transform the data so that 
the fields become simple values, like this:

{name: "Fred", age: 40}

Views can do this for top-level fields. But there is no syntax in Drill to 
do this inside nested maps:

{... address: {street: {type: "string", value: "301 Cobblestone Way"}, ...}}

Ideally, we'd transform this to:

{name: "Fred", age: 40, address: {street: "301 Cobblestone Way", ...}}

So, if we come up with a metadata hint system, we (or the community) should be 
able to add rules for the type of messy data actually encountered in the field.

Thanks,
- Paul

 

On Thursday, April 5, 2018, 10:22:46 PM PDT, Hanumath Rao Maduri 
<hanu....@gmail.com> wrote:

Hello,

Thank you Paul for starting this discussion.
However, I was not clear on the latest point: how is providing hints
different from creating a view (a mechanism that already exists in Drill)?
I do think that creating a view can be cumbersome (in terms of syntax).
Hints are ephemeral, so they can be used for quick validation of the schema
for a query execution. But if the user absolutely knows the schema, then I
think creating a view and using it might be the better option.
Can you please share your thoughts on this?

Thank you Ted for your valuable suggestions. Regarding your comment that
"metastore is good but centralized is bad", can you please share your
viewpoint on what design issues it can cause? I know that it can be a
bottleneck, but I want to know about the other issues.
Put another way: if a centralized metastore were engineered well enough to
avoid most of the bottlenecks, do you think it would be good to use for
metadata?

Thanks,
-Hanu

On Thu, Apr 5, 2018 at 9:43 PM, Paul Rogers <par0...@yahoo.com.invalid>
wrote:

> Great discussion. Really appreciate the insight from the Drill users!
>
> To Ted's points: the simplest possible solution is to allow a table
> function to express types. Just making stuff up:
>
> SELECT a FROM schema(myTable, (a: INT))
>
> Or, a SQL extension:
>
> SELECT a FROM myTable(a: INT)
>
> Or, really ugly, a session option:
>
> ALTER SESSION SET schema.myTable="a: INT"
>
> All these are ephemeral and not compatible with, say, Tableau.
>
> Building on Ted's suggestion of using the (distributed) file system we can
> toss out a few half-baked ideas. Maybe use a directory to represent a name
> space, with files representing tables. If I have "weblogs" as my directory,
> I might have a file called "jsonlog" to describe the (messy) format of my
> JSON-formatted log files. And "csvlog" to describe my CSV-format logs.
> Different directories represent different SQL databases (schemas),
> different files represent tables within the schema.
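>
> Concretely, that might look like this (a hypothetical layout):
>
>   schemas/weblogs/jsonlog   <- describes the JSON log table
>   schemas/weblogs/csvlog    <- describes the CSV log table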
>
>
> The table files can store column hints. But they could do more. Maybe
> define the partitioning scheme (by year, month, day, say) so that it can be
> mapped to a column. Wouldn't it be great if Drill could figure out the
> partitioning itself if we gave it a date range?
>
> The file could also define the format plugin to use, and its options, to
> avoid the need to define the format separately from the data, and to reduce
> the need for table functions.
>
> Today, Drill matches files to format plugins using only extensions. The
> table file could provide a regex for those old-style files (such as real
> web logs) that don't use suffixes. Or, to differentiate between "sales.csv"
> and "returns.csv" in the same data directory.
>
>
> While we're at it, the file might as well contain a standard view to apply
> to the table to define computed columns, do data conversions and so on.
>
> If Drill does automatic scans (to detect schema, to gather stats), maybe
> store that alongside the table file: "csvlogs.drill" for the
> Drill-generated info.
>
>
> Voila! A nice schema definition with no formal metastore. Because the info
> is in files, it is easy to version using git, etc. (especially if the
> directory can be mounted using NFS as a normal directory). Atomic updates
> can be done via the rename trick (which, sadly, does not work on S3...)
>
>
> Or, maybe store all information in ZK in JSON as we do for plugin
> configurations. (Hard to version and modify though...)
>
>
> Lots of ways to skin this cat once we agree that hints are, in fact,
> useful additions to Drill's automatic schema detection.
>
>
> Thanks,
> - Paul
>
>
>
>    On Thursday, April 5, 2018, 3:22:07 PM PDT, Ted Dunning <
> ted.dunn...@gmail.com> wrote:
>
>  On Thu, Apr 5, 2018 at 7:24 AM, Joel Pfaff <joel.pf...@gmail.com> wrote:
>
> > Hello,
> >
> > A lot of versioning problems arise when trying to share data through
> > Kafka between multiple applications with different lifecycles and
> > maintainers, since by default, a single message in Kafka is just a blob.
> > One way to solve that is to agree on a single serialization format that
> > is friendly to record-by-record storage (like Avro) and, to avoid having
> > to serialize the schema with every message, just reference an entry in
> > the Avro Schema Registry (this flow is described here:
> > https://medium.com/@stephane.maarek/introduction-to-schemas-in-apache-kafka-with-the-confluent-schema-registry-3bf55e401321
> > ).
> > On top of the schema registry, specific client libs allow validating the
> > message structure prior to injection into Kafka.
> > So while Comcast mentions the usage of an Avro schema to describe its
> > feeds, it does not directly mention the usage of Avro files (to describe
> > the schema).
> >
>
> This is all good except for the assumption of a single schema for all time.
> You can mutate schemas in Avro (or JSON) in a future-proof manner, but it
> is important to recognize the simple truth that the data in a stream will
> not necessarily be uniform (and is even unlikely to be uniform).
>
>
>
>
> >
> > .... But the usage of CSV/JSON still is problematic. I like the idea of
> > having an optional way to describe the expected types somewhere (either
> > in a central meta-store, or in a structured file next to the dataset).
> >
>
> Central meta-stores are a serious problem and the single biggest
> nightmare in trying to upgrade Hive users. Let's avoid that if possible.
>
> Writing meta-data next to the file is also problematic if it needs to be
> written by the process doing a query (the directory may not be
> writable).
>
> Having a convention for redirecting the meta-data cache to a parallel
> directory might solve the problem of non-writable local locations.
>
> In the worst case, where Drill has no place to persist what it has
> learned but wants to do a restart, there needs to be SOME place to cache
> meta-data or else restarts will get no further than the original failed
> query.
>
>
  
