Great discussion. Really appreciate the insight from the Drill users!

To Ted's points: the simplest possible solution is to allow a table function to 
express types. Just making stuff up:

SELECT a FROM schema(myTable, (a: INT))

Or, a SQL extension:

SELECT a FROM myTable(a: INT)

Or, really ugly, a session option:

ALTER SESSION SET schema.myTable="a: INT"

All these are ephemeral and not compatible with, say, Tableau.

Building on Ted's suggestion of using the (distributed) file system, we can toss 
out a few half-baked ideas. Maybe use a directory to represent a namespace, 
with files representing tables. If I have "weblogs" as my directory, I might 
have a file called "jsonlog" to describe the (messy) format of my 
JSON-formatted log files, and "csvlog" to describe my CSV-format logs. 
Different directories represent different SQL databases (schemas); different 
files represent tables within the schema.
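
Sketching that layout with made-up names (nothing here is an existing Drill 
convention):

    /schemas/
      weblogs/          <-- SQL schema "weblogs"
        jsonlog         <-- table file describing the JSON logs
        csvlog          <-- table file describing the CSV logs
      sales/            <-- another schema
        orders
        returns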


The table files can store column hints. But they could do more. Maybe define the 
partitioning scheme (by year, month, day, say) so that it can be mapped to a 
column. Wouldn't it be great if Drill could figure out the partitioning 
itself if we gave it a date range?

The file could also define the format plugin to use, and its options, to avoid 
the need to define this format separately from the data, and to reduce the need 
for table functions.
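
Pulling those ideas together, the "csvlog" table file might look something like 
this. The syntax is entirely invented, just to illustrate; the only real bit is 
that dir0/dir1/dir2 mirror Drill's existing implicit directory columns:

    # weblogs/csvlog -- invented syntax
    format: "text"
    formatOptions: { fieldDelimiter: ",", extractHeader: false }
    columns:
      ip:        VARCHAR
      eventTime: VARCHAR     # raw CSV text; a view could convert it
      status:    INT
      bytes:     BIGINT
    partitions:              # map directory levels to columns
      year:  dir0
      month: dir1
      day:   dir2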

Today, Drill matches files to format plugins using only file extensions. The 
table file could provide a regex for those old-style files (such as real web 
logs) that don't use suffixes, or to differentiate between "sales.csv" and 
"returns.csv" in the same data directory.


While we're at it, the file might as well contain a standard view to apply to 
the table to define computed columns, do data conversions and so on.
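
For example, the embedded view could just be query text of this sort (the 
column names and types are the invented ones from the sketch above):

    SELECT ip,
           TO_TIMESTAMP(eventTime, 'dd/MMM/yyyy:HH:mm:ss')     AS event_ts,
           bytes / 1024                                        AS kbytes,
           CASE WHEN status >= 400 THEN 'error' ELSE 'ok' END  AS outcome
    FROM weblogs.csvlog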

If Drill does automatic scans (to detect schema, to gather stats), maybe store 
that alongside the table file: "csvlog.drill" for the Drill-generated info.
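
That generated file could be as trivial as (contents invented, just to give 
the flavor):

    # weblogs/csvlog.drill -- written by Drill, not by the user
    rowCount: 1250000
    detected:
      referrer: VARCHAR    # column seen in the data but not in the hints
    lastScan: 2018-04-05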


Voila! A nice schema definition with no formal metastore. Because the info is 
in files, it is easy to version using git, etc. (especially if the directory can 
be mounted via NFS as a normal directory). Atomic updates can be done via the 
rename trick (which, sadly, does not work on S3...)
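
The rename trick being the usual one: write the new definition under a 
temporary name, then rename it over the old file, which is atomic on local and 
HDFS-style file systems (paths here are hypothetical):

    cp /schemas/weblogs/csvlog /schemas/weblogs/.csvlog.tmp   # edit the copy...
    mv /schemas/weblogs/.csvlog.tmp /schemas/weblogs/csvlog   # ...then swap atomically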


Or, maybe store all information in ZK in JSON as we do for plugin 
configurations. (Hard to version and modify though...)


Lots of ways to skin this cat once we agree that hints are, in fact, useful 
additions to Drill's automatic schema detection.


Thanks,
- Paul

 

    On Thursday, April 5, 2018, 3:22:07 PM PDT, Ted Dunning 
<ted.dunn...@gmail.com> wrote:  
 
 On Thu, Apr 5, 2018 at 7:24 AM, Joel Pfaff <joel.pf...@gmail.com> wrote:

> Hello,
>
> A lot of versioning problems arise when trying to share data through Kafka
> between multiple applications with different lifecycles and maintainers,
> since by default a single message in Kafka is just a blob.
> One way to solve that is to agree on a single serialization format that is
> friendly to record-by-record storage (like Avro), and, to avoid having to
> serialize the schema in use with every message, to just reference an entry
> in the Avro Schema Registry (this flow is described here:
> https://medium.com/@stephane.maarek/introduction-to-schemas-in-apache-kafka-with-the-confluent-schema-registry-3bf55e401321
> ).
> On top of the schema registry, specific client libs allow validating the
> message structure prior to injection into Kafka.
> So while Comcast mentions the usage of an Avro schema to describe its
> feeds, it does not directly mention the usage of Avro files (to describe
> the schema).
>

This is all good except for the assumption of a single schema for all time.
You can mutate schemas in Avro (or JSON) in a future-proof manner, but it
is important to recognize the simple truth that the data in a stream will
not necessarily be uniform (and is even unlikely to be uniform).




>
> .... But the usage of CSV/JSON is still problematic. I like the idea of
> having an optional way to describe the expected types somewhere (either in
> a central meta-store, or in a structured file next to the dataset).
>

Central meta-stores are seriously bad problems and are the single biggest
nightmare in trying to upgrade Hive users. Let's avoid that if possible.

Writing meta-data next to the file is also problematic if it needs to be
written by the process doing a query (the directory may not be writable).

Having a convention for redirecting the meta-data cache to a parallel
directory might solve the problem of non-writable local locations.

In the worst case, where Drill can't find any place to persist what it has
learned but wants to do a restart, there needs to be SOME place to cache
meta-data or else restarts will get no further than the original failed
query.
  
