Re: [DISCUSS] Schema queries - solutions?

Igor Guzenko Tue, 18 Feb 2020 00:09:14 -0800

Hello Paul,

Yes, the Javadoc for Calcite's Schema interface explicitly describes nesting.
In Drill, everything starts from org.apache.drill.exec.store.AbstractSchema;
each storage plugin extends that class to manage sub-schema and table
registration.
Since HDF5 is implemented as a format plugin, I'm not sure whether the
mechanism can be easily applied to it.
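
To make the nesting idea concrete, here is a minimal sketch of how a multi-part
name such as `dfs`.`myFile.hdf5`.`dataSet1` could be resolved by walking nested
schema nodes. It uses neither Calcite's Schema nor Drill's AbstractSchema: the
SchemaSketch class, its Node type, and the resolve and buildRoot methods are all
hypothetical names invented for illustration, and treating an HDF5 file as a
sub-schema is an assumption, not something Drill does today.

```java
import java.util.*;

public class SchemaSketch {

    /** Hypothetical schema node; a stand-in, NOT Calcite's or Drill's API. */
    static class Node {
        final String name;
        final Map<String, Node> subSchemas = new LinkedHashMap<>();
        final Set<String> tables = new LinkedHashSet<>();

        Node(String name) { this.name = name; }

        Node addSubSchema(Node child) { subSchemas.put(child.name, child); return this; }

        Node addTable(String table) { tables.add(table); return this; }
    }

    /**
     * Walks every name part except the last through nested sub-schemas,
     * then looks up the last part as a table in the schema reached.
     */
    static Optional<String> resolve(Node root, List<String> path) {
        if (path.isEmpty()) {
            return Optional.empty();
        }
        Node current = root;
        for (String part : path.subList(0, path.size() - 1)) {
            current = current.subSchemas.get(part);
            if (current == null) {
                return Optional.empty(); // unknown sub-schema
            }
        }
        String last = path.get(path.size() - 1);
        return current.tables.contains(last)
                ? Optional.of(String.join(".", path))
                : Optional.empty();
    }

    /** Assumption: the HDF5 file is both a table in dfs and a sub-schema of data sets. */
    static Node buildRoot() {
        Node hdf5File = new Node("myFile.hdf5").addTable("dataSet1");
        Node dfs = new Node("dfs").addSubSchema(hdf5File).addTable("myFile.hdf5");
        return new Node("").addSubSchema(dfs);
    }

    public static void main(String[] args) {
        Node root = buildRoot();
        // Two-part path: the file itself is the table.
        System.out.println(resolve(root, Arrays.asList("dfs", "myFile.hdf5")));
        // Three-part path: the file acts as a schema, the data set is the table.
        System.out.println(resolve(root, Arrays.asList("dfs", "myFile.hdf5", "dataSet1")));
        // Unknown data set: nothing resolves.
        System.out.println(resolve(root, Arrays.asList("dfs", "myFile.hdf5", "missing")));
    }
}
```

In this sketch the same name, myFile.hdf5, is registered both as a table and as
a sub-schema, which is one possible way to let the two-part and three-part
forms coexist. Whether a format plugin could register such a sub-schema in
Drill is exactly the open question.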


Thanks,
Igor

On Tue, Feb 18, 2020 at 9:48 AM Paul Rogers <[email protected]>
wrote:

> Hi Igor,
>
> Thanks! I should have remembered that bit of SQL.
>
> Yes, if we can generalize `DESCRIBE`, we could create another path of some
> kind through the plugins that says, "return schema, not data."
>
> Then, for the HDF5 use case we could have:
>
> DESCRIBE TABLE `dfs`.`myFile.hdf5` -- returns schema
>
>
> And
>
> SELECT * FROM `dfs`.`myFile.hdf5` -- returns data
>
>
> Nice solution! I'll file a feature request.
>
>
> The next interesting bit about HDF5 is that it is a file system: it
> contains multiple data sets. It would be great to be able to express that
> in the FROM clause:
>
> SELECT * FROM `dfs`.`myFile.hdf5`.`dataSet1`
>
> From my random walks through Calcite, it appears that we can have any level
> of schema/table path. True? We'd need some way to resolve a name part to a
> file, then ask the format plugin for that file if it supports additional
> parts. This seems pretty obscure. Have we done anything like that before?
> Maybe in storage (rather than format) plugin?
>
> Thanks,
> - Paul
>
>
>
>     On Monday, February 17, 2020, 11:34:48 PM PST, Igor Guzenko <
> [email protected]> wrote:
>
>  Hello Paul,
>
> Seems like we simply need to improve our DESCRIBE [1] table functionality.
>
> [1] https://drill.apache.org/docs/describe/
>
> Thanks,
> Igor
>
> On Tue, Feb 18, 2020 at 9:23 AM Paul Rogers <[email protected]>
> wrote:
>
> > Hi All,
> >
> > Charles has a little PR, #1978, that adds a convenient feature to his
> > HDF5 format reader: the ability to query the schema of the file. (It
> > seems that HDF5 is a bit like a zip file: it contains a set of files.
> > Unlike zip, each file is a data set with a schema.) Charles added a
> > clever way to tell the reader that the user wants a schema rather than
> > data.
> >
> > If we think a bit, we realize that a schema query would be handy for any
> > data source. Maybe I want to know the fields in a JSON or Parquet file
> > without getting the data for those fields (and, for example, inferring
> > type and nullability from data).
> >
> > In a relational DB, we'd get the schema by querying system tables. We'd
> > do the same thing in Hive because Hive requires an up-front schema. But,
> > Drill is unique in that it can infer schema at run time; no previous
> > schema required. So, we have no system tables to answer schema
> > questions. Instead, we want to get the schema directly from the data
> > source itself by doing a query.
> >
> > (This feature would be in addition to the case when the Metastore does
> > hold a schema.)
> >
> >
> > How might we accomplish the same result? Can we create some kind of
> > "virtual" system table that tells us to rewrite the query to get schema?
> > Something like:
> >
> > SELECT * FROM sys.columns WHERE tableName = `dfs`.`my/path/someFile.json`
> >
> > Or, maybe some implied columns in the table schema?
> >
> >
> > SELECT schema.* FROM `dfs`.`my/path/someFile.json`
> >
> >
> > Or, maybe some special schema name space?
> >
> > SELECT schema.* FROM schema.`dfs`.`my/path/someFile.json`
> >
> >
> > Anyone know of any system that has an elegant solution we could mimic?
> > Other suggestions?
> >
> >
> > Thanks,
> > - Paul
> >
> >
>
