[DISCUSS] Schema queries - solutions?

Paul Rogers Mon, 17 Feb 2020 23:24:13 -0800

Hi All,

Charles has a little PR, #1978, that adds a convenient feature to his HDF5 
format reader: the ability to query the schema of the file. (It seems that HDF5 
is a bit like a zip file: it contains a set of files. Unlike zip, each file is 
a data set with a schema.) Charles added a clever way to tell the reader that 
the user wants a schema rather than data.


If we think a bit, we realize that a schema query would be handy for any data 
source. Maybe I want to know the fields in a JSON or Parquet file without 
getting the data for those fields (and, for example, inferring type and 
nullability from data).
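
To make "schema rather than data" a bit more concrete, here is a rough sketch in Python of what such a query might return for a JSON source. This is not Drill code: the function name `infer_schema` and the output shape are made up for illustration, and it assumes one flat JSON object per line. It scans the records and reports only field names, the type observed for each, and whether the field was ever null or missing.

```python
import json


def infer_schema(lines):
    """Infer {field: {"type", "nullable"}} from JSON records, one object per line.

    Hypothetical helper for illustration only; not part of Drill.
    """
    schema = {}
    seen = 0  # number of records processed so far
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)

        # A known field that is absent from this record must be nullable.
        for name in schema:
            if name not in record:
                schema[name]["nullable"] = True

        for name, value in record.items():
            if name not in schema:
                # A field first seen after earlier records was missing
                # from those records, so it starts out nullable.
                schema[name] = {"type": None, "nullable": seen > 0}
            col = schema[name]
            if value is None:
                col["nullable"] = True
                continue
            type_name = type(value).__name__
            if col["type"] is None:
                col["type"] = type_name
            elif col["type"] != type_name:
                col["type"] = "mixed"  # conflicting types across records
        seen += 1
    return schema


if __name__ == "__main__":
    sample = ['{"a": 1, "b": "x"}', '{"a": null, "c": 2.5}']
    for field, info in infer_schema(sample).items():
        print(field, info["type"], "nullable" if info["nullable"] else "required")
```

The point is only that the result is one row per field, not one row per record. A real reader would have to handle nested types, and could skip the scan entirely for formats like Parquet that store the schema in the file.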

In a relational DB, we'd get the schema by querying system tables. We'd do the 
same thing in Hive because Hive requires an up-front schema. But, Drill is 
unique in that it can infer schema at run time; no previous schema required. 
So, we have no system tables to answer schema questions. Instead, we want to 
get the schema directly from the data source itself by doing a query.

(This feature would be in addition to the case when the Metastore does hold a 
schema.)


How might we accomplish the same result? Can we create some kind of "virtual" 
system table that tells us to rewrite the query to get schema? Something like:

SELECT * FROM sys.columns WHERE tableName = `dfs`.`my/path/someFile.json`

Or, maybe some implied columns in the table schema?


SELECT schema.* FROM `dfs`.`my/path/someFile.json`


Or, maybe some special schema name space?

SELECT schema.* FROM schema.`dfs`.`my/path/someFile.json`


Anyone know of any system that has an elegant solution we could mimic? Other 
suggestions?


Thanks,
- Paul
