Hi Weijie,

Thanks for the paper pointer. F1 uses the same syntax as Scope (the system 
cited in my earlier note): data type after the name.

Another description is [1]. Neither paper describes how F1 handles arrays. 
However, this second paper points out that Protobuf is F1's native format, and 
so F1 has support for nested types. Drill does also, but in Drill, a reference 
to "customer.phone.cell" causes the nested "cell" column to be projected as a 
top-level column. Also, neither paper says whether F1 is used with O/JDBC, and if 
so, how the mapping from nested types to the flat tuple structure required by 
xDBC is handled.
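
For illustration, a sketch of the Drill behavior I mean (file and column
names are hypothetical):

SELECT t.customer.phone.cell FROM dfs.`customers.json` t
-- Drill returns "cell" as a top-level column in the result set,
-- not as a field nested inside a "phone" map.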

Have you come across these details?

Thanks,
- Paul

 

    On Thursday, September 6, 2018, 8:43:57 PM PDT, weijie tong 
<[email protected]> wrote:  
 
Google's latest paper about F1 [1] claims to support arbitrary data sources
through an extension API called TVF (table-valued functions); see section 6.3.
It also requires the column data types to be declared before the query.


[1] http://www.vldb.org/pvldb/vol11/p1835-samwel.pdf

On Fri, Sep 7, 2018 at 9:47 AM Paul Rogers <[email protected]>
wrote:

> Hi All,
>
> We've discussed quite a few times whether Drill should or should not
> support or require schemas, and if so, how the user might express the
> schema.
>
> I came across a paper [1] that suggests a simple, elegant SQL extension:
>
> EXTRACT <column>[:<type>] {,<column>[:<type>]}
> FROM <stream_name>
>
> Paraphrasing into Drill's SQL:
>
> SELECT <column>[:<type>] [AS <alias>] {,<column>[:<type>] [AS <alias>]}
> FROM <table_name>
>
> Have a collection of JSON files in which string column `foo` appears in
> only half the files? Don't want to get schema conflicts with VARCHAR and
> nullable INT? Just do:
>
> SELECT name:VARCHAR, age:INT, foo:VARCHAR
> FROM `my-dir` ...
>
> Not only can the syntax be used to specify the "natural" type for a
> column, it might also specify a preferred type. For example, "age:INT" says
> that "age" is an INT, even though JSON would normally parse it as a BIGINT.
> Similarly, using this syntax is an easy way to tell Drill how to convert CSV
> columns from strings to DATE, INT, FLOAT, etc. without the need for CAST
> functions. (CAST functions read the data in one format, then convert it to
> another in a Project operator. Using a column type might let the reader do
> the conversion -- something that is easy to implement if using the "result
> set loader" mechanism.) See the sketch below.
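>
> As a rough sketch (assuming a CSV file with a header row and hypothetical
> columns "order_date" and "qty"), today one would write something like:
>
> SELECT CAST(order_date AS DATE) AS order_date,
>        CAST(qty AS INT) AS qty
> FROM `orders.csv`
>
> whereas with typed columns the reader itself could do the conversion:
>
> SELECT order_date:DATE, qty:INT FROM `orders.csv`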
>
> Plus, the syntax fits nicely into the existing view file structure. If the
> types appear in views, then client tools can continue to use standard SQL
> without the type information.
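>
> A rough sketch of how that might look (view and path names are hypothetical):
>
> CREATE VIEW dfs.tmp.`typed_customers` AS
> SELECT name:VARCHAR, age:INT, foo:VARCHAR
> FROM `my-dir`
>
> Client tools would then issue plain SELECT statements against the view and
> never see the type annotations.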
>
> When this idea came up in the past, someone mentioned the issue of
> nullable vs. non-nullable. (Let's also include arrays, since Drill supports
> them.) Maybe add a suffix to the name:
>
> SELECT req:VARCHAR NOT NULL, opt:INT NULL, arr:FLOAT[] FROM ...
>
> Not pretty, but works with the existing SQL syntax rules.
>
> Obviously, Drill has much on its plate, so I'm not suggesting that Drill
> should do this soon. Just passing it along as yet another option to
> consider.
>
> Thanks,
> - Paul
>
> [1] http://www.cs.columbia.edu/~jrzhou/pub/Scope-VLDBJ.pdf