All,

After a recent Apache Drill talk in Berlin, one of the people in the audience, 
Gunnar (in CC), raised some very interesting schema-related challenges. He was 
kind enough to write up his questions, see below. I hope we, as a community, 
do a good job of addressing them.

Again, thanks a lot for your interest in Drill, Gunnar, and I hope you'll find 
time to test-drive it once it's available in all its beauty! 

Cheers,
                Michael

--
Michael Hausenblas
Ireland, Europe
http://mhausenblas.info/

Begin forwarded message:

> From: Gunnar Schröder <[email protected]>
> Subject: Apache Drill and Schema
> Date: 9 May 2013 22:15:59 GMT+01:00
> 
> Hi Michael,
> 
> As promised, I'm e-mailing you some questions regarding Apache Drill that 
> crossed my mind while listening to your talk. Since you might want to forward 
> this e-mail, or parts of it, to the mailing list, I'm writing in English. :-)
> 
> The question that puzzled me most after hearing your talk was how Apache 
> Drill handles schema information about the data in any of its sources.
> 
> RDBMSs, for example, have a very strict schema definition (schema on write). 
> Hive and Impala use the metastore to define a very similar schema, which must 
> be available before data can be queried (schema on read). Such a schema is 
> necessary for Hive to query HBase, which has a much more flexible data model, 
> e.g. dynamic columns and no explicit data types.
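> 
> For illustration, this is roughly what such a mapping looks like in Hive; the 
> table and column names are of course made up:
> 
>     CREATE EXTERNAL TABLE hbase_users (rowkey STRING, age INT)
>     STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>     WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:age")
>     TBLPROPERTIES ("hbase.table.name" = "users");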
> 
> I have trouble understanding how you want to implement SQL (or any other 
> high-level query language) when schema information is missing or only partly 
> available. If you reuse the schema information kept in Hive's metastore, or 
> that of a relational database such as MySQL, you are fine. But how do you 
> handle a raw CSV file, or an HBase table, without Hive schema information?
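> 
> Just to make the gap concrete: for a raw CSV file, Hive would require a 
> declaration along these lines before anything can be queried (file layout and 
> names are invented):
> 
>     CREATE EXTERNAL TABLE raw_events (id INT, name STRING, ts STRING)
>     ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
>     LOCATION '/data/raw_events';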
> 
> Some simple examples:
> 
> SELECT * FROM table ORDER BY col1;
> 
> If the table is a CSV file or an HBase table, how do you determine whether 
> the ordering should use string sort order or numeric order?
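> 
> Without a declared type, I could imagine the query itself having to state the 
> intent, e.g. via an explicit cast (table and column names are placeholders):
> 
>     SELECT * FROM events ORDER BY col1;               -- string or numeric order?
>     SELECT * FROM events ORDER BY CAST(col1 AS INT);  -- forces numeric order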
> 
> SELECT * FROM table;
> 
> I'm no expert on this, but I assume that if you query a table via JDBC or 
> ODBC, the full schema information (column names and types) must be known 
> before the first result of the query can be returned.
> 
> SELECT * FROM table1, table2 WHERE table1.col1 = table2.col2;
> 
> Let's assume the column being joined on is a string. Some storage engines 
> distinguish between NULL and the empty string. Some storage engines are not 
> in first normal form and may store multiple string values in a single column.
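> 
> A sketch of how the ambiguity shows up in practice (table and column names are 
> placeholders): in standard SQL a NULL never matches in a join, so the two 
> queries below can return different results depending on how the engine maps 
> missing values:
> 
>     SELECT * FROM table1 JOIN table2
>       ON table1.col1 = table2.col2;                              -- NULLs never match
> 
>     SELECT * FROM table1 JOIN table2
>       ON COALESCE(table1.col1, '') = COALESCE(table2.col2, '');  -- NULL treated as ''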
> 
> These are just a few examples, but I doubt whether it is possible to 
> implement SQL:2003 as the query language without full schema information.
> 
> I scrolled through the Apache Drill design document; it mentions a metadata 
> repository, but the information regarding schema and typing is very sketchy.
> 
> My recommendation would be:
> Start out by reusing the Hive metadata repository and implement Apache Drill 
> on top of this schema information. Then gradually replace or supplement this 
> with your own (possibly more dynamic and flexible) metadata repository.
> 
> To be successful:
> - Keep it under the Apache open source licence
> - Your major competitor is Impala, so think about the respects in which you 
> can be better
> - SQL and JDBC are very important for success (they are the interface to 
> databases and reporting tools)
> - HBase would benefit greatly from a low-latency SQL query facility (several 
> attempts are ongoing)
> - Provide interfaces to plug in new storage or file format handlers, e.g. to 
> access files in your own particular format within HDFS (this is a strength of 
> Hive, but seems hard to do with Impala; a Hive-style sketch follows below)
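> 
> In Hive, plugging in a custom handler boils down to something like the 
> following (the handler class here is purely hypothetical):
> 
>     CREATE EXTERNAL TABLE my_events (id INT, payload STRING)
>     STORED BY 'com.example.MyCustomStorageHandler'
>     TBLPROPERTIES ("my.format.path" = "/data/my_events");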
> 
> Maybe these random thoughts about Apache Drill are useful for you. I would 
> love to see a true open-source alternative to Impala that supplements Hive 
> with real-time queries, is implemented in Java, and provides pluggable 
> interfaces for extension.
> 
> Cheers,
> Gunnar Schröder
