All,
After a recent Apache Drill talk in Berlin, one of the people in the audience,
Gunnar (in CC) raised some very interesting schema-related challenges. He was
so kind to write up his questions, see below. I hope we do, as a community, a
good job addressing these.
Again, thanks a lot for your interest in Drill, Gunnar, and hope you'll find
time to test drive it once it's available in all its beauty!
Cheers,
Michael
--
Michael Hausenblas
Ireland, Europe
http://mhausenblas.info/
Begin forwarded message:
> From: Gunnar Schröder <[email protected]>
> Subject: Apache Drill and Schema
> Date: 9 May 2013 22:15:59 GMT+01:00
>
> Hi Michael,
>
> as I promised I e-mail you some questions regarding Apache Drill, that
> crossed my mind when hearing your talk. Since you might want to forward this
> e-mail or parts of it to the mailing list, I will write you in English. :-)
>
> The question that puzzled me most after hearing your talk was how Apache
> Drill handles schema information about the data in any of its sources.
>
> RDBMSs for example have a very strict schema definition (Schema on write).
> Hive and Impala use the metastore to define a very similar schema definition,
> that must be available before querying data (Schema on Read). This schema is
> necessary for Hive to query HBase which has a more flexible schema, e.g.
> dynamic columns and no explicit types for data.
>
> I have problems understanding, how you want to implement SQL as query
> language (or any other high level query language) if Schema information is
> not or only partly available? If you reuse the schema information of Hive in
> its metastore or relational database such as MySQL you are fine. But how do
> you handle a raw CSV file or HBase without Hive schema information?
>
> Some simple examples:
>
> Select * from table order by col1;
>
> If the table is a CSV file or HBase table, how do you determine whether the
> ordering is done using string sort order or numeric order?
>
> Select * from table;
>
> I'm no expert on this, but I assume that if you query a table using JDBC or
> ODBC you need full schema information before you obtain the first result of
> the query.
>
> Select * from table1, table2 where table1.col1 = table2.col2;
>
> Let's assume the column to join on is a string. Some storage engines may make
> a distinction between null and empty string. Some storage engines may be non
> first normal form and may have multiple strings.
>
> These are just some examples, but I have doubts, whether it is possible to
> implement SQL 2003 as query language without full schema information.
>
> I scrolled through the Apache Drill design document and it mentions a
> metadata repository but the information regarding schema and typing is very
> sketchy.
>
> My recommendation would be:
> Start out by reusing the Hive metadata repository and implement Apache Drill
> on top of this schema information. Then gradually replace or supplement this
> with your own (possibly more dynamic and flexible) metadata repository.
>
> To be successful:
> - Keep it Apache Open Source Licence
> - Your major competitor is Impala, so think about in which respect you can be
> better
> - SQL and JDBC is very important to be successful (interface to DBs and
> reporting tools)
> - HBase would be great with a low latency SQL querying facility (different
> attempts are ongoing)
> - Provide interfaces to plug in new storage or file format handlers (this is
> a strength of Hive, but seems hard to do with Impala), e.g. to access files
> of your own particular format within HDFS
>
> Maybe these random thought about Apache Drill are useful for you. I would
> love to see a true open source alternative to Impala, which supplements Hive
> for real time queries, that is implemented in Java and provides pluggable
> interfaces to extend it.
>
> Cheers,
> Gunnar Schröder