Hey Gunnar, thanks for the great questions! The key thing to remember is that we're trying to allow late-bind schema (as opposed to never-bind). We still need the storage engine to be able to determine the schema; it just doesn't have to happen until we're actually executing a query (e.g. reading the data off disk). This is more about the execution engine than about metadata management. (I don't think there is any desire to build a new metadata management solution today. Leveraging the schema information available in the Hive metastore is definitely a key priority.) The key for Drill is being able to execute some queries even if we don't have schema information up front. What that means differs depending on the type of underlying storage.
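To make "late bind" concrete: none of this is Drill code, but as a toy sketch (assuming self-describing JSON input), the schema can be discovered while the records are being read, rather than being declared before the query runs:

```python
import json

def infer_schema(records):
    """Discover column names and types at read time (late binding),
    instead of requiring a declared schema before the query starts."""
    schema = {}
    for rec in records:
        for name, value in rec.items():
            # First type seen for a column wins in this toy version.
            schema.setdefault(name, type(value).__name__)
    return schema

rows = [json.loads(s) for s in (
    '{"id": 1, "name": "a"}',
    '{"id": 2, "score": 3.5}',
)]
print(infer_schema(rows))  # {'id': 'int', 'name': 'str', 'score': 'float'}
```

A real engine would also have to reconcile conflicting types across records and handle nesting, but the point is the same: the reader produces the schema as a byproduct of scanning the data.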
For example, JSON does a nice job of differentiating between some value types and giving everything names, so things will work pretty well there. The same goes for MongoDB. On the flip side, HBase doesn't have a first-class approach to data types (yet) but does have a clear named schema system. In that case, we'll support simple queries but will have to treat everything as opaque byte values unless you do an explicit cast. (In time, we could enable implicit casts where they seem reasonable. For example, if we discover a column that is always a fixed four bytes in width, we may support implicit casting to Hadoop's writable int type.) CSV and other delimited formats are even less clean. I expect that we will allow index-based retrieval where we know the delimiter, and just default to particular sort orders depending on what the data looks like.

A late-bind schema system also helps where the data has a clear format, since no pre-setup is required before you start to query it. For example, if you have a bunch of Parquet or Sequence files, just point the system at that data and start querying; the files express their own schema information.

This does make things more challenging from other perspectives. From the JDBC perspective, the simplest implementation provides a ResultSet in return for a Statement. In this case, the schema isn't needed until we return a ResultSet, so things aren't that bad for flat data types (nesting is more awkward). INFORMATION_SCHEMA support is harder; however, I think most query tools leverage that information primarily to help with query generation and won't block execution of what appears to be an invalid query according to their records (since they wouldn't know about late-bound fields).

As far as your recommendations go, they are directly in line with our goals for Apache Drill.
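The implicit-cast heuristic for HBase mentioned above could look something like this. Again, this is not Drill's implementation, just a toy sketch of the idea (fixed four-byte cells get interpreted as big-endian signed ints, everything else stays opaque bytes):

```python
import struct

def cast_cell(cell):
    """Heuristic cast for an opaque HBase-style byte value: a cell that
    is exactly four bytes wide is decoded as a big-endian signed int
    (HBase's usual serialization for ints); anything else is returned
    as raw bytes."""
    if len(cell) == 4:
        return struct.unpack(">i", cell)[0]
    return cell

print(cast_cell(b"\x00\x00\x01\x00"))  # 256
print(cast_cell(b"hello"))             # b'hello'
```

A production version would look at all values of the column (not one cell) before deciding the width is truly fixed, and would still let an explicit cast override the guess.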
To summarize:

>> My recommendation would be:
>> Start out by reusing the Hive metadata repository and implement Apache Drill
>> on top of this schema information. Then gradually replace or supplement this
>> with your own (possibly more dynamic and flexible) metadata repository.

Agreed.

>> To be successful:
>> - Keep it Apache Open Source Licence
>> - Your major competitor is Impala, so think about in which respect you can
>>   be better
>> - SQL and JDBC is very important to be successful (interface to DBs and
>>   reporting tools)
>> - HBase would be great with a low latency SQL querying facility (different
>>   attempts are ongoing)
>> - Provide interfaces to plug in new storage or file format handlers (this is
>>   a strength of Hive, but seems hard to do with Impala), e.g. to access files
>>   of your own particular format within HDFS

Agreed.

Thanks,
Jacques

On Sat, May 11, 2013 at 11:43 AM, Michael Hausenblas <[email protected]> wrote:

> All,
>
> After a recent Apache Drill talk in Berlin, one of the people in the
> audience, Gunnar (in CC), raised some very interesting schema-related
> challenges. He was so kind to write up his questions, see below. I hope we
> do, as a community, a good job addressing these.
>
> Again, thanks a lot for your interest in Drill, Gunnar, and hope you'll find
> time to test drive it once it's available in all its beauty!
>
> Cheers,
> Michael
>
> --
> Michael Hausenblas
> Ireland, Europe
> http://mhausenblas.info/
>
> Begin forwarded message:
>
>> From: Gunnar Schröder <[email protected]>
>> Subject: Apache Drill and Schema
>> Date: 9 May 2013 22:15:59 GMT+01:00
>>
>> Hi Michael,
>>
>> as I promised, I am e-mailing you some questions regarding Apache Drill
>> that crossed my mind when hearing your talk. Since you might want to forward
>> this e-mail or parts of it to the mailing list, I will write to you in English.
>> :-)
>>
>> The question that puzzled me most after hearing your talk was how Apache
>> Drill handles schema information about the data in any of its sources.
>>
>> RDBMSs, for example, have a very strict schema definition (schema on write).
>> Hive and Impala use the metastore to define a very similar schema
>> definition that must be available before querying data (schema on read).
>> This schema is necessary for Hive to query HBase, which has a more flexible
>> schema, e.g. dynamic columns and no explicit types for data.
>>
>> I have problems understanding how you want to implement SQL as the query
>> language (or any other high-level query language) if schema information is
>> not or only partly available. If you reuse the schema information of Hive in
>> its metastore or a relational database such as MySQL, you are fine. But how
>> do you handle a raw CSV file or HBase without Hive schema information?
>>
>> Some simple examples:
>>
>> Select * from table order by col1;
>>
>> If the table is a CSV file or an HBase table, how do you determine whether
>> the ordering is done using string sort order or numeric order?
>>
>> Select * from table;
>>
>> I'm no expert on this, but I assume that if you query a table using JDBC or
>> ODBC, you need full schema information before you obtain the first result of
>> the query.
>>
>> Select * from table1, table2 where table1.col1 = table2.col2;
>>
>> Let's assume the column to join on is a string. Some storage engines may
>> make a distinction between null and the empty string. Some storage engines
>> may be non-first-normal-form and may have multiple strings.
>>
>> These are just some examples, but I have doubts whether it is possible to
>> implement SQL 2003 as the query language without full schema information.
>>
>> I scrolled through the Apache Drill design document, and it mentions a
>> metadata repository, but the information regarding schema and typing is very
>> sketchy.
>>
>> My recommendation would be:
>> Start out by reusing the Hive metadata repository and implement Apache Drill
>> on top of this schema information. Then gradually replace or supplement this
>> with your own (possibly more dynamic and flexible) metadata repository.
>>
>> To be successful:
>> - Keep it Apache Open Source Licence
>> - Your major competitor is Impala, so think about in which respect you can
>>   be better
>> - SQL and JDBC is very important to be successful (interface to DBs and
>>   reporting tools)
>> - HBase would be great with a low latency SQL querying facility (different
>>   attempts are ongoing)
>> - Provide interfaces to plug in new storage or file format handlers (this is
>>   a strength of Hive, but seems hard to do with Impala), e.g. to access files
>>   of your own particular format within HDFS
>>
>> Maybe these random thoughts about Apache Drill are useful for you. I would
>> love to see a true open source alternative to Impala, which supplements Hive
>> for real-time queries, is implemented in Java, and provides pluggable
>> interfaces to extend it.
>>
>> Cheers,
>> Gunnar Schröder
>
