Incrementally generating an "as-observed" schema is an interesting idea. It should come nearly for free, since the records need to be parsed in any case.
I would guess that this won't work so well with data that has a highly variable record structure, but it hardly seems likely to be any worse than the truly schema-free design that we had been talking about up to now.

On Wed, Nov 14, 2012 at 3:17 PM, Jacques Nadeau <[email protected]> wrote:

> One of the goals we've talked about for Drill is the ability to consume
> "schemaless" data. What this really means to me is data such as JSON where
> the schema of data could change from record to record (and isn't known
> until query execution). I'd suggest that in most cases, the schema within
> a JSON 'source' (collection of similar files) is mostly stable. The
> default JSON format passes this schema data with each record. This would
> be the simplest way to manage this data. However, if Drill operated in
> this manner we'd likely have to manage fairly different code paths for data
> with schema versus those without. There also seems like there would be a
> substantial processing and message size overhead interacting with all the
> schema information for each record. Couple of notes:
>
> - By schema here I more mean the structure of the key names and nested
> structure of the data as opposed to value data types...
> - A simple example: we have a user table and one of the query
> expressions is user.phone-numbers. If we query that without schema, we
> don't know if that is a scalar, a map or an array. Thus... we can't
> figure out the number of "fields" in the output stream.
>
>
> Separately, we've also talked before about having all the main executional
> components operating on a batches of records as a single work unit
> (probably in MsgPack streaming format or similar).
>
> One way to manage schemaless data within these parameters is to generate a
> concrete schema of data as we're reading it and sending it with each batch
> of records. To start, we could resend it with every batch. Later, we
> could add an optimization that the schema is only sent when it changes. A
> nice additional option would be to store this schema stream as we're
> running the first queries so we can treat this data as fully schemaed on
> later queries. (And also provide that schema back to whatever query parser
> is being used.)
>
> Thoughts? What about thoughts on data types discovery in schemaless data
> such as JSON, CSV, etc?
>
> Jacques
>
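
To make the proposal above concrete, here is a rough sketch of what "generate the schema as we read, and send it only when it changes" might look like. This is illustration only, not Drill code: the names shape, merge and read_batches are hypothetical, the input is assumed to be one JSON object per line, and the "mixed" marker for a field seen with conflicting structure (the user.phone-numbers case) is my own assumption about how such a conflict could be recorded. As in the proposal, only key names and nesting are tracked, not value data types.

```python
import json

def shape(value):
    """Describe structure only (scalar, map, or array), ignoring value data types."""
    if isinstance(value, dict):
        return {key: shape(child) for key, child in value.items()}
    if isinstance(value, list):
        merged = None
        for item in value:
            merged = merge(merged, shape(item))
        return [merged]  # one element shape stands for the whole array
    return "scalar"

def merge(a, b):
    """Combine two observed shapes; structural disagreement collapses to 'mixed'."""
    if a is None:
        return b
    if b is None:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        return {k: merge(a.get(k), b.get(k)) for k in sorted(set(a) | set(b))}
    if isinstance(a, list) and isinstance(b, list):
        return [merge(a[0], b[0])]
    return a if a == b else "mixed"

def read_batches(lines, batch_size):
    """Yield (schema_or_None, records); the schema is attached only when it changed."""
    schema, sent, batch = None, None, []
    for line in lines:
        record = json.loads(line)
        schema = merge(schema, shape(record))  # as-observed schema so far
        batch.append(record)
        if len(batch) == batch_size:
            yield (schema if schema != sent else None), batch
            sent, batch = schema, []
    if batch:  # flush the final, possibly short, batch
        yield (schema if schema != sent else None), batch

# Made-up sample data: phone-numbers is a scalar until the last record.
lines = [
    '{"name": "a", "phone-numbers": "555-0100"}',
    '{"name": "b", "phone-numbers": "555-0101"}',
    '{"name": "c", "phone-numbers": "555-0102"}',
    '{"name": "d", "phone-numbers": "555-0103"}',
    '{"name": "e", "phone-numbers": ["555-0104", "555-0105"]}',
]
batches = list(read_batches(lines, batch_size=2))
```

With this sample, the first batch carries a schema, the second carries none because nothing changed, and the third carries an updated one because phone-numbers was seen as both a scalar and an array. Note the schema describes everything observed up to the end of the batch, so a downstream operator would still need some policy for a field that changes structure partway through.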
