Hi Jacques,

I'll add a simple IDL that we can iterate on.
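As a strawman, here is roughly the shape I have in mind, modeled in Java
(every name here is a placeholder we can change; the tree itself is what
would get serialized, e.g. as MsgPack, alongside the data):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Sketch only: a schema is a tree that mirrors the nesting of a record,
    // capturing key names and structure rather than value types.
    public class SchemaNode {
        public enum Kind { SCALAR, ARRAY, MAP }

        private final String name;
        private final Kind kind;
        // MAP: one child per key. ARRAY: a single child (named "[]" here)
        // describing the element shape. SCALAR: no children.
        private final Map<String, SchemaNode> children = new LinkedHashMap<>();

        public SchemaNode(String name, Kind kind) {
            this.name = name;
            this.kind = kind;
        }

        public SchemaNode child(SchemaNode c) {
            children.put(c.name, c);
            return this;
        }
    }

For example, a user record whose phone-numbers field is an array of scalars
would be described as:

    SchemaNode user = new SchemaNode("user", SchemaNode.Kind.MAP)
        .child(new SchemaNode("phone-numbers", SchemaNode.Kind.ARRAY)
            .child(new SchemaNode("[]", SchemaNode.Kind.SCALAR)));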
About the filtering discussion, do you want to bring it to a Google doc?

Tim

Sent from my iPhone

On Nov 14, 2012, at 9:42 PM, Jacques Nadeau <[email protected]> wrote:

> Hey Timothy,
>
> It's great that you started pulling something together. Thanks for taking
> the initiative! Do you want to spend some time looking at defining an IDL
> for MsgPack schema information and add that to your work?
>
> We also need to come up with a standard selection/filter
> vocabulary/approach. It would preferably cover things like:
>
>    - Support simple field/tree inclusion lists and wildcards:
>       - Classic relational, like {column1, column2, column3}
>       - Nested, like {arrayColumn1.[*], mapColumn.foo}
>    - Support some kind of filters that could prune records, leaves, or
>    branches:
>       - only include the first three sub-elements
>       - only include map keys that start with "user%"
>       - only include this record where at least one
>       arrayColumn.phone-number starts with "415%"
>
> One idea might be to conceive of a fourth concept on top of the classic
> (table|scalar|aggregate) functions, called tree functions, and generate a
> set of primitives for that. Then allow scalar functions inside tree
> function evaluation. (I haven't thought a great deal about what this
> means.) I've also thought that XPath might be a good place to look for
> conceptual inspiration. (But I don't think we have any interest in going
> to that level...)
>
> Does any of this sound interesting? (That also goes for anyone out there
> who is lurking...)
>
> Thanks again,
> Jacques
>
>
> On Wed, Nov 14, 2012 at 5:45 PM, Timothy Chen <[email protected]> wrote:
>
>> I don't have much to add to the options you've suggested; I do agree
>> that storing the schema and sending the diffs will be the ideal way to
>> go.
>>
>> And since we already need to look at every row, we can build the schema
>> diffs pretty easily.
>>
>> I currently have a simple JSON -> MsgPack impl using Yajl here:
>> https://github.com/tnachen/incubator-drill/tree/executor/sandbox/executor
>>
>> Depending on the parser we use, most parsers already have basic type
>> detection, and we can extend data type discovery later on as extensions.
>>
>> Tim
>>
>>
>> On Wed, Nov 14, 2012 at 3:17 PM, Jacques Nadeau <[email protected]>
>> wrote:
>>
>>> One of the goals we've talked about for Drill is the ability to consume
>>> "schemaless" data. What this really means to me is data such as JSON,
>>> where the schema of the data could change from record to record (and
>>> isn't known until query execution). I'd suggest that in most cases, the
>>> schema within a JSON 'source' (a collection of similar files) is mostly
>>> stable. The default JSON format passes this schema data with each
>>> record, which would be the simplest way to manage it. However, if Drill
>>> operated in this manner we'd likely have to manage fairly different
>>> code paths for data with schema versus data without. There would also
>>> likely be substantial processing and message-size overhead in handling
>>> all the schema information for each record. A couple of notes:
>>>
>>>    - By schema here I mean the structure of the key names and the
>>>    nesting of the data, as opposed to value data types...
>>>    - A simple example: we have a user table and one of the query
>>>    expressions is user.phone-numbers. If we query that without a
>>>    schema, we don't know if it is a scalar, a map, or an array. Thus...
>>>    we can't figure out the number of "fields" in the output stream.
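To make that ambiguity concrete: all three of the records below are
plausible in the same source, and nothing short of reading them tells us
which shape user.phone-numbers has (illustrative data only):

    {"user": {"phone-numbers": "415-555-0100"}}                    <- scalar
    {"user": {"phone-numbers": ["415-555-0100", "415-555-0199"]}}  <- array
    {"user": {"phone-numbers": {"home": "415-555-0100"}}}          <- map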
>>>
>>> Separately, we've also talked before about having all the main
>>> execution components operate on batches of records as a single work
>>> unit (probably in MsgPack streaming format or similar).
>>>
>>> One way to manage schemaless data within these parameters is to
>>> generate a concrete schema of the data as we're reading it and send it
>>> with each batch of records. To start, we could resend it with every
>>> batch. Later, we could add an optimization so that the schema is only
>>> sent when it changes. A nice additional option would be to store this
>>> schema stream as we're running the first queries so we can treat the
>>> data as fully schemaed on later queries. (And also provide that schema
>>> back to whatever query parser is being used.)
>>>
>>> Thoughts? What about data type discovery in schemaless data such as
>>> JSON, CSV, etc.?
>>>
>>> Jacques
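On the batch-level schema idea, here's a minimal sketch of the
send-only-on-change behavior, assuming records arrive as parsed
Maps/Lists/scalars the way a JSON parser would hand them to us (all class
and method names below are placeholders, not anything committed):

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch only: infer the structural schema of each batch and attach it
    // to the batch only when it differs from the last schema we sent.
    public class BatchSchemaTracker {

        private Object lastSent;  // schema shipped with the previous batch

        /** Schema to ship with this batch, or null if unchanged. */
        public Object schemaForBatch(List<Map<String, Object>> batch) {
            Object schema = null;
            for (Map<String, Object> record : batch) {
                Object shape = shapeOf(record);
                schema = (schema == null) ? shape : merge(schema, shape);
            }
            if (schema == null || schema.equals(lastSent)) {
                return null;  // downstream keeps using the schema it has
            }
            lastSent = schema;
            return schema;
        }

        // Record only structure (key names + nesting), not value types.
        static Object shapeOf(Object value) {
            if (value instanceof Map) {
                Map<String, Object> shape = new LinkedHashMap<>();
                ((Map<?, ?>) value).forEach(
                    (k, v) -> shape.put(String.valueOf(k), shapeOf(v)));
                return shape;
            }
            if (value instanceof List) {
                List<?> l = (List<?>) value;
                // Describe elements by the first one's shape; a real impl
                // would union the shapes of all elements.
                return List.of(l.isEmpty() ? "scalar" : shapeOf(l.get(0)));
            }
            return "scalar";
        }

        // Union two shapes; a real impl would track scalar/array/map
        // conflicts instead of letting the latest shape win.
        @SuppressWarnings("unchecked")
        static Object merge(Object a, Object b) {
            if (a instanceof Map && b instanceof Map) {
                Map<String, Object> out =
                    new LinkedHashMap<>((Map<String, Object>) a);
                ((Map<String, Object>) b).forEach(
                    (k, v) -> out.merge(k, v, BatchSchemaTracker::merge));
                return out;
            }
            return b;
        }
    }

The merge step is where record-to-record variation gets absorbed, and the
diff optimization then falls out of a simple equality check against the
last schema sent.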
