Timothy,

> IDL
Great!
> Filtering
I've created a document--just copying and pasting my email thoughts.  You
and I have edit, everyone else has comment.  (If anyone else wants edit, let
us know.)  We can start to iterate there.

https://docs.google.com/document/d/1hQMjmqmjw7ptBx0TbBRlGtEeXWhE2a-CinZx5ffm9Bg/edit

Thanks,
Jacques


On Thu, Nov 15, 2012 at 12:02 PM, Timothy Chen <[email protected]> wrote:

> Hi Jacques,
>
> I'll add a simple IDL that we can iterate on.
>
> About the filtering discussion, do you want to bring this discussion to a
> google doc?
>
> Tim
>
> Sent from my iPhone
>
> On Nov 14, 2012, at 9:42 PM, Jacques Nadeau <[email protected]> wrote:
>
> > Hey Timothy,
> >
> > It's great that you started pulling something together.  Thanks for taking
> > the initiative!  Do you want to spend some time looking at trying to define
> > an IDL for MsgPack for schema information and add that to your work?
> >
> > We also need to come up with a standard selection/filter
> > vocabulary/approach.  It would preferably cover things like:
> >
> >    - Support simple field/tree inclusion lists and wildcards.
> >       - Classic relational like {column1, column2, column3}
> >       - Nested like {arrayColumn1.[*], mapColumn.foo}
> >    - Support some kind of filters that could prune records, leaves, or
> >    branches
> >       - only include the first three sub-elements
> >       - only include map keys that start with "user%"
> >       - only include this record where at least one
> >       arrayColumn.phone-number starts with "415%"
> >
> > One idea might be to conceive of a fourth concept on top of the classic
> > (table|scalar|aggregate) functions called tree functions and generate a set
> > of primitives for that.  Then allow scalar functions inside tree function
> > evaluation.  (I haven't thought a great deal about what this means.)
> > I've also thought that xpath might be a good place to look for conceptual
> > inspiration.  (But I don't think we have any interest to go to that
> > level...)
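[Editor's sketch: the selection ideas above (inclusion lists, wildcards, nested paths like arrayColumn1.[*]) can be pictured as a projection over a record tree. This is a minimal, hypothetical Python sketch, not anything Drill defined: the path syntax, the `select` helper, and the use of fnmatch-style `*` globs in place of the SQL-style `%` wildcard above are all assumptions.]

```python
import fnmatch


def select(node, path):
    """Project a nested record down to the fields named by a path.

    Each path segment is a literal key, "[*]" for every array element,
    or an fnmatch-style glob (standing in for the "user%" syntax above).
    Hypothetical helper for illustration only -- not Drill code.
    """
    if not path:
        return node
    head, rest = path[0], path[1:]
    if head == "[*]" and isinstance(node, list):
        # descend into every array element
        return [select(item, rest) for item in node]
    if isinstance(node, dict):
        # keep every key matching the (possibly glob) segment
        matches = [k for k in node if fnmatch.fnmatch(k, head)]
        return {k: select(node[k], rest) for k in matches}
    return node


record = {
    "user": {"name": "al", "phone-numbers": ["415-555-0100", "206-555-0199"]},
    "meta": {"ts": 1},
}

# Nested selection like the {arrayColumn1.[*], mapColumn.foo} example:
print(select(record, ["user", "phone-numbers", "[*]"]))
# -> {'user': {'phone-numbers': ['415-555-0100', '206-555-0199']}}

# Glob over map keys, like 'keys that start with "user%"':
print(select(record, ["user*"]))
```

A record-pruning filter (e.g. "at least one arrayColumn.phone-number starts with 415") would be a separate predicate evaluated over the same tree walk; this sketch only covers projection.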
> > Does any of this sound interesting?  (That also goes for anyone out there
> > who is lurking...)
> >
> > Thanks again,
> > Jacques
> >
> >
> > On Wed, Nov 14, 2012 at 5:45 PM, Timothy Chen <[email protected]> wrote:
> >
> >> I don't have much to add to the options you've suggested; I do agree
> >> storing the schema and sending the diffs will be the most ideal way to go.
> >>
> >> And since we already need to look at every row, we can build the schema
> >> diffs pretty easily.
> >>
> >> I currently have a simple JSON -> MsgPack impl using Yajl here:
> >> https://github.com/tnachen/incubator-drill/tree/executor/sandbox/executor
> >>
> >> Depending on the parser we use, most already have basic type detection, and
> >> we can extend data-type discovery later on as extensions.
> >>
> >> Tim
> >>
> >>
> >> On Wed, Nov 14, 2012 at 3:17 PM, Jacques Nadeau <[email protected]> wrote:
> >>
> >>> One of the goals we've talked about for Drill is the ability to consume
> >>> "schemaless" data.  What this really means to me is data such as JSON where
> >>> the schema of the data could change from record to record (and isn't known
> >>> until query execution).  I'd suggest that in most cases, the schema within
> >>> a JSON 'source' (a collection of similar files) is mostly stable.  The
> >>> default JSON format passes this schema data with each record.  This would
> >>> be the simplest way to manage this data.  However, if Drill operated in
> >>> this manner we'd likely have to manage fairly different code paths for data
> >>> with schema versus those without.  It also seems like there would be
> >>> substantial processing and message-size overhead interacting with all the
> >>> schema information for each record.  A couple of notes:
> >>>
> >>>    - By schema here I mean more the structure of the key names and the
> >>>    nested structure of the data, as opposed to value data types...
> >>>    - A simple example: we have a user table and one of the query
> >>>    expressions is user.phone-numbers.  If we query that without schema, we
> >>>    don't know if that is a scalar, a map or an array.  Thus... we can't
> >>>    figure out the number of "fields" in the output stream.
> >>>
> >>> Separately, we've also talked before about having all the main executional
> >>> components operate on batches of records as a single work unit (probably
> >>> in MsgPack streaming format or similar).
> >>>
> >>> One way to manage schemaless data within these parameters is to generate a
> >>> concrete schema of the data as we're reading it and send it with each batch
> >>> of records.  To start, we could resend it with every batch.  Later, we
> >>> could add an optimization so that the schema is only sent when it changes.
> >>> A nice additional option would be to store this schema stream as we're
> >>> running the first queries so we can treat this data as fully schemaed on
> >>> later queries.  (And also provide that schema back to whatever query parser
> >>> is being used.)
> >>>
> >>> Thoughts?  What about thoughts on data-type discovery in schemaless data
> >>> such as JSON, CSV, etc.?
> >>>
> >>> Jacques
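[Editor's sketch: the schema-per-batch scheme described above (resend the schema with every batch at first; later, send it only when it changes) can be illustrated roughly as follows. All names are hypothetical and this is not Drill code; it resends the whole schema rather than Tim's diffs, and it infers shape from only the first record of each batch, where a real implementation would merge shapes across every record.]

```python
def infer_shape(value):
    """Infer the structural schema of a record: key names and nesting
    only, not value types (per the note above).  Hypothetical helper."""
    if isinstance(value, dict):
        return {k: infer_shape(v) for k, v in value.items()}
    if isinstance(value, list):
        # shape of the first element only; a real version would merge all
        return [infer_shape(value[0])] if value else []
    return "scalar"


def batches_with_schema(batches):
    """Pair each outgoing batch with a schema, resending it only when it
    changes (the optimization described above).  Yields (schema, batch),
    with schema=None meaning "unchanged since the last batch"."""
    last = None
    for batch in batches:
        # first record stands in for the whole batch (simplification)
        schema = infer_shape(batch[0]) if batch else last
        yield (schema if schema != last else None, batch)
        if schema is not None:
            last = schema


batches = [
    [{"user": {"name": "al"}}],
    [{"user": {"name": "bo"}}],  # same shape: schema omitted
    [{"user": {"name": "cy", "phone-numbers": ["415-555-0100"]}}],  # new field
]
for schema, batch in batches_with_schema(batches):
    print(schema, "->", len(batch), "record(s)")
```

Persisting the yielded schema stream is what would let later queries treat the same source as fully schemaed, as suggested above.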

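[Editor's sketch, on the closing question of data-type discovery in schemaless data: JSON parsers such as Yajl already report basic value types, as Tim notes, but CSV delivers every value as a string, so types must be guessed and then widened across rows. A minimal, hypothetical Python sketch with a deliberately toy type lattice; none of these names come from Drill.]

```python
def discover_type(token):
    """Guess the type of one string token from schemaless input (e.g. a
    CSV cell).  Hypothetical helper for illustration only."""
    try:
        int(token)
        return "int"
    except ValueError:
        pass
    try:
        float(token)
        return "float"
    except ValueError:
        pass
    if token.strip().lower() in ("true", "false"):
        return "boolean"
    return "string"


def discover_column_types(rows):
    """Widen each column to the most general type seen in any row.
    Toy widening rank: int < float/boolean < string."""
    rank = {"int": 0, "float": 1, "boolean": 1, "string": 2}
    widest = ["int"] * len(rows[0])
    for row in rows:
        for i, token in enumerate(row):
            t = discover_type(token)
            if rank[t] > rank[widest[i]]:
                widest[i] = t
    return widest


rows = [["1", "2.5", "true"], ["7", "3", "false"], ["x", "4.0", "true"]]
print(discover_column_types(rows))  # -> ['string', 'float', 'boolean']
```

Note the deliberate simplifications: the rank table cannot distinguish a column mixing floats and booleans, and tokens like phone numbers ("415-555-0100") correctly fall through to "string" because int() and float() reject them.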