For schemaless data, I think the tree trunk (the stable core of the schema) exists and is known, so we only need to detect changes.
Jacques Nadeau <[email protected]> wrote:
>One of the goals we've talked about for Drill is the ability to consume
>"schemaless" data. What this really means to me is data such as JSON where
>the schema of data could change from record to record (and isn't known
>until query execution). I'd suggest that in most cases, the schema within
>a JSON 'source' (collection of similar files) is mostly stable. The
>default JSON format passes this schema data with each record. This would
>be the simplest way to manage this data. However, if Drill operated in
>this manner we'd likely have to manage fairly different code paths for data
>with schema versus those without. There also seems like there would be a
>substantial processing and message size overhead interacting with all the
>schema information for each record. Couple of notes:
>
>   - By schema here I more mean the structure of the key names and nested
>   structure of the data as opposed to value data types...
>   - A simple example: we have a user table and one of the query
>   expressions is user.phone-numbers. If we query that without schema, we
>   don't know if that is a scalar, a map or an array. Thus... we can't figure
>   out the number of "fields" in the output stream.
>
>
>Separately, we've also talked before about having all the main executional
>components operating on batches of records as a single work unit
>(probably in MsgPack streaming format or similar).
>
>One way to manage schemaless data within these parameters is to generate a
>concrete schema of data as we're reading it and send it with each batch
>of records. To start, we could resend it with every batch. Later, we
>could add an optimization that the schema is only sent when it changes. A
>nice additional option would be to store this schema stream as we're
>running the first queries so we can treat this data as fully schemaed on
>later queries. (And also provide that schema back to whatever query parser
>is being used.)
>
>Thoughts? What about thoughts on data types discovery in schemaless data
>such as JSON, CSV, etc?
>
>Jacques
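
To make the "send the schema only when it changes" idea concrete, here is a minimal Python sketch. It is only an illustration under assumptions, not anything from Drill: the names `paths` and `batches_with_schema` are hypothetical, records are assumed to be already-parsed JSON objects, and "schema" means structure only (key names and scalar/map/array nesting), not value data types.

```python
def paths(value, prefix=""):
    """Yield (path, kind) pairs describing structure only, not value types."""
    if isinstance(value, dict):
        kind = "map"
    elif isinstance(value, list):
        kind = "array"
    else:
        kind = "scalar"
    if prefix:
        yield (prefix, kind)
    if isinstance(value, dict):
        for key, child in value.items():
            yield from paths(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for child in value:
            yield from paths(child, prefix + "[]")


def batches_with_schema(records, batch_size):
    """Yield (schema_or_None, batch); the schema is attached only when it changed."""
    last_sent = None
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        # Concrete schema for this batch: union of every record's structure.
        schema = frozenset(p for record in batch for p in paths(record))
        if schema != last_sent:
            last_sent = schema
            yield sorted(schema), batch
        else:
            yield None, batch


# Hypothetical user records: phone-numbers starts as a scalar, then becomes an array.
records = [
    {"name": "a", "phone-numbers": "555-0100"},
    {"name": "b", "phone-numbers": "555-0101"},
    {"name": "c", "phone-numbers": ["555-0102", "555-0103"]},
    {"name": "d", "phone-numbers": ["555-0104"]},
    {"name": "e", "phone-numbers": ["555-0105"]},
    {"name": "f", "phone-numbers": ["555-0106"]},
]

for schema, batch in batches_with_schema(records, batch_size=2):
    print("schema sent:" if schema else "schema unchanged", schema)
```

With these records, the first two batches each carry a schema (phone-numbers is a scalar in the first and an array in the second), while the third batch reuses the schema already sent. Downstream operators would then always know the number of output "fields" for the batch in hand.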
