Wail,

To help us understand the technical details of the approach, could you share a few illustrative examples showing the input and output?
Thanks, and happy new year to you too!

Chen

On Wed, Dec 30, 2015 at 10:26 PM, Wail Alkowaileet <[email protected]> wrote:
> Hi Chen,
>
> The schema inferencer API currently works on the printer side (i.e., it
> operates on the result output). Therefore, the schema is computed per
> partition, and when the user asks for the schema, the schemas of all
> partitions get "unioned" according to a policy defined by the
> implementation of the schema inferencer API.
>
> The inferencer works per item type. Therefore, a mix of open and closed
> types doesn't matter if the data is homogeneous (i.e., there are *no* two
> items at the same nesting level with different types), as the resulting
> schema will be the union, with nullables for the missing fields. However,
> for heterogeneous types, it's again up to the API implementation. In the
> Spark world, heterogeneous types are treated as strings and it's up to the
> user to parse that string. In the Asterix case, we might take a different
> approach by utilizing the current built-in union type.
>
> For the "inferred" type, I imagine having some sort of versioning approach
> as described in [1] and building a secondary index on "version_id" instead
> of storing the ids in the property-node. That's why I actually asked about
> the histograms, which can play a big role in determining the expected
> schema for a query at compile time, instead of having the execution engine
> inspect every type. It's a JIT-like compiler for AQL.
>
> I know it sounds "ugly", as it probably requires index and metadata lookups
> for every insert. But the whole idea is undercooked and needs more
> elaboration to get a good picture of whether it would be beneficial.
>
> [1]
> http://btw-2015.de/res/proceedings/Hauptband/Wiss/Klettke-Schema_Extraction_and_Stru.pdf
>
> Thanks and Happy New Year :-)
>
>
> On Wed, Dec 30, 2015 at 10:05 PM, Chen Li <[email protected]> wrote:
>
>> Sounds very interesting. A basic question about "inference."
>> Is the inferred schema unique? In other words, is it possible to get two
>> schemas from the same instance, especially considering open types and
>> closed types?
>>
>> Chen
>>
>> On Fri, Dec 25, 2015 at 3:20 PM, Wail Alkowaileet <[email protected]>
>> wrote:
>> > Dear Devs,
>> >
>> > First of all, Happy Holidays :)
>> >
>> > I want to share with you my latest work on AsterixDB: the Asterix Schema
>> > Provider Framework.
>> > The design document will be shared soon, once I fully integrate it with
>> > the new Asterix Messaging Framework.
>> >
>> > Summary:
>> > The main aim of the Schema Provider Framework is to help the user
>> > understand the schema of the query result.
>> >
>> > Motivation:
>> > I'm currently working on building an AsterixDB-Spark connector. Spark
>> > works perfectly with JSON; however, it has to scan the whole result to
>> > infer the schema. To spare Spark this pass, Asterix can infer the schema
>> > while materializing the result.
>> >
>> > Additionally, Asterix users can get the schema information in a
>> > Thrift/ADM-like format, which can help them build the classes required
>> > to deserialize the result in their code.
>> >
>> > Brief description of how it works:
>> > Once the user asks for the schema to be inferred, the schema builder will
>> > follow the result printer (APrinterVisitor) to build up the information
>> > about the record, list, and field types. Then it will compute the final
>> > schema (union) of the resulting output in a single pass.
>> >
>> > User-model:
>> > To see the tentative user model, please check the doc:
>> > https://github.com/Nullification/incubator-asterixdb/blob/master/asterix-doc/src/site/markdown/api.md
>> >
>> > Also see the attached images for screenshots of the web GUI,
>> > including the resulting schema.
>> >
>> >
>> > Future "Ambitious" Applications:
>> > One low-hanging-fruit application is to extend the Asterix open/closed
>> > types to include yet another type called "inferred".
>> > Inferred types will ask Asterix to build the schema information on
>> > ingestion. Inferred types can be very helpful, at least when you have a
>> > schema that looks like one of our datasets (see attached wosType.adm),
>> > where you can have multiple fields with similar names and different
>> > "schemas" or nested types.
>> >
>> > The inferred type is a hybrid type (closed and open) which can have the
>> > flexibility of the open type, with performance and storage footprint
>> > close to those of the closed type.
>> >
>> > The inferred type is probably good for read-intensive applications. For
>> > write-intensive ones, where every CPU cycle counts, it can introduce some
>> > unnecessary overhead. But there is probably a clever solution with some
>> > adaptive sampling techniques.
>> >
>> > I'll be investigating this further and will share my thoughts later on :-))
>> >
>> > Have a wonderful holiday and happy weekend!
>> > --
>> >
>> > Regards,
>> > Wail Alkowaileet
>
>
>
> --
>
> *Regards,*
> Wail Alkowaileet
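
As a rough illustration of the input and output discussed in this thread, here is a minimal Python sketch of per-partition inference followed by a union. It is not the Asterix schema inferencer API: the function names (infer, union, infer_partition), the dict-based schema representation, and the sample records are all assumptions made for this example. The union policy used here is the Spark-like one mentioned above (conflicting types fall back to string); an Asterix implementation might use its built-in union type instead.

```python
# Illustrative sketch only; names and schema representation are hypothetical.
# Primitives are type-name strings, lists are {"list": item_type}, and
# records are {"record": {field: {"type": ..., "nullable": bool}}}.

def infer(value):
    """Infer the schema of a single JSON-like value."""
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        item = None
        for v in value:
            item = union(item, infer(v))
        return {"list": item}
    if isinstance(value, dict):
        return {"record": {k: {"type": infer(v), "nullable": False}
                           for k, v in value.items()}}
    raise TypeError(f"unsupported value: {value!r}")

def union(a, b):
    """Union two schemas; None means 'nothing seen yet'."""
    if a is None:
        return b
    if b is None or a == b:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        if "record" in a and "record" in b:
            fa, fb = a["record"], b["record"]
            merged = {}
            for name in sorted(fa.keys() | fb.keys()):
                if name in fa and name in fb:
                    merged[name] = {
                        "type": union(fa[name]["type"], fb[name]["type"]),
                        "nullable": fa[name]["nullable"] or fb[name]["nullable"],
                    }
                else:
                    # Field missing on one side becomes nullable in the union.
                    present = fa.get(name) or fb.get(name)
                    merged[name] = {"type": present["type"], "nullable": True}
            return {"record": merged}
        if "list" in a and "list" in b:
            return {"list": union(a["list"], b["list"])}
    # Heterogeneous types: Spark-like fallback to string (assumed policy).
    return "string"

def infer_partition(records):
    """Compute one schema for a partition in a single pass over its records."""
    schema = None
    for record in records:
        schema = union(schema, infer(record))
    return schema

# Hypothetical input: two partitions of a query result.
partition_1 = [{"id": 1, "name": "a"}, {"id": 2}]
partition_2 = [{"id": "x3", "tags": ["t"]}]

final_schema = union(infer_partition(partition_1), infer_partition(partition_2))
print(final_schema)
```

With this input, "name" and "tags" come out nullable because they are missing from some records, and "id" falls back to string because it is an integer in one partition and a string in the other.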
