Re: Asterix Schema Provider Framework

Chen Li Wed, 30 Dec 2015 22:58:07 -0800

Wail,

To help us understand the technical details of the approach, is it
possible to use a few illustrative examples to show the input and
output?


Thanks, and happy new year to you too!

Chen

On Wed, Dec 30, 2015 at 10:26 PM, Wail Alkowaileet <[email protected]> wrote:
> Hi Chen,
>
> The schema inferencer API currently works on the printer side (i.e. it's
> for the result output). Therefore, the schema is computed per partition and
> when the user asks for the schema, the schemas of all partitions get
> "unioned" according to a policy defined by the implementation of the
> schema inferencer API.
>
> The inferencer works per item type. Therefore, for a mix of open and closed
> types, the distinction doesn't matter if the data is homogeneous (i.e. there
> are *no* two items at the same nesting level with different types), as the
> resulting schema will be the union, with missing fields marked nullable.
> However, for heterogeneous types, it's again up to the API implementation.
> In the Spark world, heterogeneous types are treated as strings and it's up
> to the user to parse that string. In Asterix's case, we might take a
> different approach by utilizing the existing built-in union type.
>
> For the "inferred" type, I imagine having some sort of versioning approach
> as described in [1], building a secondary index on "version_id" instead of
> storing the ids in the property-node. That's why I actually asked about the
> histograms, which can play a big role in determining the expected schema
> for a query at compile time, instead of having the execution engine inspect
> every type. It's like a JIT compiler for AQL.
>
> I know it sounds "ugly", as it probably requires index and metadata lookups
> for every insert. But the whole idea is undercooked and needs more
> elaboration to get a good picture of whether it would be beneficial.
>
> [1]
> http://btw-2015.de/res/proceedings/Hauptband/Wiss/Klettke-Schema_Extraction_and_Stru.pdf
>
> Thanks and Happy New Year :-)
>
>
> On Wed, Dec 30, 2015 at 10:05 PM, Chen Li <[email protected]> wrote:
>
>> Sounds very interesting.  A basic question about "inference."  Is the
>> inferred schema unique?  In other words, is it possible to get two
>> schemas from the same instance, especially considering open types and
>> closed types?
>>
>> Chen
>>
>> On Fri, Dec 25, 2015 at 3:20 PM, Wail Alkowaileet <[email protected]>
>> wrote:
>> > Dear Devs,
>> >
>> > First of all, Happy Holidays :)
>> >
>> > I want to share with you my latest work on AsterixDB, Asterix Schema
>> > Provider Framework.
>> > The design document will be shared soon, once I fully integrate it with
>> > the new Asterix Messaging Framework.
>> >
>> > Summary:
>> > The main aim of the Schema Provider Framework is to help the user to
>> > understand the schema of the query result.
>> >
>> > Motivation:
>> > I'm currently working on building an AsterixDB-Spark connector. Spark
>> > works with JSON perfectly; however, it has to scan the whole result to
>> > infer the schema. To prevent Spark from doing this pass, Asterix can infer
>> > the schema while materializing the result.
>> >
>> > Additionally, Asterix users can get the schema information in a
>> > Thrift/ADM-like format, which can help them build the classes required
>> > to deserialize the result in their code.
>> >
>> > Brief description of how it works:
>> > Once the user asks for the schema to be inferred, the schema builder will
>> > follow the result printer (APrinterVisitor) to build up the information
>> > about the record, list and field types. Then it will compute the final
>> > schema (union) of the resulting output in a single pass.
>> >
>> > User-model:
>> > To see the tentative user model, please check the doc:
>> >
>> > https://github.com/Nullification/incubator-asterixdb/blob/master/asterix-doc/src/site/markdown/api.md
>> >
>> > Also see the attached images for screenshots of the web-gui interface
>> > including the resulting schema.
>> >
>> >
>> > Future "Ambitious" Applications:
>> > One low-hanging-fruit application is to extend Asterix's open/closed
>> > types to include yet another type called "inferred".
>> > Inferred types will ask Asterix to build the schema information on
>> > ingestion. Inferred types can be very helpful, at least when you have a
>> > schema that looks like one of our datasets (see attached wosType.adm),
>> > where you can have multiple fields with similar names and different
>> > "schemas" or nested types.
>> >
>> > An inferred type is a hybrid type (closed and open) which can have the
>> > flexibility of the open type with performance and a storage footprint
>> > close to those of the closed type.
>> >
>> > The inferred type is probably good for read-intensive applications. For
>> > write-intensive ones, where every CPU cycle counts, it can introduce some
>> > unnecessary overhead. But there is probably a clever solution using some
>> > adaptive sampling techniques.
>> >
>> > I'll be investigating more about this and share my thoughts later on :-))
>> >
>> > Have a wonderful holiday and happy weekend!
>> > --
>> >
>> > Regards,
>> > Wail Alkowaileet
>>
>
>
>
> --
>
> *Regards,*
> Wail Alkowaileet
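
As an illustration of the per-partition inference and "union" step described
in the quoted message, here is a minimal sketch in Python. It is not the
Asterix implementation: the schema representation, the type names ("int64",
"double", ...), and the functions infer, merge and infer_partition are all
hypothetical, chosen only to show one possible input and output. Missing
fields become nullable, and heterogeneous types follow a policy: "string"
(the Spark-like fallback mentioned above) or "union".

```python
from functools import partial, reduce


def _kind(schema):
    # Primitives are plain strings; records, lists and unions are dicts.
    return schema["kind"] if isinstance(schema, dict) else "primitive"


def merge(a, b, policy="string"):
    """Union two schemas. `None` means "no schema seen yet"."""
    if a is None:
        return b
    if b is None:
        return a
    if a == b:
        return a
    if _kind(a) == _kind(b) == "record":
        fields = {}
        names = list(a["fields"]) + [n for n in b["fields"] if n not in a["fields"]]
        for name in names:
            fa, fb = a["fields"].get(name), b["fields"].get(name)
            if fa and fb:
                fields[name] = {
                    "type": merge(fa["type"], fb["type"], policy),
                    "nullable": fa["nullable"] or fb["nullable"],
                }
            else:
                # Field missing on one side: keep its type, mark it nullable.
                fields[name] = {"type": (fa or fb)["type"], "nullable": True}
        return {"kind": "record", "fields": fields}
    if _kind(a) == _kind(b) == "list":
        return {"kind": "list", "item": merge(a["item"], b["item"], policy)}
    # Heterogeneous types: the outcome is a policy decision (assumption).
    if policy == "string":
        return "string"
    types = []
    for s in (a, b):
        for t in (s["types"] if _kind(s) == "union" else [s]):
            if t not in types:
                types.append(t)
    return {"kind": "union", "types": types}


def infer(value, policy="string"):
    """Infer the schema of a single value in this simplified type model."""
    if isinstance(value, dict):
        fields = {k: {"type": infer(v, policy), "nullable": False}
                  for k, v in value.items()}
        return {"kind": "record", "fields": fields}
    if isinstance(value, list):
        item = reduce(partial(merge, policy=policy),
                      (infer(v, policy) for v in value), None)
        return {"kind": "list", "item": item}
    if isinstance(value, bool):  # bool is a subclass of int, so test it first
        return "boolean"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    raise TypeError(f"unsupported value: {value!r}")


def infer_partition(items, policy="string"):
    """Single pass over one partition's result items."""
    return reduce(partial(merge, policy=policy),
                  (infer(i, policy) for i in items), None)


# Two partitions of a query result (made-up data).
p1 = [{"id": 1, "name": "a"}, {"id": 2}]
p2 = [{"id": 3, "tags": ["x"]}, {"id": "four"}]

as_string = merge(infer_partition(p1), infer_partition(p2))
as_union = merge(infer_partition(p1, "union"), infer_partition(p2, "union"), "union")
```

With the "string" policy, the merged record has "id" typed as string, while
"name" and "tags" are nullable because some items lack them. With the "union"
policy, "id" instead becomes a union of int64 and string. A real
implementation would need a more careful treatment of unions that contain
records or lists; this sketch only deduplicates union members by equality.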
