Re: Asterix Schema Provider Framework

Wail Alkowaileet Wed, 13 Jan 2016 14:05:59 -0800

Sorry, I forgot to include the links to the code:
https://github.com/Nullification/incubator-asterixdb
https://github.com/Nullification/incubator-asterixdb-hyracks


It currently lives on my GitHub; I will push it to Gerrit soon.

Thanks.

On Wed, Jan 13, 2016 at 4:55 PM, Wail Alkowaileet <[email protected]>
wrote:

> Hello Chen,
>
> Sorry for the late reply; I was swamped preparing for a workshop here in
> Boston.
> I also wanted to prepare a comprehensive design document that includes all
> the details of the schema inferencer framework I built.
>
> Please refer to it at:
> https://docs.google.com/document/d/1Ue-yAWoLChOJ8JlkbXWdDW0tSW9szzP76ePL4wQRmP0/edit#
>
> To save you time (the document is a bit long), here is a short summary.
> Let's assume we have the following input:
>
> {name: {
>    display_name: "Boxer, Laurence",
>    first_name: "Laurence",
>    full_name: "Boxer, Laurence",
>    reprint: "Y",
>    role: "author",
>    wos_standard: "Boxer, L",
>    last_name: "Boxer",
>    seq_no: "1"
> }}
>
> {name:{
>    display_name: "Adamek, Jiri",
>    first_name: "Jiri",
>    addr_no: "1",
>    full_name: "Adamek, Jiri",
>    reprint: "Y",
>    role: "author",
>    wos_standard: "Adamek, J",
>    last_name: "Adamek",
>    dais_id: "10121636",
>    seq_no: "1"
> }}
>
> Since the "tuples" are all of type record, the schema inferencer computes
> the schema as the union of all the records' fields. Fields that do not
> appear in every record (addr_no and dais_id here) become optional.
>
> *As ADM types:*
> create type nameType1 as closed{
>
> display_name: string,
> first_name:string,
> addr_no:string?,
> full_name: string,
> reprint:string,
> role:string,
> wos_standard:string,
> last_name:string,
> dais_id:string?,
> seq_no:string
>
> }
>
> create type datasetType as closed{
>
> name: nameType1
>
> }
>
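A minimal Python sketch of the field-union step described above, for illustration only. It is not the code in the linked repositories: the function name `infer_closed_record` and the type-name table are hypothetical, and it assumes flat records in which each field always carries the same primitive type.

```python
def infer_closed_record(records):
    """Union the fields of flat records into {field: type name}.

    A trailing '?' marks a field that is missing from at least one record.
    Assumption: a field has the same primitive type wherever it appears,
    so only the first observed type is kept.
    """
    # Hypothetical mapping from Python types to ADM-style type names.
    adm_names = {str: "string", int: "int64", float: "double", bool: "boolean"}
    schema = {}  # field -> type name, in first-seen order
    counts = {}  # field -> number of records that contain the field
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, adm_names[type(value)])
            counts[field] = counts.get(field, 0) + 1
    return {
        field: type_name + ("" if counts[field] == len(records) else "?")
        for field, type_name in schema.items()
    }


# Abbreviated versions of the two "name" records from the example input.
boxer = {"display_name": "Boxer, Laurence", "reprint": "Y", "seq_no": "1"}
adamek = {"display_name": "Adamek, Jiri", "addr_no": "1", "reprint": "Y",
          "dais_id": "10121636", "seq_no": "1"}

name_type = infer_closed_record([boxer, adamek])
for field, type_name in name_type.items():
    print(f"{field}: {type_name}")
```

With these two records, `addr_no` and `dais_id` come out as `string?` because only the second record has them, while the shared fields stay plain `string`.
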
> However, consider heterogeneous types, as in the following example:
>
> name: {
>    display_name: "Boxer, Laurence",
>    first_name: "Laurence",
>    full_name: "Boxer, Laurence",
>    reprint: "Y",
>    role: "author",
>    wos_standard: "Boxer, L",
>    last_name: "Boxer",
>    seq_no: "1"
> }
>
> name: [
>    {
>        display_name: "Adamek, Jiri",
>        first_name: "Jiri",
>        addr_no: "1",
>        full_name: "Adamek, Jiri",
>        reprint: "Y",
>        role: "author",
>        wos_standard: "Adamek, J",
>        last_name: "Adamek",
>        dais_id: "10121636",
>        seq_no: "1"
>    },
>    {
>        display_name: "Koubek, Vaclav",
>        first_name: "Vaclav",
>        addr_no: "2",
>        full_name: "Koubek, Vaclav",
>        role: "author",
>        wos_standard: "Koubek, V",
>        last_name: "Koubek",
>        dais_id: "12279647",
>        seq_no: "2"
>    }
> ]
>
> As you can see, the field "name" is sometimes a record and sometimes an
> ordered list. In this case, Apache Spark simply infers name as a String.
>
> In Asterix's case, we can infer this type as a UNION of a record and a list
> of records.
>
> *As ADM types:*
> create type nameType1 as closed{
>
> display_name: string,
> first_name:string,
> full_name: string,
> reprint:string,
> role:string,
> wos_standard:string,
> last_name:string,
> seq_no:string
>
> }
>
> create type nameType2 as closed{
>
> display_name: string,
> first_name:string,
> addr_no:string,
> full_name: string,
> reprint:string?,
> role:string,
> wos_standard:string,
> last_name:string,
> dais_id:string,
> seq_no:string
>
> }
>
> create type datasetType as closed{
>
> name: union(nameType1, [nameType2])
>
> }
>
>
>
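A rough Python sketch of how a field's type could be widened to a union when its shape differs between records, as in the heterogeneous example above. This is an illustration, not the framework's implementation: the names `shape`, `combine`, and `infer_field_type` are hypothetical, every primitive is treated as a string (as in the example data), and the `any` placeholder for an empty list is my own assumption.

```python
def combine(shapes):
    """Collapse a list of distinct shape descriptors into a single descriptor."""
    if not shapes:
        return "any"  # assumption: placeholder when nothing was observed
    if len(shapes) == 1:
        return shapes[0]
    return "union(" + ", ".join(shapes) + ")"


def shape(value):
    """Describe a parsed value as 'string', 'record', or '[<item shape>]'."""
    if isinstance(value, dict):
        return "record"
    if isinstance(value, list):
        # Distinct shapes of the list items, sorted for a stable result.
        item_shapes = sorted({shape(item) for item in value})
        return "[" + combine(item_shapes) + "]"
    return "string"  # assumption: all primitives are strings, as in the example


def infer_field_type(records, field):
    """Collect the distinct shapes seen for `field`, in first-seen order."""
    observed = []
    for record in records:
        current = shape(record[field])
        if current not in observed:
            observed.append(current)
    return combine(observed)


# Abbreviated versions of the heterogeneous "name" values from the example.
records = [
    {"name": {"first_name": "Laurence", "last_name": "Boxer"}},
    {"name": [{"first_name": "Jiri", "last_name": "Adamek"},
              {"first_name": "Vaclav", "last_name": "Koubek"}]},
]

print(infer_field_type(records, "name"))  # union(record, [record])
```

The printed descriptor corresponds to union(nameType1, [nameType2]) above; naming the record types and unioning their fields would be a separate step.
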


-- 

*Regards,*
Wail Alkowaileet
