Re: Asterix Schema Provider Framework

Till Westmann Wed, 13 Jan 2016 14:21:18 -0800

Hi Wail,

thanks for writing this up!

I took a brief look and everything looks great, but there’s one thing that surprised me a bit: the modifications in Algebricks. It seemed to me that all the actual data and schema management should happen in AsterixDB and that Algebricks doesn’t really need to know about this.

Is there a (clean) way to keep all of this in AsterixDB?

Or do you think that we need a (possibly more generic) extension point in Algebricks to support this feature?


Cheers,
Till

On 13 Jan 2016, at 14:04, Wail Alkowaileet wrote:

Sorry, I forgot to include a link to the code:
https://github.com/Nullification/incubator-asterixdb
https://github.com/Nullification/incubator-asterixdb-hyracks

It currently lives on my GitHub; I will push it to Gerrit soon.

Thanks.

On Wed, Jan 13, 2016 at 4:55 PM, Wail Alkowaileet wrote:

Hello Chen,

Sorry for the late reply; I was hammered preparing for a workshop here in Boston.

Also, I wanted to prepare a comprehensive design document that includes all the details about the schema inferencer framework I built.

Please refer to it at:
https://docs.google.com/document/d/1Ue-yAWoLChOJ8JlkbXWdDW0tSW9szzP76ePL4wQRmP0/edit#

To save you time (the document is a bit long), here is a short summary.
Let's assume we have the following input:

{name: {
display_name: "Boxer, Laurence",
first_name: "Laurence",
full_name: "Boxer, Laurence",
reprint: "Y",
role: "author",
wos_standard: "Boxer, L",
last_name: "Boxer",
seq_no: "1"
}}

{name:{
display_name: "Adamek, Jiri",
first_name: "Jiri",
addr_no: "1",
full_name: "Adamek, Jiri",
reprint: "Y",
role: "author",
wos_standard: "Adamek, J",
last_name: "Adamek",
dais_id: "10121636",
seq_no: "1"
}}

As the "tuples" are all of type record, the schema inferencer will compute the schema as the union of all the records' fields. Fields that are missing from some records (here addr_no and dais_id) become optional.

*as an ADM:*
create type nameType1 as closed{

display_name: string,
first_name:string,
addr_no:string?,
full_name: string,
reprint:string,
role:string,
wos_standard:string,
last_name:string,
dais_id:string?,
seq_no:string

}

create type datasetType as closed{

name: nameType1

}
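
The union rule above can be illustrated with a minimal sketch. It is written in Python rather than as AsterixDB code, and it is not taken from the framework itself: the function name infer_record_fields and the ADM_NAMES mapping are hypothetical, and the sketch assumes flat records whose values are primitives. It ignores type conflicts (the first type seen for a field wins).

```python
# Hypothetical sketch: not AsterixDB code, only an illustration of the union rule.
ADM_NAMES = {str: "string", int: "int64", float: "double", bool: "boolean"}


def infer_record_fields(records):
    """Union the fields of flat records; a field missing from any record gets a '?'."""
    counts = {}  # field name -> number of records that contain it
    types = {}   # field name -> ADM type name of the first value seen
    for record in records:
        for field, value in record.items():
            counts[field] = counts.get(field, 0) + 1
            types.setdefault(field, ADM_NAMES[type(value)])
    total = len(records)
    return {
        field: types[field] + ("" if counts[field] == total else "?")
        for field in types
    }


# The two "name" records from the input above.
boxer = {
    "display_name": "Boxer, Laurence", "first_name": "Laurence",
    "full_name": "Boxer, Laurence", "reprint": "Y", "role": "author",
    "wos_standard": "Boxer, L", "last_name": "Boxer", "seq_no": "1",
}
adamek = {
    "display_name": "Adamek, Jiri", "first_name": "Jiri", "addr_no": "1",
    "full_name": "Adamek, Jiri", "reprint": "Y", "role": "author",
    "wos_standard": "Adamek, J", "last_name": "Adamek",
    "dais_id": "10121636", "seq_no": "1",
}
schema = infer_record_fields([boxer, adamek])
```

Here schema maps addr_no and dais_id to "string?" and the remaining eight fields to "string", which matches the optional markers in nameType1.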

However, consider heterogeneous types, as in the following example:

name: {
display_name: "Boxer, Laurence",
first_name: "Laurence",
full_name: "Boxer, Laurence",
reprint: "Y",
role: "author",
wos_standard: "Boxer, L",
last_name: "Boxer",
seq_no: "1"
}

name: [
{
    display_name: "Adamek, Jiri",
    first_name: "Jiri",
    addr_no: "1",
    full_name: "Adamek, Jiri",
    reprint: "Y",
    role: "author",
    wos_standard: "Adamek, J",
    last_name: "Adamek",
    dais_id: "10121636",
    seq_no: "1"
},
{
    display_name: "Koubek, Vaclav",
    first_name: "Vaclav",
    addr_no: "2",
    full_name: "Koubek, Vaclav",
    role: "author",
    wos_standard: "Koubek, V",
    last_name: "Koubek",
    dais_id: "12279647",
    seq_no: "2"
}
]

As you can see, the field "name" is sometimes a record and sometimes an ordered list. In this case, Apache Spark simply infers name as a String.

In Asterix's case, we can infer this type as a UNION of a record and a list of records.

*as an ADM:*
create type nameType1 as closed{

display_name: string,
first_name:string,
full_name: string,
reprint:string,
role:string,
wos_standard:string,
last_name:string,
seq_no:string

}

create type nameType2 as closed{

display_name: string,
first_name:string,
addr_no:string,
full_name: string,
reprint:string?,
role:string,
wos_standard:string,
last_name:string,
dais_id:string,
seq_no:string

}

create type datasetType as closed{

name: union(nameType1, [nameType2])

}
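
The heterogeneous case can be sketched in the same spirit. Again this is Python for illustration only, not the framework's implementation: kind_of and infer_field_type are hypothetical names, and the sketch only tracks the coarse kind of each value (record, list, primitive) rather than the full nested record types.

```python
# Hypothetical sketch: not AsterixDB code, only an illustration of union inference.
def kind_of(value):
    """Return a coarse description of a value's kind."""
    if isinstance(value, dict):
        return "record"
    if isinstance(value, list):
        # Distinct item kinds, kept in first-seen order.
        item_kinds = list(dict.fromkeys(kind_of(item) for item in value))
        return "[" + " | ".join(item_kinds) + "]"
    if isinstance(value, str):
        return "string"
    return type(value).__name__


def infer_field_type(values):
    """One kind stays as-is; several distinct kinds become a union."""
    kinds = list(dict.fromkeys(kind_of(value) for value in values))
    if len(kinds) == 1:
        return kinds[0]
    return "union(" + ", ".join(kinds) + ")"


# Values of "name" across the two tuples above (fields abbreviated).
single = {"display_name": "Boxer, Laurence", "seq_no": "1"}
several = [
    {"display_name": "Adamek, Jiri", "seq_no": "1"},
    {"display_name": "Koubek, Vaclav", "seq_no": "2"},
]
name_type = infer_field_type([single, several])
```

With these values name_type is "union(record, [record])", the same shape as union(nameType1, [nameType2]). Had every value been a record, the result would simply be "record".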



--

*Regards,*
Wail Alkowaileet
