Hi Wail,

thanks for writing this up!

I took a brief look and everything looks great, but there’s one thing that surprised me a bit: the modifications in Algebricks. It seemed to me that all the actual data and schema management should happen in AsterixDB and that Algebricks doesn’t really need to know about this.
Is there a (clean) way to keep all of this in AsterixDB?
Or do you think that we need a (possibly more generic) extension point in Algebricks to support this feature?

Cheers,
Till

On 13 Jan 2016, at 14:04, Wail Alkowaileet wrote:

Sorry, I forgot to include a link to the code:
https://github.com/Nullification/incubator-asterixdb
https://github.com/Nullification/incubator-asterixdb-hyracks

It currently lives on my GitHub; I will push it to Gerrit soon.

Thanks.

On Wed, Jan 13, 2016 at 4:55 PM, Wail Alkowaileet <[email protected]>
wrote:

Hello Chen,

Sorry for the late reply, I was hammered preparing for a workshop here in
Boston.
Also, I wanted to prepare a comprehensive design document that includes all
the details about the schema inferencer framework I built.

Please refer to it at:
https://docs.google.com/document/d/1Ue-yAWoLChOJ8JlkbXWdDW0tSW9szzP76ePL4wQRmP0/edit#

So just for the sake of your time (the document is a bit long):
Let's assume we have the following input:

{name: {
display_name: "Boxer, Laurence",
first_name: "Laurence",
full_name: "Boxer, Laurence",
reprint: "Y",
role: "author",
wos_standard: "Boxer, L",
last_name: "Boxer",
seq_no: "1"
}}

{name:{
display_name: "Adamek, Jiri",
first_name: "Jiri",
addr_no: "1",
full_name: "Adamek, Jiri",
reprint: "Y",
role: "author",
wos_standard: "Adamek, J",
last_name: "Adamek",
dais_id: "10121636",
seq_no: "1"
}}

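As the "tuples" are all of type record, the schema inferencer will compute
the schema as the union of all the records' fields. Fields that are missing
from some records (here addr_no and dais_id) become optional, marked with "?".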

*as an ADM:*
create type nameType1 as closed{

display_name: string,
first_name:string,
addr_no:string?,
full_name: string,
reprint:string,
role:string,
wos_standard:string,
last_name:string,
dais_id:string?,
seq_no:string

}

create type datasetType as closed{

name: nameType1

}
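
The union-of-fields rule above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in the linked repositories: the function name infer_record_schema and the "string?" output convention are assumptions made for this example, and the records are abbreviated versions of the input above.

```python
def infer_record_schema(records):
    """Union the fields of all records; a field absent from any record is optional."""
    field_types = {}
    for record in records:
        for field, value in record.items():
            # Only strings appear in the example; anything else falls back
            # to the Python type name. The first type seen for a field wins
            # (type conflicts are not handled in this sketch).
            type_name = "string" if isinstance(value, str) else type(value).__name__
            field_types.setdefault(field, type_name)
    return {
        # Append "?" when at least one record lacks the field.
        field: type_name + ("?" if any(field not in r for r in records) else "")
        for field, type_name in field_types.items()
    }


# Abbreviated versions of the two "name" records above.
records = [
    {"display_name": "Boxer, Laurence", "first_name": "Laurence", "seq_no": "1"},
    {"display_name": "Adamek, Jiri", "first_name": "Jiri", "addr_no": "1",
     "dais_id": "10121636", "seq_no": "1"},
]
schema = infer_record_schema(records)
for field, type_name in schema.items():
    print(f"{field}: {type_name}")
```

Fields present in both records come out as plain string, while addr_no and dais_id come out as string?, matching the optional fields in nameType1.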

However, for heterogeneous types, as in the following example:

name: {
display_name: "Boxer, Laurence",
first_name: "Laurence",
full_name: "Boxer, Laurence",
reprint: "Y",
role: "author",
wos_standard: "Boxer, L",
last_name: "Boxer",
seq_no: "1"
}

name: [
{
    display_name: "Adamek, Jiri",
    first_name: "Jiri",
    addr_no: "1",
    full_name: "Adamek, Jiri",
    reprint: "Y",
    role: "author",
    wos_standard: "Adamek, J",
    last_name: "Adamek",
    dais_id: "10121636",
    seq_no: "1"
},
{
    display_name: "Koubek, Vaclav",
    first_name: "Vaclav",
    addr_no: "2",
    full_name: "Koubek, Vaclav",
    role: "author",
    wos_standard: "Koubek, V",
    last_name: "Koubek",
    dais_id: "12279647",
    seq_no: "2"
}
]

As you can see, the field "name" is sometimes a record and sometimes an ordered list. What Apache Spark does is infer name simply as a String.

In AsterixDB's case, we can infer this type as a UNION of a record and a list
of records.

*as an ADM:*
create type nameType1 as closed{

display_name: string,
first_name:string,
full_name: string,
reprint:string,
role:string,
wos_standard:string,
last_name:string,
seq_no:string

}

create type nameType2 as closed{

display_name: string,
first_name:string,
addr_no:string,
full_name: string,
reprint:string?,
role:string,
wos_standard:string,
last_name:string,
dais_id:string,
seq_no:string

}

create type datasetType as closed{

name: union(nameType1, [nameType2])

}
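
The heterogeneous case can be sketched the same way: classify each observed value by its shape and emit a union when more than one shape shows up. Again, this is a rough Python illustration under assumptions, not the actual inferencer; the helper names (distinct, kind_of, infer_field_kind) and the string notation for kinds are made up for this example, and the records are abbreviated.

```python
def distinct(items):
    """Drop duplicates while keeping first-seen order."""
    out = []
    for item in items:
        if item not in out:
            out.append(item)
    return out


def kind_of(value):
    """Classify a value by its shape only (hypothetical helper)."""
    if isinstance(value, dict):
        return "record"
    if isinstance(value, list):
        # An ordered list is described by the kinds of its items.
        return "[" + " | ".join(distinct(kind_of(v) for v in value)) + "]"
    if isinstance(value, str):
        return "string"
    return type(value).__name__


def infer_field_kind(values):
    """Return a single kind, or union(...) when the observed kinds differ."""
    kinds = distinct(kind_of(v) for v in values)
    if not kinds:
        raise ValueError("need at least one observed value")
    return kinds[0] if len(kinds) == 1 else "union(" + ", ".join(kinds) + ")"


# "name" is a record in one tuple and an ordered list of records in another.
observed = [
    {"display_name": "Boxer, Laurence"},
    [{"display_name": "Adamek, Jiri"}, {"display_name": "Koubek, Vaclav"}],
]
print(infer_field_kind(observed))
```

Instead of collapsing the field to a string, the sketch keeps both shapes, which mirrors union(nameType1, [nameType2]) above.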





--

*Regards,*
Wail Alkowaileet
