Re: Asterix Schema Provider Framework

Wail Alkowaileet Wed, 13 Jan 2016 13:55:42 -0800

Hello Chen,

Sorry for the late reply; I was hammered preparing for a workshop here in
Boston.
I also wanted to prepare a comprehensive design document that includes all
the details about the schema inferencer framework I built.


Please refer to it @:
https://docs.google.com/document/d/1Ue-yAWoLChOJ8JlkbXWdDW0tSW9szzP76ePL4wQRmP0/edit#

So just for the sake of your time (the document is a bit long):
Let's assume we have the following input:

{name: {
   display_name: "Boxer, Laurence",
   first_name: "Laurence",
   full_name: "Boxer, Laurence",
   reprint: "Y",
   role: "author",
   wos_standard: "Boxer, L",
   last_name: "Boxer",
   seq_no: "1"
}}

{name:{
   display_name: "Adamek, Jiri",
   first_name: "Jiri",
   addr_no: "1",
   full_name: "Adamek, Jiri",
   reprint: "Y",
   role: "author",
   wos_standard: "Adamek, J",
   last_name: "Adamek",
   dais_id: "10121636",
   seq_no: "1"
}}

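As the "tuples" are all of type record, the schema inferencer will compute
the schema as the union of all the records' fields. Fields that do not
appear in every record (addr_no and dais_id here) are marked optional
with "?".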

*as an ADM:*
create type nameType1 as closed{

display_name: string,
first_name:string,
addr_no:string?,
full_name: string,
reprint:string,
role:string,
wos_standard:string,
last_name:string,
dais_id:string?,
seq_no:string

}

create type datasetType as closed{

name: nameType1

}
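
To make the union-of-fields rule concrete, here is a minimal Python sketch.
It is not the actual inferencer code: the function name infer_record_schema
is hypothetical, and it assumes every value is a string, as in the two
records above.

```python
def infer_record_schema(records):
    """Union the fields of flat records.

    Assumption for this sketch: every value is a string. A field that is
    missing from at least one record is reported as optional ("string?").
    """
    seen = {}  # field name -> number of records that contain it
    for record in records:
        for field in record:
            seen[field] = seen.get(field, 0) + 1
    return {
        field: "string" if count == len(records) else "string?"
        for field, count in seen.items()
    }


# The two "name" records from the example input.
boxer = {
    "display_name": "Boxer, Laurence", "first_name": "Laurence",
    "full_name": "Boxer, Laurence", "reprint": "Y", "role": "author",
    "wos_standard": "Boxer, L", "last_name": "Boxer", "seq_no": "1",
}
adamek = {
    "display_name": "Adamek, Jiri", "first_name": "Jiri", "addr_no": "1",
    "full_name": "Adamek, Jiri", "reprint": "Y", "role": "author",
    "wos_standard": "Adamek, J", "last_name": "Adamek",
    "dais_id": "10121636", "seq_no": "1",
}

schema = infer_record_schema([boxer, adamek])
for field, field_type in schema.items():
    print(f"{field}: {field_type}")
```

Only addr_no and dais_id come out as "string?", since they are the only
fields missing from the first record.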

However, for heterogeneous types, as in the following example:

name: {
   display_name: "Boxer, Laurence",
   first_name: "Laurence",
   full_name: "Boxer, Laurence",
   reprint: "Y",
   role: "author",
   wos_standard: "Boxer, L",
   last_name: "Boxer",
   seq_no: "1"
}

name: [
   {
       display_name: "Adamek, Jiri",
       first_name: "Jiri",
       addr_no: "1",
       full_name: "Adamek, Jiri",
       reprint: "Y",
       role: "author",
       wos_standard: "Adamek, J",
       last_name: "Adamek",
       dais_id: "10121636",
       seq_no: "1"
   },
   {
       display_name: "Koubek, Vaclav",
       first_name: "Vaclav",
       addr_no: "2",
       full_name: "Koubek, Vaclav",
       role: "author",
       wos_standard: "Koubek, V",
       last_name: "Koubek",
       dais_id: "12279647",
       seq_no: "2"
   }
]

As you can see, the field "name" is sometimes a record and sometimes an
ordered list. In this situation, Apache Spark simply infers name as a String.

In Asterix's case, we can infer this type as a UNION of a record and a list
of records.

*as an ADM:*
create type nameType1 as closed{

display_name: string,
first_name:string,
full_name: string,
reprint:string,
role:string,
wos_standard:string,
last_name:string,
seq_no:string

}

create type nameType2 as closed{

display_name: string,
first_name:string,
addr_no:string,
full_name: string,
reprint:string?,
role:string,
wos_standard:string,
last_name:string,
dais_id:string,
seq_no:string

}

create type datasetType as closed{

name: union(nameType1, [nameType2])

}
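
The union idea for heterogeneous fields can be sketched in Python as well.
Again, this is only an illustration under stated assumptions, not the real
framework: the helpers kind, combine and infer_field_type are hypothetical
names, every non-record, non-list value is treated as a string, and records
are reported only as "record" rather than as a named type with its fields.

```python
def combine(kinds):
    """Collapse a sequence of kinds into one kind, or a union if they differ."""
    distinct = list(dict.fromkeys(kinds))  # de-duplicate, keep first-seen order
    if not distinct:
        return "any"  # assumption: no observations (e.g. an empty list)
    if len(distinct) == 1:
        return distinct[0]
    return "union(" + ", ".join(distinct) + ")"


def kind(value):
    """Classify one value as a record, an ordered list, or a string."""
    if isinstance(value, dict):
        return "record"
    if isinstance(value, list):
        return "[" + combine(kind(item) for item in value) + "]"
    return "string"  # assumption: all primitive values are strings


def infer_field_type(values):
    """Infer the type of a field from every value observed for it."""
    return combine(kind(value) for value in values)


# "name" is a record in one tuple and a list of records in another.
observed = [
    {"last_name": "Boxer", "seq_no": "1"},
    [{"last_name": "Adamek", "seq_no": "1"},
     {"last_name": "Koubek", "seq_no": "2"}],
]
name_type = infer_field_type(observed)
print(name_type)
```

Keeping both alternatives in a union preserves the structure of the nested
records, instead of falling back to a single string type for the field.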
