Sorry for the late reply; I was swamped preparing for a workshop here in
Boston.
I also wanted to prepare a comprehensive design document that includes all
the details of the schema inferencer framework I built.
Please refer to it at:
https://docs.google.com/document/d/1Ue-yAWoLChOJ8JlkbXWdDW0tSW9szzP76ePL4wQRmP0/edit#
To save you time (the document is a bit long), here is a short summary.
Let's assume we have the following input:
{name: {
display_name: "Boxer, Laurence",
first_name: "Laurence",
full_name: "Boxer, Laurence",
reprint: "Y",
role: "author",
wos_standard: "Boxer, L",
last_name: "Boxer",
seq_no: "1"
}}
{name:{
display_name: "Adamek, Jiri",
first_name: "Jiri",
addr_no: "1",
full_name: "Adamek, Jiri",
reprint: "Y",
role: "author",
wos_standard: "Adamek, J",
last_name: "Adamek",
dais_id: "10121636",
seq_no: "1"
}}
Since the "tuples" are all of type record, the schema inferencer computes
the schema as the union of all the records' fields. Fields that do not
appear in every record (addr_no and dais_id here) are marked optional.
*as an ADM:*
create type nameType1 as closed{
display_name: string,
first_name:string,
addr_no:string?,
full_name: string,
reprint:string,
role:string,
wos_standard:string,
last_name:string,
dais_id:string?,
seq_no:string
}
create type datasetType as closed{
name: nameType1
}
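
To make the rule concrete, here is a rough Python sketch of the union-of-fields
step. It is only an illustration, not the framework's actual code: the function
names infer_record_schema and to_adm are made up for this example, it assumes
flat records whose values are all strings, and the sample records are trimmed
to a few fields from the input above.

```python
def infer_record_schema(records):
    """Union the fields of flat string records.

    A field that is missing from at least one record is marked optional ("?").
    Hypothetical helper for illustration; assumes every value is a string.
    """
    counts = {}  # field name -> number of records containing it (first-seen order)
    for record in records:
        for field, value in record.items():
            if not isinstance(value, str):
                raise TypeError(f"only string values are handled here: {field!r}")
            counts[field] = counts.get(field, 0) + 1
    total = len(records)
    return {
        field: "string" if seen == total else "string?"
        for field, seen in counts.items()
    }


def to_adm(type_name, schema):
    """Render the inferred schema as a closed ADM type declaration."""
    body = ",\n".join(f"  {field}: {t}" for field, t in schema.items())
    return f"create type {type_name} as closed {{\n{body}\n}}"


# Sample records, trimmed to a few fields from the input above.
records = [
    {"display_name": "Boxer, Laurence", "seq_no": "1"},
    {"display_name": "Adamek, Jiri", "addr_no": "1",
     "dais_id": "10121636", "seq_no": "1"},
]
schema = infer_record_schema(records)
print(to_adm("nameType1", schema))
```

Fields come out in first-seen order here, which is just a property of this
sketch and says nothing about the order the real inferencer emits.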
However, types can be heterogeneous, as in the following example:
name: {
display_name: "Boxer, Laurence",
first_name: "Laurence",
full_name: "Boxer, Laurence",
reprint: "Y",
role: "author",
wos_standard: "Boxer, L",
last_name: "Boxer",
seq_no: "1"
}
name: [
{
display_name: "Adamek, Jiri",
first_name: "Jiri",
addr_no: "1",
full_name: "Adamek, Jiri",
reprint: "Y",
role: "author",
wos_standard: "Adamek, J",
last_name: "Adamek",
dais_id: "10121636",
seq_no: "1"
},
{
display_name: "Koubek, Vaclav",
first_name: "Vaclav",
addr_no: "2",
full_name: "Koubek, Vaclav",
role: "author",
wos_standard: "Koubek, V",
last_name: "Koubek",
dais_id: "12279647",
seq_no: "2"
}
]
As you can see, the field "name" is sometimes a record and sometimes an
ordered list. Apache Spark simply infers name as a String in this case.
In Asterix's case, we can instead infer the type as a UNION of a record
and a list of records.
*as an ADM:*
create type nameType1 as closed{
display_name: string,
first_name:string,
full_name: string,
reprint:string,
role:string,
wos_standard:string,
last_name:string,
seq_no:string
}
create type nameType2 as closed{
display_name: string,
first_name:string,
addr_no:string,
full_name: string,
reprint:string?,
role:string,
wos_standard:string,
last_name:string,
dais_id:string,
seq_no:string
}
create type datasetType as closed{
name: union(nameType1, [nameType2])
}
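
The union case could be sketched in Python along the following lines. Again,
this is only an illustration under stated assumptions, not the framework's
code: infer_value, merge_records and unify are hypothetical names, only
strings, records and lists are handled, the "any" fallback for empty lists is
a placeholder of this sketch, and the sample tuples are trimmed to a few
fields from the example above.

```python
def infer_value(value):
    """Infer a type descriptor for one value (strings, records, lists only).

    Descriptors used by this sketch: "string", a dict for a record,
    a one-element list for an ordered list, ("union", [...]) for a union.
    """
    if isinstance(value, dict):
        return {field: infer_value(v) for field, v in value.items()}
    if isinstance(value, list):
        return [unify([infer_value(v) for v in value])]
    if isinstance(value, str):
        return "string"
    raise TypeError(f"unsupported value: {value!r}")


def merge_records(records):
    """Union of record fields; a field missing from any record is optional."""
    merged = {}
    for field in dict.fromkeys(f for r in records for f in r):
        present = [r[field] for r in records if field in r]
        inner = unify(present)
        merged[field] = inner if len(present) == len(records) else ("optional", inner)
    return merged


def unify(types):
    """Combine all types observed in one position into a single descriptor."""
    records = [t for t in types if isinstance(t, dict)]
    others = []
    for t in types:
        if not isinstance(t, dict) and t not in others:
            others.append(t)  # keep each distinct non-record type once
    alternatives = ([merge_records(records)] if records else []) + others
    if not alternatives:
        return "any"  # placeholder for an empty list; an assumption of this sketch
    if len(alternatives) == 1:
        return alternatives[0]
    return ("union", alternatives)


# "name" is a record in the first tuple and a list of records in the second.
tuples = [
    {"name": {"display_name": "Boxer, Laurence", "seq_no": "1"}},
    {"name": [
        {"display_name": "Adamek, Jiri", "addr_no": "1", "seq_no": "1"},
        {"display_name": "Koubek, Vaclav", "addr_no": "2", "seq_no": "2"},
    ]},
]
schema = unify([infer_value(t) for t in tuples])
print(schema["name"])
```

Records found in the same position are merged into one record type first, and
only genuinely different kinds (here a record versus a list of records) end up
as separate alternatives of the union.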