Sorry, I forgot to include a link to the code:
https://github.com/Nullification/incubator-asterixdb
https://github.com/Nullification/incubator-asterixdb-hyracks
It currently lives in my GitHub; I will push it to Gerrit soon.

Thanks.

On Wed, Jan 13, 2016 at 4:55 PM, Wail Alkowaileet <[email protected]> wrote:

> Hello Chen,
>
> Sorry for the late reply, I was hammered preparing for a workshop here in
> Boston.
> Also, I wanted to prepare a comprehensive design document that includes all
> the details about the schema inferencer framework I built.
>
> Please refer to it @:
> https://docs.google.com/document/d/1Ue-yAWoLChOJ8JlkbXWdDW0tSW9szzP76ePL4wQRmP0/edit#
>
> So just for the sake of your time (the document is a bit long):
> Let's assume we have the following input:
>
> {name: {
>     display_name: "Boxer, Laurence",
>     first_name: "Laurence",
>     full_name: "Boxer, Laurence",
>     reprint: "Y",
>     role: "author",
>     wos_standard: "Boxer, L",
>     last_name: "Boxer",
>     seq_no: "1"
> }}
>
> {name: {
>     display_name: "Adamek, Jiri",
>     first_name: "Jiri",
>     addr_no: "1",
>     full_name: "Adamek, Jiri",
>     reprint: "Y",
>     role: "author",
>     wos_standard: "Adamek, J",
>     last_name: "Adamek",
>     dais_id: "10121636",
>     seq_no: "1"
> }}
>
> As the "tuples" are all of type record, the schema inferencer will compute
> the schema as the union of all records' fields.
>
> *as an ADM:*
> create type nameType1 as closed {
>     display_name: string,
>     first_name: string,
>     addr_no: string?,
>     full_name: string,
>     reprint: string,
>     role: string,
>     wos_standard: string,
>     last_name: string,
>     dais_id: string?,
>     seq_no: string
> }
>
> create type datasetType as closed {
>     name: nameType1
> }
>
> However, for heterogeneous types, as in the following example:
>
> name: {
>     display_name: "Boxer, Laurence",
>     first_name: "Laurence",
>     full_name: "Boxer, Laurence",
>     reprint: "Y",
>     role: "author",
>     wos_standard: "Boxer, L",
>     last_name: "Boxer",
>     seq_no: "1"
> }
>
> name: [
>     {
>         display_name: "Adamek, Jiri",
>         first_name: "Jiri",
>         addr_no: "1",
>         full_name: "Adamek, Jiri",
>         reprint: "Y",
>         role: "author",
>         wos_standard: "Adamek, J",
>         last_name: "Adamek",
>         dais_id: "10121636",
>         seq_no: "1"
>     },
>     {
>         display_name: "Koubek, Vaclav",
>         first_name: "Vaclav",
>         addr_no: "2",
>         full_name: "Koubek, Vaclav",
>         role: "author",
>         wos_standard: "Koubek, V",
>         last_name: "Koubek",
>         dais_id: "12279647",
>         seq_no: "2"
>     }
> ]
>
> As you can see, the field "name" is sometimes a record and sometimes an
> ordered list. In this case Apache Spark simply infers name as a String.
>
> In the Asterix case, we can infer this type as a UNION of a record and a
> list of records.
>
> *as an ADM:*
> create type nameType1 as closed {
>     display_name: string,
>     first_name: string,
>     full_name: string,
>     reprint: string,
>     role: string,
>     wos_standard: string,
>     last_name: string,
>     seq_no: string
> }
>
> create type nameType2 as closed {
>     display_name: string,
>     first_name: string,
>     addr_no: string,
>     full_name: string,
>     reprint: string,
>     role: string,
>     wos_standard: string,
>     last_name: string,
>     dais_id: string,
>     seq_no: string
> }
>
> create type datasetType as closed {
>     name: union(nameType1, [nameType2])
> }
>
>

--
*Regards,*
Wail Alkowaileet
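
For readers who want to see the merge rule from the quoted message in isolation, here is a minimal Python sketch. It is not the code in the repositories linked above; the function names (infer, merge, merge_records) and the dictionary-based schema representation are hypothetical and chosen only for illustration. It assumes the input has already been parsed into Python dicts, lists, and scalars. A schema here is a mapping from a kind ("record", "list", "string", ...) to its description; a schema with more than one kind plays the role of the UNION described above.

```python
def infer(value):
    """Infer a schema for a single parsed value (hypothetical representation)."""
    if isinstance(value, dict):
        fields = {name: infer(v) for name, v in value.items()}
        return {"record": {"fields": fields, "optional": set()}}
    if isinstance(value, list):
        item = {}  # empty schema: nothing observed yet
        for v in value:
            item = merge(item, infer(v))
        return {"list": item}
    if isinstance(value, str):
        return {"string": None}
    # Other scalars are labeled by their Python type name (an assumption).
    return {type(value).__name__: None}


def merge(a, b):
    """Merge two schemas; kinds seen on only one side become union branches."""
    out = {}
    for kind in a.keys() | b.keys():
        if kind not in a:
            out[kind] = b[kind]
        elif kind not in b:
            out[kind] = a[kind]
        elif kind == "record":
            out[kind] = merge_records(a[kind], b[kind])
        elif kind == "list":
            out[kind] = merge(a[kind], b[kind])
        else:
            out[kind] = None  # same scalar kind on both sides
    return out


def merge_records(a, b):
    """Union the fields of two records; fields missing on one side are optional."""
    fields = {}
    optional = a["optional"] | b["optional"]
    for name in a["fields"].keys() | b["fields"].keys():
        if name in a["fields"] and name in b["fields"]:
            fields[name] = merge(a["fields"][name], b["fields"][name])
        else:
            fields[name] = a["fields"].get(name, b["fields"].get(name))
            optional.add(name)
    return {"fields": fields, "optional": optional}
```

Folding merge over the infer result of every input record gives the dataset schema. On the first example, "name" stays a single record kind with addr_no and dais_id marked optional; on the second, "name" ends up with both a "record" and a "list" kind, which corresponds to union(nameType1, [nameType2]). Note that in this sketch a field missing from some records inside the list is also marked optional.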
