Hello Chen,

Sorry for the late reply; I was swamped preparing for a workshop here in Boston. I also wanted to prepare a comprehensive design document that includes all the details about the schema inferencer framework I built.
Please refer to it at: https://docs.google.com/document/d/1Ue-yAWoLChOJ8JlkbXWdDW0tSW9szzP76ePL4wQRmP0/edit#

Just for the sake of your time (the document is a bit long), here is a short summary. Let's assume we have the following input:

    {name: {
        display_name: "Boxer, Laurence", first_name: "Laurence",
        full_name: "Boxer, Laurence", reprint: "Y", role: "author",
        wos_standard: "Boxer, L", last_name: "Boxer", seq_no: "1"
    }}
    {name: {
        display_name: "Adamek, Jiri", first_name: "Jiri", addr_no: "1",
        full_name: "Adamek, Jiri", reprint: "Y", role: "author",
        wos_standard: "Adamek, J", last_name: "Adamek",
        dais_id: "10121636", seq_no: "1"
    }}

As the "tuples" are all of type record, the schema inferencer will compute the schema as the union of all the records' fields. A field that does not appear in every record (here addr_no and dais_id) is marked optional with "?".

*As ADM:*

    create type nameType1 as closed{
        display_name: string,
        first_name: string,
        addr_no: string?,
        full_name: string,
        reprint: string,
        role: string,
        wos_standard: string,
        last_name: string,
        dais_id: string?,
        seq_no: string
    }
    create type datasetType as closed{
        name: nameType1
    }

However, for heterogeneous types, as in the following example:

    name: {
        display_name: "Boxer, Laurence", first_name: "Laurence",
        full_name: "Boxer, Laurence", reprint: "Y", role: "author",
        wos_standard: "Boxer, L", last_name: "Boxer", seq_no: "1"
    }
    name: [
        {
            display_name: "Adamek, Jiri", first_name: "Jiri", addr_no: "1",
            full_name: "Adamek, Jiri", reprint: "Y", role: "author",
            wos_standard: "Adamek, J", last_name: "Adamek",
            dais_id: "10121636", seq_no: "1"
        },
        {
            display_name: "Koubek, Vaclav", first_name: "Vaclav", addr_no: "2",
            full_name: "Koubek, Vaclav", role: "author",
            wos_standard: "Koubek, V", last_name: "Koubek",
            dais_id: "12279647", seq_no: "2"
        }
    ]

As you can see, the field "name" is sometimes a record and sometimes an ordered list. What Apache Spark does is simply infer name as a String. In the Asterix case, we can infer this type as a UNION of a record and a list of records.
*As ADM:*

    create type nameType1 as closed{
        display_name: string,
        first_name: string,
        full_name: string,
        reprint: string,
        role: string,
        wos_standard: string,
        last_name: string,
        seq_no: string
    }
    create type nameType2 as closed{
        display_name: string,
        first_name: string,
        addr_no: string,
        full_name: string,
        reprint: string?,
        role: string,
        wos_standard: string,
        last_name: string,
        dais_id: string,
        seq_no: string
    }
    create type datasetType as closed{
        name: union(nameType1, [nameType2])
    }
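
To make the two rules above concrete (record fields are unioned with missing ones marked optional, and a field seen with different kinds becomes a union), here is a rough Python sketch. It is only an illustration and not the framework's code: the function names (infer, merge, merge_same_kind, infer_dataset) and the dictionary-based schema representation are assumptions made up for this example, and the records are shortened versions of the ones above.

```python
# Illustrative sketch only; all names and the schema layout are hypothetical.

def infer(value):
    """Infer a schema for a single JSON-like value."""
    if isinstance(value, dict):
        fields = {k: {"type": infer(v), "optional": False} for k, v in value.items()}
        return {"kind": "record", "fields": fields}
    if isinstance(value, list):
        item = None
        for v in value:  # the item type is the merge of all element types
            item = merge(item, infer(v))
        return {"kind": "list", "item": item}
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return {"kind": "boolean"}
    if isinstance(value, int):
        return {"kind": "int64"}
    if isinstance(value, float):
        return {"kind": "double"}
    if value is None:
        return {"kind": "null"}
    return {"kind": "string"}


def options(schema):
    """View any schema as a {kind: schema} map of union alternatives."""
    if schema["kind"] == "union":
        return dict(schema["options"])
    return {schema["kind"]: schema}


def merge(a, b):
    """Merge two schemas; different kinds end up side by side in a union."""
    if a is None:
        return b
    if b is None:
        return a
    merged = options(a)
    for kind, schema in options(b).items():
        merged[kind] = merge_same_kind(merged[kind], schema) if kind in merged else schema
    if len(merged) == 1:
        return next(iter(merged.values()))
    return {"kind": "union", "options": merged}


def merge_same_kind(a, b):
    """Merge two schemas that have the same kind."""
    if a["kind"] == "record":
        names = list(a["fields"]) + [n for n in b["fields"] if n not in a["fields"]]
        fields = {}
        for name in names:
            fa, fb = a["fields"].get(name), b["fields"].get(name)
            if fa is None or fb is None:
                # Seen on one side only: keep its type, mark it optional ("?").
                fields[name] = {"type": (fa or fb)["type"], "optional": True}
            else:
                fields[name] = {"type": merge(fa["type"], fb["type"]),
                                "optional": fa["optional"] or fb["optional"]}
        return {"kind": "record", "fields": fields}
    if a["kind"] == "list":
        return {"kind": "list", "item": merge(a["item"], b["item"])}
    return a  # identical primitive kinds


def infer_dataset(records):
    """Fold merge() over the schemas of all input records."""
    schema = None
    for record in records:
        schema = merge(schema, infer(record))
    return schema


boxer = {"name": {"first_name": "Laurence", "reprint": "Y", "seq_no": "1"}}
adamek = {"name": {"first_name": "Jiri", "addr_no": "1", "reprint": "Y", "seq_no": "1"}}
authors = {"name": [{"first_name": "Jiri", "addr_no": "1", "reprint": "Y"},
                    {"first_name": "Vaclav", "addr_no": "2"}]}

homogeneous = infer_dataset([boxer, adamek])     # name: one record type
heterogeneous = infer_dataset([boxer, authors])  # name: union(record, [record])
```

In this sketch a union keeps at most one alternative per kind, so two record shapes seen directly under the same field are folded into a single record type with optional fields, while a record and a list of records stay separate alternatives, as in the union(nameType1, [nameType2]) example.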
