Hi Till,

I'm glad you brought that up. I tried to think of a better approach where the whole thing lives in Asterix.
The problem appears when I need to pass the information to the SchemaBuilder, which lives in the "custom" IPrinterFactory. AFAIK, there are only two paths: either do it the way I did (i.e., JobGenHelper.mkPrinters() sets the SchemaID and the HeterogeneousTypeComputer), or, for every query, create a new AqlCleanJSONWithSchemaPrinterFactoryProvider that holds the information the SchemaBuilder needs and then prepares the IPrinterFactory with it. Both ways work. I chose the first one because I wanted to keep the singleton pattern that all implementations of IPrinterFactoryProvider follow. So it's actually possible :-)) If the second way seems better, I can re-do it that way (I've put a rough sketch of it at the end of this mail).

Thanks.

On Wed, Jan 13, 2016 at 5:20 PM, Till Westmann <[email protected]> wrote:

> Hi Wail,
>
> thanks for writing this up!
>
> I took a brief look and everything looks great, but there's one thing
> that surprised me a bit: the modifications in Algebricks. It seemed to
> me that all the actual data and schema management should happen in
> AsterixDB and that Algebricks doesn't really need to know about this.
> Is there a (clean) way to keep all of this in AsterixDB?
> Or do you think that we need a (possibly more generic) extension point
> in Algebricks to support this feature?
>
> Cheers,
> Till
>
>
> On 13 Jan 2016, at 14:04, Wail Alkowaileet wrote:
>
>> Sorry, I forgot to put a link to the code:
>> https://github.com/Nullification/incubator-asterixdb
>> https://github.com/Nullification/incubator-asterixdb-hyracks
>>
>> It currently lives in my GitHub; I will push it to Gerrit soon.
>>
>> Thanks.
>>
>> On Wed, Jan 13, 2016 at 4:55 PM, Wail Alkowaileet <[email protected]>
>> wrote:
>>
>>> Hello Chen,
>>>
>>> Sorry for the late reply, I was hammered preparing for a workshop
>>> here in Boston. I also wanted to prepare a comprehensive design
>>> document that includes all the details about the schema inferencer
>>> framework I built.
>>>
>>> Please refer to it at:
>>> https://docs.google.com/document/d/1Ue-yAWoLChOJ8JlkbXWdDW0tSW9szzP76ePL4wQRmP0/edit#
>>>
>>> Just for the sake of your time (the document is a bit long), let's
>>> assume we have the following input:
>>>
>>> {name: {
>>>   display_name: "Boxer, Laurence",
>>>   first_name: "Laurence",
>>>   full_name: "Boxer, Laurence",
>>>   reprint: "Y",
>>>   role: "author",
>>>   wos_standard: "Boxer, L",
>>>   last_name: "Boxer",
>>>   seq_no: "1"
>>> }}
>>>
>>> {name: {
>>>   display_name: "Adamek, Jiri",
>>>   first_name: "Jiri",
>>>   addr_no: "1",
>>>   full_name: "Adamek, Jiri",
>>>   reprint: "Y",
>>>   role: "author",
>>>   wos_standard: "Adamek, J",
>>>   last_name: "Adamek",
>>>   dais_id: "10121636",
>>>   seq_no: "1"
>>> }}
>>>
>>> As the "tuples" are all of type record, the schema inferencer will
>>> compute the schema as the union of all records' fields.
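(To make that "union of all records' fields" rule concrete, here is a minimal, self-contained Java sketch of the idea. The real inferencer works on ADM types rather than type-name strings, and every name below is made up for illustration:)

import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of the merge rule only; names are illustrative, not the
// actual inferencer classes.
public final class SchemaMergeSketch {

    static final class FieldType {
        final String type;      // e.g. "string", "record", "orderedlist"
        final boolean optional; // absent in at least one input record?
        FieldType(String type, boolean optional) {
            this.type = type;
            this.optional = optional;
        }
        @Override public String toString() {
            return type + (optional ? "?" : "");
        }
    }

    // Union of all records' fields: shared fields keep their type, fields
    // missing on one side become optional, conflicting types become a union.
    static Map<String, FieldType> merge(Map<String, FieldType> a,
                                        Map<String, FieldType> b) {
        Map<String, FieldType> out = new LinkedHashMap<>();
        for (Map.Entry<String, FieldType> e : a.entrySet()) {
            FieldType mine = e.getValue();
            FieldType theirs = b.get(e.getKey());
            if (theirs == null) {
                // Only in 'a': keep the type but mark it optional.
                out.put(e.getKey(), new FieldType(mine.type, true));
            } else if (theirs.type.equals(mine.type)) {
                out.put(e.getKey(),
                        new FieldType(mine.type, mine.optional || theirs.optional));
            } else {
                // Conflicting types (e.g. record vs. list): fall back to a union.
                out.put(e.getKey(), new FieldType(
                        "union(" + mine.type + ", " + theirs.type + ")", false));
            }
        }
        for (Map.Entry<String, FieldType> e : b.entrySet()) {
            if (!a.containsKey(e.getKey())) {
                // Only in 'b': also optional in the merged schema.
                out.put(e.getKey(), new FieldType(e.getValue().type, true));
            }
        }
        return out;
    }
}

Running this over the two sample records marks addr_no and dais_id as string? (each is missing from one of the records), which is exactly the closed type in the quoted ADM below.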
>>>
>>> *as ADM:*
>>>
>>> create type nameType1 as closed {
>>>   display_name: string,
>>>   first_name: string,
>>>   addr_no: string?,
>>>   full_name: string,
>>>   reprint: string,
>>>   role: string,
>>>   wos_standard: string,
>>>   last_name: string,
>>>   dais_id: string?,
>>>   seq_no: string
>>> }
>>>
>>> create type datasetType as closed {
>>>   name: nameType1
>>> }
>>>
>>> However, for heterogeneous types, as in the following example:
>>>
>>> name: {
>>>   display_name: "Boxer, Laurence",
>>>   first_name: "Laurence",
>>>   full_name: "Boxer, Laurence",
>>>   reprint: "Y",
>>>   role: "author",
>>>   wos_standard: "Boxer, L",
>>>   last_name: "Boxer",
>>>   seq_no: "1"
>>> }
>>>
>>> name: [
>>>   {
>>>     display_name: "Adamek, Jiri",
>>>     first_name: "Jiri",
>>>     addr_no: "1",
>>>     full_name: "Adamek, Jiri",
>>>     reprint: "Y",
>>>     role: "author",
>>>     wos_standard: "Adamek, J",
>>>     last_name: "Adamek",
>>>     dais_id: "10121636",
>>>     seq_no: "1"
>>>   },
>>>   {
>>>     display_name: "Koubek, Vaclav",
>>>     first_name: "Vaclav",
>>>     addr_no: "2",
>>>     full_name: "Koubek, Vaclav",
>>>     role: "author",
>>>     wos_standard: "Koubek, V",
>>>     last_name: "Koubek",
>>>     dais_id: "12279647",
>>>     seq_no: "2"
>>>   }
>>> ]
>>>
>>> As you can see, the field "name" is sometimes a record and sometimes
>>> an ordered list. What Apache Spark does is simply infer "name" as a
>>> String.
>>>
>>> In Asterix's case, we can infer this type as a UNION of both the
>>> record and a list of records.
>>>
>>> *as ADM:*
>>>
>>> create type nameType1 as closed {
>>>   display_name: string,
>>>   first_name: string,
>>>   full_name: string,
>>>   reprint: string,
>>>   role: string,
>>>   wos_standard: string,
>>>   last_name: string,
>>>   seq_no: string
>>> }
>>>
>>> create type nameType2 as closed {
>>>   display_name: string,
>>>   first_name: string,
>>>   addr_no: string,
>>>   full_name: string,
>>>   reprint: string,
>>>   role: string,
>>>   wos_standard: string,
>>>   last_name: string,
>>>   dais_id: string,
>>>   seq_no: string
>>> }
>>>
>>> create type datasetType as closed {
>>>   name: union(nameType1, [nameType2])
>>> }
>>>
>>
>> --
>>
>> *Regards,*
>> Wail Alkowaileet
>>

--

*Regards,*
Wail Alkowaileet
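P.S. Here is the rough sketch of the second option I mentioned above. The class and interface names mirror the ones in the patch, but the shapes and signatures below are simplified and hypothetical (they are NOT the real Algebricks interfaces) — just enough to show a per-query, non-singleton provider carrying the state the SchemaBuilder needs:

// Sketch of option 2: one provider instance per query instead of a
// shared singleton. All interfaces here are simplified stand-ins.
public final class PerQueryProviderSketch {

    interface IPrinter {
        void print(Object value, StringBuilder out);
    }

    interface IPrinterFactory {
        IPrinter createPrinter();
    }

    interface IPrinterFactoryProvider {
        IPrinterFactory getPrinterFactory(Object type);
    }

    static final class AqlCleanJSONWithSchemaPrinterFactoryProvider
            implements IPrinterFactoryProvider {
        private final int schemaId;                      // per-query schema handle
        private final Object heterogeneousTypeComputer;  // stand-in for the real one

        // The compiler would build one of these per query and hand it the
        // state the SchemaBuilder needs, instead of JobGenHelper.mkPrinters()
        // setting that state on the factory afterwards.
        AqlCleanJSONWithSchemaPrinterFactoryProvider(int schemaId, Object typeComputer) {
            this.schemaId = schemaId;
            this.heterogeneousTypeComputer = typeComputer;
        }

        @Override
        public IPrinterFactory getPrinterFactory(Object type) {
            final int id = schemaId; // the factory closes over per-query state
            return () -> (value, out) ->
                    out.append("{\"schema\": ").append(id).append("}");
        }
    }
}

The trade-off is the one described above: the provider stops being a shared singleton, but no post-construction mutation of the factories is needed.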
