Hi Till,

I'm glad you brought that up. I tried to think of a better approach where the whole thing lives in Asterix.
The problem appears when I need to pass the information to the SchemaBuilder, which lives in the "custom" IPrinterFactory. AFAIK, there are only two paths: either do it the way I did (i.e., JobGenHelper.mkPrinters() sets the SchemaID and the HeterogeneousTypeComputer), or, for every query, create a new AqlCleanJSONWithSchemaPrinterFactoryProvider that holds the information the SchemaBuilder needs and then prepares the IPrinterFactory with it. Both ways work. I chose the first one because I wanted to keep the singleton pattern that all implementations of IPrinterFactoryProvider follow. So it's actually possible :-)) If the second way seems better, I can re-do it that way (I've put a rough sketch of it at the end of this mail).

Thanks.

On Wed, Jan 13, 2016 at 5:20 PM, Till Westmann <[email protected]> wrote:

> Hi Wail,
>
> thanks for writing this up!
>
> I took a brief look and everything looks great, but there's one thing
> that surprised me a bit: the modifications in Algebricks. It seemed to
> me that all the actual data and schema management should happen in
> AsterixDB and that Algebricks doesn't really need to know about this.
> Is there a (clean) way to keep all of this in AsterixDB?
> Or do you think that we need a (possibly more generic) extension point
> in Algebricks to support this feature?
>
> Cheers,
> Till
>
>
> On 13 Jan 2016, at 14:04, Wail Alkowaileet wrote:
>
>> Sorry, I forgot to put a link to the code:
>> https://github.com/Nullification/incubator-asterixdb
>> https://github.com/Nullification/incubator-asterixdb-hyracks
>>
>> It currently lives in my GitHub; I will push it to Gerrit soon.
>>
>> Thanks.
>>
>> On Wed, Jan 13, 2016 at 4:55 PM, Wail Alkowaileet <[email protected]>
>> wrote:
>>
>>> Hello Chen,
>>>
>>> Sorry for the late reply, I was hammered preparing for a workshop
>>> here in Boston. I also wanted to prepare a comprehensive design
>>> document that includes all the details about the schema inferencer
>>> framework I built.
>>>
>>> Please refer to it at:
>>> https://docs.google.com/document/d/1Ue-yAWoLChOJ8JlkbXWdDW0tSW9szzP76ePL4wQRmP0/edit#
>>>
>>> Just for the sake of your time (the document is a bit long), let's
>>> assume we have the following input:
>>>
>>> {name: {
>>>   display_name: "Boxer, Laurence",
>>>   first_name: "Laurence",
>>>   full_name: "Boxer, Laurence",
>>>   reprint: "Y",
>>>   role: "author",
>>>   wos_standard: "Boxer, L",
>>>   last_name: "Boxer",
>>>   seq_no: "1"
>>> }}
>>>
>>> {name: {
>>>   display_name: "Adamek, Jiri",
>>>   first_name: "Jiri",
>>>   addr_no: "1",
>>>   full_name: "Adamek, Jiri",
>>>   reprint: "Y",
>>>   role: "author",
>>>   wos_standard: "Adamek, J",
>>>   last_name: "Adamek",
>>>   dais_id: "10121636",
>>>   seq_no: "1"
>>> }}
>>>
>>> As the "tuples" are all of type record, the schema inferencer will
>>> compute the schema as the union of all records' fields.
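(To make that "union of all records' fields" rule concrete, here is a minimal, self-contained Java sketch of the idea. The real inferencer works on ADM types rather than type-name strings, and every name below is made up for illustration:)

import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of the merge rule only; names are illustrative, not the
// actual inferencer classes.
public final class SchemaMergeSketch {

    static final class FieldType {
        final String type;      // e.g. "string", "record", "orderedlist"
        final boolean optional; // absent in at least one input record?
        FieldType(String type, boolean optional) {
            this.type = type;
            this.optional = optional;
        }
        @Override public String toString() {
            return type + (optional ? "?" : "");
        }
    }

    // Union of all records' fields: shared fields keep their type, fields
    // missing on one side become optional, conflicting types become a union.
    static Map<String, FieldType> merge(Map<String, FieldType> a,
                                        Map<String, FieldType> b) {
        Map<String, FieldType> out = new LinkedHashMap<>();
        for (Map.Entry<String, FieldType> e : a.entrySet()) {
            FieldType mine = e.getValue();
            FieldType theirs = b.get(e.getKey());
            if (theirs == null) {
                // Only in 'a': keep the type but mark it optional.
                out.put(e.getKey(), new FieldType(mine.type, true));
            } else if (theirs.type.equals(mine.type)) {
                out.put(e.getKey(),
                        new FieldType(mine.type, mine.optional || theirs.optional));
            } else {
                // Conflicting types (e.g. record vs. list): fall back to a union.
                out.put(e.getKey(), new FieldType(
                        "union(" + mine.type + ", " + theirs.type + ")", false));
            }
        }
        for (Map.Entry<String, FieldType> e : b.entrySet()) {
            if (!a.containsKey(e.getKey())) {
                // Only in 'b': also optional in the merged schema.
                out.put(e.getKey(), new FieldType(e.getValue().type, true));
            }
        }
        return out;
    }
}

Running this over the two sample records marks addr_no and dais_id as string? (each is missing from one of the records), which is exactly the closed type in the quoted ADM below.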
>>>
>>> *as ADM:*
>>>
>>> create type nameType1 as closed {
>>>   display_name: string,
>>>   first_name: string,
>>>   addr_no: string?,
>>>   full_name: string,
>>>   reprint: string,
>>>   role: string,
>>>   wos_standard: string,
>>>   last_name: string,
>>>   dais_id: string?,
>>>   seq_no: string
>>> }
>>>
>>> create type datasetType as closed {
>>>   name: nameType1
>>> }
>>>
>>> However, for heterogeneous types, as in the following example:
>>>
>>> name: {
>>>   display_name: "Boxer, Laurence",
>>>   first_name: "Laurence",
>>>   full_name: "Boxer, Laurence",
>>>   reprint: "Y",
>>>   role: "author",
>>>   wos_standard: "Boxer, L",
>>>   last_name: "Boxer",
>>>   seq_no: "1"
>>> }
>>>
>>> name: [
>>>   {
>>>     display_name: "Adamek, Jiri",
>>>     first_name: "Jiri",
>>>     addr_no: "1",
>>>     full_name: "Adamek, Jiri",
>>>     reprint: "Y",
>>>     role: "author",
>>>     wos_standard: "Adamek, J",
>>>     last_name: "Adamek",
>>>     dais_id: "10121636",
>>>     seq_no: "1"
>>>   },
>>>   {
>>>     display_name: "Koubek, Vaclav",
>>>     first_name: "Vaclav",
>>>     addr_no: "2",
>>>     full_name: "Koubek, Vaclav",
>>>     role: "author",
>>>     wos_standard: "Koubek, V",
>>>     last_name: "Koubek",
>>>     dais_id: "12279647",
>>>     seq_no: "2"
>>>   }
>>> ]
>>>
>>> As you can see, the field "name" is sometimes a record and sometimes
>>> an ordered list. What Apache Spark does is simply infer "name" as a
>>> String.
>>>
>>> In Asterix's case, we can infer this type as a UNION of both the
>>> record and a list of records.
>>>
>>> *as ADM:*
>>>
>>> create type nameType1 as closed {
>>>   display_name: string,
>>>   first_name: string,
>>>   full_name: string,
>>>   reprint: string,
>>>   role: string,
>>>   wos_standard: string,
>>>   last_name: string,
>>>   seq_no: string
>>> }
>>>
>>> create type nameType2 as closed {
>>>   display_name: string,
>>>   first_name: string,
>>>   addr_no: string,
>>>   full_name: string,
>>>   reprint: string,
>>>   role: string,
>>>   wos_standard: string,
>>>   last_name: string,
>>>   dais_id: string,
>>>   seq_no: string
>>> }
>>>
>>> create type datasetType as closed {
>>>   name: union(nameType1, [nameType2])
>>> }
>>>
>>
>> --
>>
>> *Regards,*
>> Wail Alkowaileet
>>

--

*Regards,*
Wail Alkowaileet
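P.S. Here is the rough sketch of the second option I mentioned above. The class and interface names mirror the ones in the patch, but the shapes and signatures below are simplified and hypothetical (they are NOT the real Algebricks interfaces) — just enough to show a per-query, non-singleton provider carrying the state the SchemaBuilder needs:

// Sketch of option 2: one provider instance per query instead of a
// shared singleton. All interfaces here are simplified stand-ins.
public final class PerQueryProviderSketch {

    interface IPrinter {
        void print(Object value, StringBuilder out);
    }

    interface IPrinterFactory {
        IPrinter createPrinter();
    }

    interface IPrinterFactoryProvider {
        IPrinterFactory getPrinterFactory(Object type);
    }

    static final class AqlCleanJSONWithSchemaPrinterFactoryProvider
            implements IPrinterFactoryProvider {
        private final int schemaId;                      // per-query schema handle
        private final Object heterogeneousTypeComputer;  // stand-in for the real one

        // The compiler would build one of these per query and hand it the
        // state the SchemaBuilder needs, instead of JobGenHelper.mkPrinters()
        // setting that state on the factory afterwards.
        AqlCleanJSONWithSchemaPrinterFactoryProvider(int schemaId, Object typeComputer) {
            this.schemaId = schemaId;
            this.heterogeneousTypeComputer = typeComputer;
        }

        @Override
        public IPrinterFactory getPrinterFactory(Object type) {
            final int id = schemaId; // the factory closes over per-query state
            return () -> (value, out) ->
                    out.append("{\"schema\": ").append(id).append("}");
        }
    }
}

The trade-off is the one described above: the provider stops being a shared singleton, but no post-construction mutation of the factories is needed.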
