Re: Asterix Schema Provider Framework

Till Westmann Wed, 13 Jan 2016 15:01:01 -0800

Hi Wail,

creating one more object per query doesn’t scare me too much, especially if it’s only done if the schema is actually requested :) Also, it doesn’t seem that the construction would be very expensive.

So I think that I’d prefer the alternate way.


What do other people think?

Cheers,
Till

On 13 Jan 2016, at 14:49, Wail Alkowaileet wrote:

Hi Till,

I'm glad you brought that up. I tried to think about a better approach
where the whole thing lives in Asterix.

The problem appears when I need to pass the information to SchemaBuilder,
which lives in the "custom" IPrinterFactory. AFAIK, there are only two
paths. Either by doing it the way I did (i.e., JobGenHelper.mkPrinters()
sets the SchemaID and the HeterogeneousTypeComputer), or, for every
query, I create a new AqlCleanJSONWithSchemaPrinterFactoryProvider that
holds the information the SchemaBuilder needs. Then it prepares the
IPrinterFactory with the necessary information. Both ways work.

I chose the first one as I wanted to keep the same singleton pattern of all
implementations of IPrinterFactoryProvider.
So it's actually possible :-))

If that seems better, I can re-do it that way.
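
To make the two paths concrete, here is a minimal sketch of their shape. It is written in Python only for brevity (the actual code base is Java), and every class and function name in it is a hypothetical stand-in, not the real Algebricks or AsterixDB API. How mkPrinters() really hands over the SchemaID and the HeterogeneousTypeComputer is not shown in this thread, so the first path is an assumption about its general form.

```python
class SchemaBuilder:
    """Hypothetical stand-in for the component that needs per-query information."""
    def __init__(self, schema_id, type_computer):
        self.schema_id = schema_id
        self.type_computer = type_computer


class PrinterFactory:
    """Hypothetical stand-in for the "custom" printer factory owning a SchemaBuilder."""
    def __init__(self, schema_builder=None):
        self.schema_builder = schema_builder


# Path 1: the provider stays a stateless singleton, so the generic
# job-generation code has to inject the schema information itself.
class SingletonProvider:
    INSTANCE = None  # assigned right after the class definition

    def get_printer_factory(self):
        return PrinterFactory()  # knows nothing about the current query


SingletonProvider.INSTANCE = SingletonProvider()


def mk_printers_path1(provider, schema_id, type_computer):
    # The generic layer must know about schema_id and type_computer.
    factory = provider.get_printer_factory()
    factory.schema_builder = SchemaBuilder(schema_id, type_computer)
    return [factory]


# Path 2: one provider object per query carries the schema information,
# so the generic layer can stay schema-agnostic.
class PerQueryProvider:
    def __init__(self, schema_id, type_computer):
        self._schema_id = schema_id
        self._type_computer = type_computer

    def get_printer_factory(self):
        return PrinterFactory(SchemaBuilder(self._schema_id, self._type_computer))


def mk_printers_path2(provider):
    # No schema-specific arguments are needed here.
    return [provider.get_printer_factory()]


factory1 = mk_printers_path1(SingletonProvider.INSTANCE, 7, "type-computer")[0]
factory2 = mk_printers_path2(PerQueryProvider(7, "type-computer"))[0]
```

Both factories end up with the same SchemaBuilder contents; the difference is only in which layer has to know about the schema arguments.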

Thanks.

On Wed, Jan 13, 2016 at 5:20 PM, Till Westmann <[email protected]> wrote:

Hi Wail,

thanks for writing this up!

I took a brief look and everything looks great, but there’s one thing that
surprised me a bit: the modifications in Algebricks. It seemed to me that
all the actual data and schema management should happen in AsterixDB and
that Algebricks doesn’t really need to know about this.
Is there a (clean) way to keep all of this in AsterixDB?
Or do you think that we need a (possibly more generic) extension point in
Algebricks to support this feature?

Cheers,
Till


On 13 Jan 2016, at 14:04, Wail Alkowaileet wrote:

Sorry I forgot to put a link to the code:

https://github.com/Nullification/incubator-asterixdb
https://github.com/Nullification/incubator-asterixdb-hyracks

It currently lives on my GitHub; I will push it to Gerrit soon.

Thanks.

On Wed, Jan 13, 2016 at 4:55 PM, Wail Alkowaileet <[email protected]> wrote:

Hello Chen,

Sorry for the late reply, I was hammered preparing for a workshop here in
Boston.

Also I wanted to prepare a comprehensive design document that includes all
the details about the schema inferencer framework I built.

Please refer to it @:

https://docs.google.com/document/d/1Ue-yAWoLChOJ8JlkbXWdDW0tSW9szzP76ePL4wQRmP0/edit#

So just for the sake of your time (the document is a bit long):
Let's assume we have the following input:

{name: {
display_name: "Boxer, Laurence",
first_name: "Laurence",
full_name: "Boxer, Laurence",
reprint: "Y",
role: "author",
wos_standard: "Boxer, L",
last_name: "Boxer",
seq_no: "1"
}}

{name:{
display_name: "Adamek, Jiri",
first_name: "Jiri",
addr_no: "1",
full_name: "Adamek, Jiri",
reprint: "Y",
role: "author",
wos_standard: "Adamek, J",
last_name: "Adamek",
dais_id: "10121636",
seq_no: "1"
}}

As the "tuples" are all of type record, the schema inferencer will compute
the schema as the union of all the records' fields.

*as an ADM:*

create type nameType1 as closed{

display_name: string,
first_name:string,
addr_no:string?,
full_name: string,
reprint:string,
role:string,
wos_standard:string,
last_name:string,
dais_id:string?,
seq_no:string

}

create type datasetType as closed{

name: nameType1

}
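
The field-union rule behind these types can be sketched in a few lines of Python. This is only an illustration of the idea, not the inferencer's actual code: the function name is made up, and it assumes flat records whose values are all strings, as in the example input.

```python
def infer_record_schema(records):
    """Union the fields of flat records; a field missing from some record is optional."""
    seen = {}  # field name -> number of records that contain it (first-seen order)
    for record in records:
        for field in record:
            seen[field] = seen.get(field, 0) + 1
    total = len(records)
    # Assumption: every value is a string, so only optionality has to be decided.
    return {field: "string" if count == total else "string?"
            for field, count in seen.items()}


boxer = {
    "display_name": "Boxer, Laurence", "first_name": "Laurence",
    "full_name": "Boxer, Laurence", "reprint": "Y", "role": "author",
    "wos_standard": "Boxer, L", "last_name": "Boxer", "seq_no": "1",
}
adamek = {
    "display_name": "Adamek, Jiri", "first_name": "Jiri", "addr_no": "1",
    "full_name": "Adamek, Jiri", "reprint": "Y", "role": "author",
    "wos_standard": "Adamek, J", "last_name": "Adamek",
    "dais_id": "10121636", "seq_no": "1",
}

name_type = infer_record_schema([boxer, adamek])
```

Here addr_no and dais_id come out as "string?" because only the second record has them, matching the optional fields in nameType1.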

However, consider heterogeneous types, as in the following example:

name: {
display_name: "Boxer, Laurence",
first_name: "Laurence",
full_name: "Boxer, Laurence",
reprint: "Y",
role: "author",
wos_standard: "Boxer, L",
last_name: "Boxer",
seq_no: "1"
}

name: [
{
 display_name: "Adamek, Jiri",
 first_name: "Jiri",
 addr_no: "1",
 full_name: "Adamek, Jiri",
 reprint: "Y",
 role: "author",
 wos_standard: "Adamek, J",
 last_name: "Adamek",
 dais_id: "10121636",
 seq_no: "1"
},
{
 display_name: "Koubek, Vaclav",
 first_name: "Vaclav",
 addr_no: "2",
 full_name: "Koubek, Vaclav",
 role: "author",
 wos_standard: "Koubek, V",
 last_name: "Koubek",
 dais_id: "12279647",
 seq_no: "2"
}
]

As you can see, the field "name" is sometimes a record and sometimes an
ordered list. What Apache Spark does is infer name simply as a String.

In the Asterix case, we can infer this type as a UNION of both a record and
a list of records.

*as an ADM:*
create type nameType1 as closed{

display_name: string,
first_name:string,
full_name: string,
reprint:string,
role:string,
wos_standard:string,
last_name:string,
seq_no:string

}

create type nameType2 as closed{

display_name: string,
first_name:string,
addr_no:string,
full_name: string,
reprint:string,
role:string,
wos_standard:string,
last_name:string,
dais_id:string,
seq_no:string

}

create type datasetType as closed{

name: union(nameType1, [nameType2])

}
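
As a rough illustration of how such a union could be computed, the Python sketch below infers a type per value and merges types pairwise: values of the same kind are merged recursively, and values of different kinds become alternatives of a union. The representation and all function names are assumptions made for this sketch, optional fields are not tracked, and the input is an abbreviated version of the example above.

```python
def infer(value):
    """Infer a type descriptor for one value (records, ordered lists, strings)."""
    if isinstance(value, dict):
        return {"kind": "record", "fields": {k: infer(v) for k, v in value.items()}}
    if isinstance(value, list):
        item = None
        for element in value:
            item = merge(item, infer(element))
        return {"kind": "list", "item": item}
    return {"kind": "string"}  # assumption: every primitive is a string


def merge(a, b):
    """Merge two type descriptors; different kinds become a union."""
    if a is None or b is None:
        return a if b is None else b
    if b["kind"] == "union":
        for alternative in b["alts"]:
            a = merge(a, alternative)
        return a
    if a["kind"] == "union":
        alts = list(a["alts"])
        for i, alternative in enumerate(alts):
            if alternative["kind"] == b["kind"]:
                alts[i] = merge(alternative, b)  # fold into the matching kind
                break
        else:
            alts.append(b)
        return {"kind": "union", "alts": alts}
    if a["kind"] != b["kind"]:
        return {"kind": "union", "alts": [a, b]}
    if a["kind"] == "record":
        fields = dict(a["fields"])
        for name, field_type in b["fields"].items():
            fields[name] = merge(fields.get(name), field_type)
        return {"kind": "record", "fields": fields}
    if a["kind"] == "list":
        return {"kind": "list", "item": merge(a["item"], b["item"])}
    return a  # same primitive kind


def describe(t):
    """Render a descriptor in a compact, ADM-like notation."""
    if t is None:
        return ""
    if t["kind"] == "record":
        inner = ", ".join(f"{k}: {describe(v)}" for k, v in t["fields"].items())
        return "{" + inner + "}"
    if t["kind"] == "list":
        return "[" + describe(t["item"]) + "]"
    if t["kind"] == "union":
        return "union(" + ", ".join(describe(x) for x in t["alts"]) + ")"
    return t["kind"]


row1 = {"name": {"first_name": "Laurence", "reprint": "Y"}}
row2 = {"name": [{"first_name": "Jiri", "addr_no": "1"},
                 {"first_name": "Vaclav", "addr_no": "2"}]}

dataset_type = merge(infer(row1), infer(row2))
rendered = describe(dataset_type)
```

With these two rows, the "name" field ends up as a union whose first alternative is the record type and whose second is a list of records, which is the same shape as union(nameType1, [nameType2]).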


--

*Regards,*
Wail Alkowaileet

