Hi Wail,

creating one more object per query doesn’t scare me too much - especially if it’s only done when the schema is actually requested :) Also, the construction doesn’t seem to be very expensive.
So I think I’d prefer the alternate way.

What do other people think?

Cheers,
Till

On 13 Jan 2016, at 14:49, Wail Alkowaileet wrote:

Hi Till,

I'm glad you brought that up. I tried to think about a better approach
where the whole thing lives in Asterix.

The problem appears when I need to pass the information to the SchemaBuilder,
which lives in the "custom" IPrinterFactory. AFAIK, there are only two
paths: either do it the way I did (i.e. JobGenHelper.mkPrinters() sets the SchemaID and the HeterogeneousTypeComputer), or, for every query, create a new AqlCleanJSONWithSchemaPrinterFactoryProvider that holds the information the SchemaBuilder needs and then prepares the IPrinterFactory with
the necessary information. Both ways work.

I chose the first one as I wanted to keep the same singleton pattern that all
implementations of IPrinterFactoryProvider follow.
So it's actually possible :-))
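For illustration, the two wirings above might look roughly like this Python sketch. The names (SchemaBuilder, SingletonPrinterFactoryProvider, PerQueryPrinterFactoryProvider, set_schema_info) are made up to mirror the discussion and are not the real AsterixDB classes:

```python
# Hypothetical sketch of the two paths discussed above; class and method
# names are stand-ins, not the actual AsterixDB/Algebricks API.

class SchemaBuilder:
    """Collects the information needed to emit a schema for one query."""
    def __init__(self, schema_id, type_computer):
        self.schema_id = schema_id
        self.type_computer = type_computer

# Path 1: a singleton provider; the job generator mutates shared state
# (set_schema_info plays the role of JobGenHelper.mkPrinters() here).
class SingletonPrinterFactoryProvider:
    _schema_id = None
    _type_computer = None

    @classmethod
    def set_schema_info(cls, schema_id, type_computer):
        cls._schema_id = schema_id
        cls._type_computer = type_computer

    @classmethod
    def printer_factory(cls):
        return SchemaBuilder(cls._schema_id, cls._type_computer)

# Path 2 (the "alternate way"): one small provider object per query,
# constructed only when the schema is actually requested.
class PerQueryPrinterFactoryProvider:
    def __init__(self, schema_id, type_computer):
        self.schema_id = schema_id
        self.type_computer = type_computer

    def printer_factory(self):
        return SchemaBuilder(self.schema_id, self.type_computer)
```

The trade-off is that path 1 keeps the existing singleton convention but threads mutable state through the job generator, while path 2 costs one extra small object per query and keeps the schema information local.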

If that seems better, I can re-do it that way.

Thanks.

On Wed, Jan 13, 2016 at 5:20 PM, Till Westmann <[email protected]> wrote:

Hi Wail,

thanks for writing this up!

I took a brief look and everything looks great, but there’s one thing that surprised me a bit: the modifications in Algebricks. It seemed to me that all the actual data and schema management should happen in AsterixDB and
that Algebricks doesn’t really need to know about this.
Is there a (clean) way to keep all of this in AsterixDB?
Or do you think that we need a (possibly more generic) extension point in
Algebricks to support this feature?

Cheers,
Till


On 13 Jan 2016, at 14:04, Wail Alkowaileet wrote:

Sorry I forgot to put a link to the code:
https://github.com/Nullification/incubator-asterixdb
https://github.com/Nullification/incubator-asterixdb-hyracks

It currently lives in my GitHub; I will push it to Gerrit soon.

Thanks.

On Wed, Jan 13, 2016 at 4:55 PM, Wail Alkowaileet <[email protected]>
wrote:

Hello Chen,

Sorry for the late reply, I was hammered preparing for a workshop here in Boston.
Also, I wanted to prepare a comprehensive design document that includes all
the details about the schema inference framework I built.

Please refer to it @:

https://docs.google.com/document/d/1Ue-yAWoLChOJ8JlkbXWdDW0tSW9szzP76ePL4wQRmP0/edit#

So, just to save your time (the document is a bit long), let's assume we have the following input:

{name: {
 display_name: "Boxer, Laurence",
 first_name: "Laurence",
 full_name: "Boxer, Laurence",
 reprint: "Y",
 role: "author",
 wos_standard: "Boxer, L",
 last_name: "Boxer",
 seq_no: "1"
}}

{name: {
 display_name: "Adamek, Jiri",
 first_name: "Jiri",
 addr_no: "1",
 full_name: "Adamek, Jiri",
 reprint: "Y",
 role: "author",
 wos_standard: "Adamek, J",
 last_name: "Adamek",
 dais_id: "10121636",
 seq_no: "1"
}}

As the "tuples" are all of type record, the schema inferencer will compute
the schema as the union of all the records' fields.

*as an ADM:*

create type nameType1 as closed {
 display_name: string,
 first_name: string,
 addr_no: string?,
 full_name: string,
 reprint: string,
 role: string,
 wos_standard: string,
 last_name: string,
 dais_id: string?,
 seq_no: string
}

create type datasetType as closed {
 name: nameType1
}
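Informally, the union rule above can be sketched in a few lines of Python. This is not the actual inferencer code; infer_record_schema is a made-up name, and the sketch assumes all leaf values are strings, as in the sample input:

```python
# Sketch of the record case: the inferred schema is the union of all field
# names, and a field becomes optional ("?") when it is missing from at
# least one record. Illustrative only, not the real AsterixDB inferencer.

def infer_record_schema(records):
    all_fields = []  # keep first-seen field order
    for rec in records:
        for field in rec:
            if field not in all_fields:
                all_fields.append(field)
    return {
        f: "string?" if any(f not in rec for rec in records) else "string"
        for f in all_fields
    }

records = [
    {"display_name": "Boxer, Laurence", "first_name": "Laurence",
     "seq_no": "1"},
    {"display_name": "Adamek, Jiri", "first_name": "Jiri",
     "addr_no": "1", "dais_id": "10121636", "seq_no": "1"},
]
# addr_no and dais_id appear in only one of the two records,
# so they are inferred as optional.
schema = infer_record_schema(records)
```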

However, consider heterogeneous types, as in the following example:

name: {
 display_name: "Boxer, Laurence",
 first_name: "Laurence",
 full_name: "Boxer, Laurence",
 reprint: "Y",
 role: "author",
 wos_standard: "Boxer, L",
 last_name: "Boxer",
 seq_no: "1"
}

name: [
{
 display_name: "Adamek, Jiri",
 first_name: "Jiri",
 addr_no: "1",
 full_name: "Adamek, Jiri",
 reprint: "Y",
 role: "author",
 wos_standard: "Adamek, J",
 last_name: "Adamek",
 dais_id: "10121636",
 seq_no: "1"
},
{
 display_name: "Koubek, Vaclav",
 first_name: "Vaclav",
 addr_no: "2",
 full_name: "Koubek, Vaclav",
 role: "author",
 wos_standard: "Koubek, V",
 last_name: "Koubek",
 dais_id: "12279647",
 seq_no: "2"
}
]

As you can see, the field "name" is sometimes a record and sometimes an
ordered list. What Apache Spark does is infer name simply as a String.

In Asterix's case, we can infer this type as a UNION of both a record and a
list of records.

*as an ADM:*
create type nameType1 as closed {

 display_name: string,
 first_name: string,
 full_name: string,
 reprint: string,
 role: string,
 wos_standard: string,
 last_name: string,
 seq_no: string

}

create type nameType2 as closed {

 display_name: string,
 first_name: string,
 addr_no: string,
 full_name: string,
 reprint: string,
 role: string,
 wos_standard: string,
 last_name: string,
 dais_id: string,
 seq_no: string

}

create type datasetType as closed {

 name: union(nameType1, [nameType2])

}
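The heterogeneous rule can likewise be sketched informally in Python. The names shape_of and infer_field_type are illustrative, not the real implementation; the point is only that a field seen with several shapes keeps all of them as a union instead of collapsing to string:

```python
# Sketch of the heterogeneous case: a field that is a record in some tuples
# and an ordered list of records in others is inferred as a union of both
# shapes, rather than falling back to string as Spark does. Illustrative only.

def shape_of(value):
    if isinstance(value, dict):
        return "record"
    if isinstance(value, list):
        return "list-of-records"
    return "string"

def infer_field_type(values):
    shapes = []  # keep first-seen order, deduplicated
    for v in values:
        s = shape_of(v)
        if s not in shapes:
            shapes.append(s)
    if len(shapes) == 1:
        return shapes[0]
    # heterogeneous field: keep every observed shape as a union
    return "union(" + ", ".join(shapes) + ")"

# The two observed values of "name" from the example above (abbreviated):
name_values = [
    {"display_name": "Boxer, Laurence"},    # a record
    [{"display_name": "Adamek, Jiri"},      # an ordered list of records
     {"display_name": "Koubek, Vaclav"}],
]
```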





--

*Regards,*
Wail Alkowaileet




--

*Regards,*
Wail Alkowaileet
