A couple of thoughts:
The issue with removing the tuple keyword from the bag definition, so that
we can have bag: {a: int} instead of bag: {tuple: (a: int)}, is that we had
discussed allowing bags to be bags of anything, instead of only bags of
tuples. We aren't doing anything about that now, but we might in the
future. If we dropped the keyword now and later allowed bags of anything,
we would have to change the semantics of bag type declarations. Otherwise
we would not know whether bag {a: int} meant a bag of one-element tuples
or a bag of ints.
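To make the ambiguity concrete, here is a rough sketch of the two readings. The first statement follows the current PigType-branch form quoted below; the second is the proposed shorthand. The bag-of-ints reading is purely hypothetical, since bags of non-tuples are not supported today:

```pig
-- Current form: the tuple keyword states explicitly that the bag holds tuples.
A = LOAD 'file1' AS (fieldA: bag {tuple1: tuple(a: int)});

-- Proposed shorthand: today this can only mean a bag of one-field tuples (a: int).
B = LOAD 'file1' AS (fieldA: bag {a: int});

-- Hypothetical future, if bags could hold non-tuples: the declaration of B
-- could equally be read as a bag of plain ints, and nothing in the text of
-- the schema would tell the parser which reading was intended.
```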
As for letting {} alone mean bag, I'm concerned Pig Latin will end up
like Perl, where different brackets mean different things and the code is
hard to read. The other extreme is ending up like SQL, where it takes far
too many keywords to do anything. I'm open to others' views on this.
Alan.
pi song wrote:
Here is what I know:-
Tuple Schema = schema associated with "a" tuple
Bag Schema = schema of all tuples contained in a bag
Then, here is the current way to specify a schema in the PigType branch:-
A = LOAD 'file1' AS (fieldA: bag
{tuple1:tuple(a:int,b:long,c:float,d:double)}, fieldB: Int)
Isn't this needlessly verbose? Since we have already agreed that a bag
contains only tuples, not any other kind of datum, I think it would be
better if users could just write:-
A = LOAD 'file1' AS (fieldA: bag {a:int,b:long,c:float,d:double}, fieldB:
Int)
Or, even better, since the curly braces already indicate the bag data
type:-
A = LOAD 'file1' AS (fieldA: {a:int,b:long,c:float,d:double}, fieldB: Int)
So I think the keyword "bag" could be made optional for convenience, in the
same way that a tuple schema is already indicated by round brackets alone.
Any opinions? Now is the time to make this easy for users.
Pi
PS. I'm willing to make the change if everybody else is too busy.