Alan Gates wrote:
In the types branch we have changed pig to allow users to specify types
(int, chararray, bag, etc.) when they load their data. We have also
changed the backend to work with data in different types, and cast data
when necessary. But we have sought to maintain the feature that if the
user doesn't tell pig what type the data is, everything will still
work. Given this, there are some semantics we need to clarify and a few
changes that need to be made to support all possible cases.
So now in pig, data can be handled in one of three ways:
1) The data is typed (that is, it's an integer or a chararray or
whatever) and pig knows it because the user has told pig the type. Pig
can be told the type by the user as part of the script (a = load
'myfile' as (x:int, y:chararray);), by the load function through
determineSchema, or by an eval function via outputSchema.
2) The data is not typed (that is, it's a bytearray). If pig then needs
to convert the data to a typed value (for example, because the user adds
an integer to it), it will depend on the load function that loaded the
data to provide a cast. Pig uses the load function in this case because
it has no idea how data is represented inside a bytearray.
3) The data is typed, but pig doesn't know about it. This might be
because neither the user nor the load function told it. It could be
because it's returned from an evaluation function that didn't implement
outputSchema. It could be because there was an operation such as UNION
that can co-mingle data of various types. It could also be because the
data was contained in another datum that may not have been completely
specified (such as a tuple or bag) or could not be completely specified
(like a map). Note that it is legitimate for the user, load function,
or eval function not to inform pig of the type. Perhaps the type
changes from row to row and so it cannot be described in a schema.
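As a sketch of the UNION flavor of this case:

a = load 'file1' as (x:int);
b = load 'file2' as (x:chararray);
c = union a, b;  -- $0 in c is an int in some rows and a chararray in others,
                 -- so no single type in a schema can describe it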
In addition, pig now attempts to guess types if the user does not
provide them. So, for a script like
a = load 'myfile' using MyLoader();
b = foreach a generate $0 + 1;
it appears that the user believes $0 to be an integer, so pig will
attempt to convert it to an integer (or, if it already is one, leave it
as is).
Case 3 is not yet supported, and supporting it will require some changes
to pig's backend implementation. Specifically, it will need to be able
to handle the case where pig guessed that a datum was of one type, but
it turns out to be another. To use the example above, if MyLoader
actually loaded $0 as a double, then pig needs to adapt to this.
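To make that mismatch concrete, here is the scenario as a sketch (again
with the hypothetical MyLoader):

a = load 'myfile' using MyLoader();  -- suppose MyLoader actually produces doubles
b = foreach a generate $0 + 1;       -- pig guesses $0 is an int, but at runtime each
                                     -- datum arrives as a double and pig must adapt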
Union is actually quite common - so some way to handle it would be quite
useful.
In order to handle all of this, we need some semantics that make clear
to users, pig udf developers, and pig developers how pig interacts with
these three types of data. I propose the following semantics:
1) Don't lie to the pig. If users or udf developers tell pig that a
datum is of a certain type (via specification in a script, use of
LoadFunc.determineSchema(), or EvalFunc.outputSchema()) then it is safe
for pig to assume that datum is of that type. It need not check or
place casts in the pipeline. If the datum is not of that type, then it
is an error, and an error message will be emitted.
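For instance, under this rule:

a = load 'myfile' as (x:int, y:chararray);
b = foreach a generate x + 1;  -- pig trusts the declaration: no runtime check or
                               -- cast on x; a row where x is not an int is an error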
This makes sense. If the user (script/udf) declares it to be of some
type, then it can be expected to be of that type.
2) Pigs fly. We want to choose performance over generality. In the
example above, it would be safer to always convert $0 to a double,
because as long as $0 is some kind of number the conversion will
succeed, whereas if $0 really is a double and pig treats it as an int,
pig will truncate it. But treating it as an int is 10 times faster than
treating it as a double, and the user can write "$0 + 1.0" if they
really want the double.
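That is, under this proposal:

b = foreach a generate $0 + 1;    -- $0 is treated as an int: fast, but truncates
                                  -- if the data really holds doubles
c = foreach a generate $0 + 1.0;  -- the user forces double arithmetic when that
                                  -- is what they actually want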
I disagree - it is better to treat it as a double, and warn the user about
the performance implications - than to treat it as an int and generate
incorrect results.
Correctness is more important than performance.
3) Pigs eat anything, with reasonable speed. Pig will be able to run
faster in certain cases when it knows the data type. This is
particularly true if the data coming in is typed. On-the-fly data
conversion will be more expensive than knowing the right types up
front. Plus pig may be able to make better optimization choices when it
knows the types. But we cannot build the system in a way that punishes
those who do not declare their types, or whose data does not lend
itself to being declared.
It is acceptable to punish the user with performance penalties (with
suitable warning messages, of course) when there is insufficient info
for pig to optimize .... better that than being unusable to the user.
In general, the assumption that the udf author, the script snippet
author, and the script executor are all the same person is not really
valid in non-trivial cases ....
4) Pigs are friendly when treated nicely. In the cases where the user
or udf didn't tell pig the type, it isn't an error if the type of the
datum doesn't match the operation. Again, using the example above, if
$0 turns out (at least in some cases) to be a chararray which cannot be
cast to int, then a null datum plus a warning will be emitted rather
than an error.
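Sketched out, with the same hypothetical MyLoader:

a = load 'myfile' using MyLoader();  -- no types declared anywhere
b = foreach a generate $0 + 1;       -- a row where $0 is, say, the chararray 'abc'
                                     -- yields a null plus a warning, not a job failure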
This looks more like an incompatibility between the input data and the
script/udf, no?
For example, if the script declares column 1 to be an integer, and it
turns out in the file to be a chararray, then either:
a) it is a schema error in the script - and it is useless to continue
the pipeline.
b) it is an anomaly in the input data.
c) space for rent.
Different use cases might want to handle (b) differently - (a) is
universally something which should result in flagging the script as an
error. Not really sure how you will make the distinction between (a) and
(b), though ...
Regards,
Mridul
Thoughts?
Alan.