Alan Gates wrote:
In the types branch we have changed pig to allow users to specify types (int, chararray, bag, etc.) when they load their data. We have also changed the backend to work with data in different types, and cast data when necessary. But we have sought to maintain the feature that if the user doesn't tell pig what type the data is, everything will still work. Given this, there are some semantics we need to clarify and a few changes that need to be made to support all possible cases.

So now in pig, data can be handled in one of three ways:

1) The data is typed (that is, it's an integer or a chararray or whatever) and pig knows it because the user has told pig the type. Pig can be told the type by the user as part of the script (a = load 'myfile' as (x:int, y:chararray);), by the load function through determineSchema(), or by an eval function via outputSchema().
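For example, an eval function can declare its return type by overriding outputSchema(). A minimal sketch, assuming the EvalFunc and Schema classes on the types branch (the udf itself is hypothetical):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

// Hypothetical udf that tells pig its output is always an int.
public class StrLen extends EvalFunc<Integer> {
    public Integer exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) return null;
        return ((String) input.get(0)).length();
    }
    public Schema outputSchema(Schema input) {
        // Declares an int result, so pig never needs a runtime cast here.
        return new Schema(new Schema.FieldSchema(null, DataType.INTEGER));
    }
}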

2) The data is not typed (that is, it's a bytearray). If pig then needs to convert the data to a typed value (for example, because the user adds an integer to it), it will depend on the load function that loaded the data to provide the cast. Pig uses the load function in this case because pig itself has no idea how data is represented inside a byte array.
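Concretely, the backend can hold the raw bytes and ask the originating load function to interpret them. A sketch, assuming the types-branch LoadFunc interface exposes per-type conversion methods such as bytesToInteger() (the surrounding helper is hypothetical):

import java.io.IOException;
import org.apache.pig.LoadFunc;
import org.apache.pig.data.DataByteArray;

public class ByteArrayCast {
    // 'loader' is whatever load function produced this field.
    static Integer castToInt(DataByteArray field, LoadFunc loader)
            throws IOException {
        // Only the load function knows how an int is encoded in its bytes.
        return loader.bytesToInteger(field.get());
    }
}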

3) The data is typed, but pig doesn't know it. This might be because neither the user nor the load function told pig the type. It could be because the data is returned from an evaluation function that does not implement outputSchema(). It could be because an operation such as UNION commingled data of various types. It could also be because the datum was contained in another datum whose schema was not completely specified (such as a tuple or bag) or could not be completely specified (such as a map). Note that it is legitimate for the user, load function, or eval function not to inform pig of the type; perhaps the type changes from row to row and so cannot be described in a schema.
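As an illustration of that last point, the hypothetical eval function below cannot describe its output in a schema because the type genuinely varies by row, so it leaves outputSchema() alone and pig sees case 3 data:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical udf whose return type varies from row to row; it does
// not override outputSchema(), so pig learns the type only at runtime.
public class ParseField extends EvalFunc<Object> {
    public Object exec(Tuple input) throws IOException {
        String s = (String) input.get(0);
        try {
            return Integer.valueOf(s); // this row is numeric
        } catch (NumberFormatException e) {
            return s;                  // this row stays a chararray
        }
    }
}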

In addition, pig now attempts to guess types if the user does not provide them. So, for a script like

a = load 'myfile' using MyLoader();
b = foreach a generate $0 + 1;

it appears that the user believes $0 to be an integer, so pig will attempt to convert it to an integer (or, if it already is one, leave it alone).

Case 3 is not yet supported, and supporting it will require some changes to pig's backend implementation. Specifically, the backend will need to handle the case where pig guessed that a datum was of one type but it turns out to be another. To use the example above, if MyLoader actually loaded $0 as a double, then pig needs to adapt to this.
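A minimal sketch of the kind of runtime check the backend would need, with all names hypothetical and the dispatch deliberately simplified:

public class GuessedAdd {
    // Pig guessed int for "$0 + 1"; adapt if the datum is really a double.
    static Object addOne(Object datum) {
        if (datum instanceof Integer) {
            return ((Integer) datum) + 1;   // the guess was right
        } else if (datum instanceof Double) {
            return ((Double) datum) + 1.0;  // adapt by widening the operation
        } else {
            // see the semantics proposed below: warn and emit a null
            return null;
        }
    }
}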


UNION is actually quite common, so some way to handle it would be quite useful.



In order to handle all of this, we need semantics that make clear to users, pig udf developers, and pig developers how pig interacts with these three kinds of data. I propose the following semantics:

1) Don't lie to the pig. If users or udf developers tell pig that a datum is of a certain type (via a specification in a script, use of LoadFunc.determineSchema(), or EvalFunc.outputSchema()), then it is safe for pig to assume the datum is of that type. Pig need not check it or place casts in the pipeline. If the datum is not of that type, then it is an error, and an error message will be emitted.


This makes sense. If the user (script/udf) declares a datum to be of some type, then it can be expected to be of that type.


2) Pigs fly. We want to choose performance over generality. In the example above, it would be safer to always convert $0 to a double, because as long as $0 is some kind of number the conversion will succeed; if $0 really is a double and pig treats it as an int, the value will be truncated. But treating it as an int is 10 times faster than treating it as a double, and the user can write "$0 + 1.0" if they really want the double.
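For what it's worth, the truncation in question is just Java's narrowing conversion; a plain-Java illustration:

public class Truncation {
    public static void main(String[] args) {
        double d = 3.7;
        System.out.println((int) d);  // prints 3: the int path drops the fraction
        System.out.println(d + 1.0);  // prints 4.7: what "$0 + 1.0" computes
    }
}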


I disagree: it is better to treat it as a double, and warn the user about the performance implications, than to treat it as an int and generate incorrect results.
Correctness is more important than performance.



3) Pigs eat anything, with reasonable speed. Pig will be able to run faster in certain cases when it knows the data type. This is particularly true if the data coming in is typed; on-the-fly data conversion will be more expensive than knowing the right types up front. Plus, pig may be able to make better optimization choices when it knows the types. But we cannot build the system in a way that punishes those who do not declare their types, or whose data does not lend itself to being declared.

It is acceptable to punish the user with performance penalties (with suitable warning messages, of course) when there is insufficient info for pig to optimize; that is better than being unusable to the user. In general, the assumption that the udf author, the script snippet author, and the script executor are all the same person is not really valid in non-trivial cases...


4) Pigs are friendly when treated nicely. In the cases where the user or udf didn't tell pig the type, it isn't an error if the type of the datum doesn't match the operation. Again, using the example above, if $0 turns out (at least in some cases) to be a chararray that cannot be cast to an int, then a null datum plus a warning will be emitted rather than an error.
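In implementation terms the cast might be wrapped along these lines; a sketch only, with the warning mechanism and all names hypothetical:

import org.apache.pig.LoadFunc;
import org.apache.pig.data.DataByteArray;

public class FriendlyCast {
    // For undeclared types a failed cast yields a null plus a warning
    // instead of killing the pipeline.
    static Integer toIntOrNull(Object field, LoadFunc loader) {
        try {
            if (field instanceof Integer) return (Integer) field;
            if (field instanceof DataByteArray) {
                return loader.bytesToInteger(((DataByteArray) field).get());
            }
            // e.g. a chararray like "abc" throws NumberFormatException here
            return Integer.valueOf(field.toString());
        } catch (Exception e) {
            System.err.println("warning: could not cast " + field + " to int");
            return null;
        }
    }
}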


This looks more like an incompatibility of the input data with the script/udf, no? For example, if the script declares column 1 to be an integer, and it turns out in the file to be a chararray, then either: a) it is a schema error in the script, and it is useless to continue the pipeline,
b) it is an anomaly in the input data, or
c) space for rent.


Different use cases might want to handle (b) differently; (a) is universally something that should result in flagging the script as an error. Not really sure how you will make the distinction between (a) and (b), though...



Regards,
Mridul



Thoughts?

Alan.
