Alan Gates wrote:
In the types branch we have changed pig to allow users to specify types
(int, chararray, bag, etc.) when they load their data. We have also
changed the backend to work with data in different types, and cast data
when necessary. But we have sought to maintain the feature that if the
user doesn't tell pig what type the data is, everything will still
work. Given this, there are some semantics we need to clarify and a few
changes that need to be made to support all possible cases.
So now in pig, data can be handled in one of three ways:
1) The data is typed (that is, it's an integer or a chararray or
whatever) and pig knows it because the user has told pig the type. Pig
can be told the type by the user as part of the script (a = load
'myfile' as (x:int, y:chararray);), by the load function through
determineSchema, or by an eval function via outputSchema.
2) The data is not typed (that is, it's a bytearray). If pig then needs
to convert the data to a typed value (for example, because the user adds
an integer to it), it will depend on the load function that loaded the
data to provide a cast. Pig uses the load function in this case because
it has no idea how data is represented inside a bytearray.
3) The data is typed, but pig doesn't know about it. This might be
because neither the user nor the load function told it. It could be
because it's returned from an evaluation function that didn't implement
outputSchema. It could be because there was an operation such as UNION
that can co-mingle data of various types. It could also be because the
data was contained in another datum that may not have been completely
specified (such as a tuple or bag) or could not be completely specified
(like a map). Note that it is legitimate for the user, load function,
or eval function not to inform pig of the type. Perhaps the type
changes from row to row and so it cannot be described in a schema.
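As a sketch of the UNION flavor of this case:

a = load 'file1' as (x:int);
b = load 'file2' as (x:chararray);
c = union a, b;  -- $0 in c is an int in some rows and a chararray in others,
                 -- so no single type in a schema can describe it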
In addition, pig now attempts to guess types if the user does not
provide them. So, for a script like
a = load 'myfile' using MyLoader();
b = foreach a generate $0 + 1;
it appears that the user believes $0 to be an integer, so pig will
attempt to convert it to an integer (or, if it already is one, leave it
as is).
Case 3 is not yet supported, and supporting it will require some changes
to pig's backend implementation. Specifically, it will need to be able
to handle the case where pig guessed that a datum was of one type, but
it turns out to be another. To use the example above, if MyLoader
actually loaded $0 as a double, then pig needs to adapt to this.
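To make that mismatch concrete, here is the scenario as a sketch (again
with the hypothetical MyLoader):

a = load 'myfile' using MyLoader();  -- suppose MyLoader actually produces doubles
b = foreach a generate $0 + 1;       -- pig guesses $0 is an int, but at runtime each
                                     -- datum arrives as a double and pig must adapt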
Union is actually quite common - so some way to handle it would be quite
useful.
In order to handle all of this, we need some semantics that make clear
to users, pig udf developers, and pig developers how pig interacts with
these three types of data. I propose the following semantics:
1) Don't lie to the pig. If users or udf developers tell pig that a
datum is of a certain type (via specification in a script, use of
LoadFunc.determineSchema(), or EvalFunc.outputSchema()) then it is safe
for pig to assume that datum is of that type. It need not check or
place casts in the pipeline. If the datum is not of that type, then it
is an error, and an error message will be emitted.
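For instance, under this rule:

a = load 'myfile' as (x:int, y:chararray);
b = foreach a generate x + 1;  -- pig trusts the declaration: no runtime check or
                               -- cast on x; a row where x is not an int is an error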
This makes sense. If the user (script/udf) declares it to be of some
type, then it can be expected to be of that type.
2) Pigs fly. We want to choose performance over generality. In the
example above, it would be safer to always convert $0 to a double,
because as long as $0 is some kind of number the conversion will
succeed, whereas if $0 really is a double and pig treats it as an int,
pig will truncate it. But treating it as an int is 10 times faster than
treating it as a double, and the user can write "$0 + 1.0" if they
really want the double.
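That is, under this proposal:

b = foreach a generate $0 + 1;    -- $0 is treated as an int: fast, but truncates
                                  -- if the data really holds doubles
c = foreach a generate $0 + 1.0;  -- the user forces double arithmetic when that
                                  -- is what they actually want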
I disagree - it is better to treat it as a double, and warn the user about
the performance implications - than to treat it as an int and generate
incorrect results.
Correctness is more important than performance.
3) Pigs eat anything, with reasonable speed. Pig will be able to run
faster in certain cases when it knows the data type. This is
particularly true if the data coming in is typed. On-the-fly data
conversion will be more expensive than knowing the right types up
front. Plus pig may be able to make better optimization choices when it
knows the types. But we cannot build the system in a way that punishes
those who do not declare their types, or whose data does not lend
itself to being declared.
It is acceptable to punish the user with performance penalties (with
suitable warning messages, of course) when there is insufficient info
for pig to optimize .... better that than being unusable to the user.
In general, the assumption that the udf author, the script snippet
author, and the script executor are all the same person is not really
valid in non-trivial cases ....
4) Pigs are friendly when treated nicely. In the cases where the user
or udf didn't tell pig the type, it isn't an error if the type of the
datum doesn't match the operation. Again, using the example above, if
$0 turns out (at least in some cases) to be a chararray which cannot be
cast to int, then a null datum plus a warning will be emitted rather
than an error.
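Sketched out, with the same hypothetical MyLoader:

a = load 'myfile' using MyLoader();  -- no types declared anywhere
b = foreach a generate $0 + 1;       -- a row where $0 is, say, the chararray 'abc'
                                     -- yields a null plus a warning, not a job failure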
This looks more like an incompatibility between the input data and the
script/udf, no?
For example, if the script declares column 1 to be an integer, and it
turns out in the file to be a chararray, then either:
a) it is a schema error in the script - and it is useless to continue
the pipeline.
b) it is an anomaly in the input data.
c) space for rent.
Different use cases might want to handle (b) differently - (a) is
universally something which should result in flagging the script as an
error. Not really sure how you will make the distinction between (a) and
(b), though ...
Regards,
Mridul
Thoughts?
Alan.