Mridul Muralidharan wrote:
Alan Gates wrote:
Case 3 is not yet supported, and supporting it will require some
changes to pig's backend implementation. Specifically it will need
to be able to handle the case where pig guessed that a datum was of
one type, but it turns out to be another. To use the example above,
if MyLoader actually loaded $0 as a double, then pig needs to adapt
to this.
Union is actually quite common, so some way to handle it would be
quite useful.
We certainly plan to support union fully.
2) Pigs fly. We want to choose performance over generality. In the
example above, it is safer to always convert $0 to double, because as
long as $0 is some kind of number you can do the conversion. If $0
really is a double and pig treats it as an int, it will be truncated.
But treating it as an int is 10 times faster than treating it as a
double. And the user can specify it as "$0 + 1.0" if they really
want the double.
I disagree - it is better to treat it as a double, and warn the user
about the performance implications, than to treat it as an int and
generate incorrect results.
Correctness is more important than performance.
This is not a correctness issue. When we are guessing the type, we will
always be wrong sometimes. If we say $0 + 1 implies an int, and $0 has
double data then we'll return 3 when the user wanted 3.14. If we say $0
+ 1 is a double and $0 has int data, then we'll return 42.0 when the
user wanted 42. 42.0 is closer to 42 than 3 is to 3.14, but if the user
has given us all int data and added an integer to it, and we output
double data, that's still not what the user wanted.
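To make the tradeoff concrete, here is a small sketch in Python (not Pig's actual backend; the function names are purely illustrative) of the two guessing policies for the untyped expression "$0 + 1":

```python
# Hypothetical sketch of the two type-guessing policies under discussion
# for "$0 + 1" when $0 has no declared type.

def add_one_as_int(datum):
    # Guess int: double data is truncated before the addition.
    return int(datum) + 1

def add_one_as_double(datum):
    # Guess double: int data is widened to a double.
    return float(datum) + 1.0

# $0 actually holds double data (the user wanted 3.14):
print(add_one_as_int(2.14))     # 3    -- truncated
print(add_one_as_double(2.14))  # 3.14

# $0 actually holds int data (the user wanted 42):
print(add_one_as_int(41))       # 42
print(add_one_as_double(41))    # 42.0 -- widened
```

Either guess is wrong for one of the two inputs; the argument above is only about which wrong answer is preferable, and how often each occurs.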
Given that we will always be wrong sometimes, the question is when do we
want to be wrong. In this case I advocate in favor of ints for 2 reasons:
1) Performance, as noted above. Integer computations are about 10x
faster than double computations.
2) Frequency of use. In my experience integral numbers are far more
common in databases than floating points (obviously this depends on the
data you're processing).
So 90% of the time we'll produce what the user wants and run 10x faster
given this assumption, and the other 10% we'll produce a number that
isn't exactly what the user wanted. If the user wants the double, he
can explicitly cast $0 or add 1.0 (instead of 1) to it.
4) Pigs are friendly when treated nicely. In the cases where the
user or udf didn't tell pig the type, it isn't an error if the type
of the datum doesn't match the operation. Again, using the example
above, if $0 turns out (at least in some cases) to be a chararray
which cannot be cast to int, then a null datum plus a warning will be
emitted rather than an error.
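A minimal sketch of the null-plus-warning behaviour described above, assuming a Python-like helper (this is not Pig's actual code, just an illustration of the semantics):

```python
import warnings

def cast_to_int_or_null(datum):
    # Sketch: when an untyped datum cannot be cast to the guessed type,
    # emit a warning and a null rather than failing the whole job.
    try:
        return int(datum)
    except (TypeError, ValueError):
        warnings.warn("could not cast %r to int; substituting null" % (datum,))
        return None  # stands in for Pig's null

# A chararray that happens to hold a number still casts fine...
print(cast_to_int_or_null("7"))      # 7
# ...but a non-numeric chararray yields null plus a warning, not an error.
print(cast_to_int_or_null("hello"))  # None
```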
This looks more like an incompatibility of the input data with the
script/udf, no?
For example, if the script declares column 1 as an integer, and it
turns out in the file to be a chararray, then either:
a) it is a schema error in the script - and it is useless to continue
the pipeline.
b) it is an anomaly in input data.
c) space for rent.
Different use cases might want to handle (b) differently - (a) is
universally something which should result in flagging the script as an
error. Not really sure how you will make the distinction between (a)
and (b) though ...
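One way the distinction could be drawn, purely as a sketch with an invented helper and not anything Pig actually implements, is to make a cast failure fatal only when the type came from a user-declared schema, and substitute a null when the type was merely guessed:

```python
def cast_datum(datum, to_type, declared):
    # Hypothetical helper separating case (a) from case (b) above.
    try:
        return to_type(datum)
    except (TypeError, ValueError):
        if declared:
            # (a) the script's declared schema is wrong: stop the pipeline.
            raise TypeError("schema error: %r is not %s"
                            % (datum, to_type.__name__))
        # (b) anomalous input under a guessed type: null it out.
        return None

print(cast_datum("12", int, declared=True))    # 12
print(cast_datum("oops", int, declared=False)) # None
```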
In case 4 here I'm not talking about the situation where the user gave
us a schema and it turns out to be wrong. That falls under case 1,
don't lie to the pig. I'm thinking here of situations where the user
doesn't tell us what the data is, or where the data differs from row
to row because of a union or just inconsistent data, which pig does allow.
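Handling data that varies row to row can be sketched as runtime dispatch on each datum's actual type rather than a single compile-time guess (again a hypothetical illustration in Python, not Pig's backend API):

```python
# Hypothetical per-row type adaptation for "$0 + 1" when rows from a
# union carry different runtime types.

def add_one(datum):
    # Dispatch on the runtime type of each datum, so int and double
    # rows in the same relation both work.
    if isinstance(datum, float):
        return datum + 1.0        # stays a double
    if isinstance(datum, int):
        return datum + 1          # stays an int
    raise TypeError("unexpected type: %r" % type(datum))

rows = [41, 2.5, 7]               # mixed types, as a union can produce
print([add_one(r) for r in rows]) # [42, 3.5, 8]
```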
Alan.