Mridul Muralidharan wrote:
Alan Gates wrote:


Case 3 is not yet supported, and supporting it will require some changes to pig's backend implementation. Specifically it will need to be able to handle the case where pig guessed that a datum was of one type, but it turns out to be another. To use the example above, if MyLoader actually loaded $0 as a double, then pig needs to adapt to this.
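A rough sketch of the kind of adaptation described here, as a toy Python model rather than Pig's actual Java backend (the function name and dispatch logic are made up for illustration):

```python
# Hypothetical sketch of evaluating "$0 + 1" when pig guessed the type:
# stay on the fast path when the guess holds, adapt when it doesn't.
def add_one(datum, guessed_type=int):
    # Guess was right: use the cheap integer path.
    if isinstance(datum, guessed_type):
        return datum + 1
    # Guess was wrong (e.g. MyLoader actually produced a double for $0):
    # widen the operation to the datum's actual type instead of failing.
    if isinstance(datum, float):
        return datum + 1.0
    raise TypeError(f"cannot add 1 to {type(datum).__name__}")

print(add_one(41))    # guessed int, got int -> 42
print(add_one(2.5))   # guessed int, got double -> 3.5
```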


union is quite common actually - so some way to handle it would be quite useful.
We certainly plan to support union fully.


2) Pigs fly. We want to choose performance over generality. In the example above, it is safer to always convert $0 to double, because as long as $0 is some kind of number the conversion will succeed. If $0 really is a double and pig treats it as an int, it will truncate it. But treating it as an int is 10 times faster than treating it as a double. And the user can specify "$0 + 1.0" if they really want the double.


I disagree - it is better to treat it as a double and warn the user about the performance implications than to treat it as an int and generate incorrect results.
Correctness is more important than performance.
This is not a correctness issue. When we are guessing the type, we will always be wrong sometimes. If we say $0 + 1 implies an int and $0 has double data, then we'll return 3 when the user wanted 3.14. If we say $0 + 1 is a double and $0 has int data, then we'll return 42.0 when the user wanted 42. 42.0 is closer to 42 than 3 is to 3.14, but if the user has given us all int data and added an integer to it, and we output double data, that's still not what the user wanted.

Given that we will always be wrong sometimes, the question is when do we want to be wrong. In this case I advocate in favor of ints for 2 reasons:

1) Performance, as noted above. Integer computations are about 10x faster than double computations.

2) Frequency of use. In my experience integral numbers are far more common in databases than floating point numbers (obviously this depends on the data you're processing).

So 90% of the time we'll produce what the user wants and run 10x faster given this assumption, and the other 10% we'll produce a number that isn't exactly what the user wanted. If the user wants the double, he can explicitly cast $0 or add 1.0 (instead of 1) to it.
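The tradeoff above can be modeled with a toy sketch (plain Python standing in for Pig's evaluation, with invented function names): guessing int truncates double data, while explicitly writing "+ 1.0" forces the double path.

```python
# Toy model of the proposed default versus the user's escape hatch.
def plus_one_guessing_int(datum):
    # Pig guesses "$0 + 1" is integer arithmetic: double data gets truncated.
    return int(datum) + 1

def plus_one_explicit_double(datum):
    # The user wrote "$0 + 1.0", explicitly asking for double arithmetic.
    return float(datum) + 1.0

print(plus_one_guessing_int(3.5))     # -> 4, not 4.5: truncated on the int path
print(plus_one_explicit_double(3.5))  # -> 4.5: the user forced the double
```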

4) Pigs are friendly when treated nicely. In the cases where the user or UDF didn't tell pig the type, it isn't an error if the type of the datum doesn't match the operation. Again, using the example above, if $0 turns out (at least in some cases) to be a chararray which cannot be cast to an int, then a null datum plus a warning will be emitted rather than an error.
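The "null plus a warning instead of an error" behavior might look like this sketch (a Python stand-in, not Pig's actual cast code; None models Pig's null):

```python
import warnings

# Lenient implicit cast: a datum that can't be converted yields
# null (None here) plus a warning, rather than aborting the pipeline.
def lenient_cast_to_int(datum):
    try:
        return int(datum)
    except (TypeError, ValueError):
        warnings.warn(f"implicit cast of {datum!r} to int failed; emitting null")
        return None

print(lenient_cast_to_int("42"))     # -> 42
print(lenient_cast_to_int("hello"))  # -> None, with a warning
```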


This looks more like an incompatibility of the input data with the script/UDF, no? For example, if the script declares column 1 to be an integer, and it turns out in the file to be a chararray, then either:
a) it is a schema error in the script - and it is useless to continue the pipeline, or
b) it is an anomaly in the input data.


Different use cases might want to handle (b) differently - (a) is universally something which should result in flagging the script as an error. Not really sure how you will make the distinction between (a) and (b), though ...

In case 4 here I'm not talking about the situation where the user gave us a schema and it turns out to be wrong. That falls under case 1, don't lie to the pig. I'm thinking here of situations where the user doesn't tell us what the data is, or where the data differs row to row because of a union or just inconsistent data, which pig does allow.
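The row-to-row case might be handled with per-row dispatch, roughly like this toy sketch (hypothetical names; None models Pig's null for incompatible data):

```python
# Data whose type varies row to row, e.g. after a union of relations
# with different schemas, or just inconsistent input.
rows = [1, 2.5, "3", "oops"]

def add_one_per_row(datum):
    # Numeric data: operate directly.
    if isinstance(datum, (int, float)):
        return datum + 1
    # Chararray that looks numeric: cast and operate.
    try:
        return int(datum) + 1
    # Incompatible data: emit null rather than erroring out.
    except ValueError:
        return None

print([add_one_per_row(r) for r in rows])  # -> [2, 3.5, 4, None]
```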

Alan.
