Mridul Muralidharan wrote:
Alan Gates wrote:
Case 3 is not yet supported, and supporting it will require some
changes to pig's backend implementation. Specifically it will need
to be able to handle the case where pig guessed that a datum was of
one type, but it turns out to be another. To use the example above,
if MyLoader actually loaded $0 as a double, then pig needs to adapt
to this.
Union is actually quite common, so some way to handle it would be
quite useful.
We certainly plan to support union fully.
2) Pigs fly. We want to choose performance over generality. In the
example above, it is safer to always convert $0 to double, because as
long as $0 is some kind of number you can do the conversion. If $0
really is a double and pig treats it as an int, it will be truncated.
But treating it as an int is 10 times faster than treating it as a
double. And the user can specify it as "$0 + 1.0" if they really
want the double.
I disagree - it is better to treat it as a double, and warn the user
about the performance implications, than to treat it as an int and
generate incorrect results.
Correctness is more important than performance.
This is not a correctness issue. When we are guessing the type, we will
always be wrong sometimes. If we say $0 + 1 implies an int, and $0 has
double data then we'll return 3 when the user wanted 3.14. If we say $0
+ 1 is a double and $0 has int data, then we'll return 42.0 when the
user wanted 42. 42.0 is closer to 42 than 3 is to 3.14, but if the user
has given us all int data and added an integer to it, and we output
double data, that's still not what the user wanted.
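To make the tradeoff concrete, here is a small sketch in Python (not Pig's actual backend; the function names are purely illustrative) of the two guessing policies for the untyped expression "$0 + 1":

```python
# Hypothetical sketch of the two type-guessing policies under discussion
# for "$0 + 1" when $0 has no declared type.

def add_one_as_int(datum):
    # Guess int: double data is truncated before the addition.
    return int(datum) + 1

def add_one_as_double(datum):
    # Guess double: int data is widened to a double.
    return float(datum) + 1.0

# $0 actually holds double data (the user wanted 3.14):
print(add_one_as_int(2.14))     # 3    -- truncated
print(add_one_as_double(2.14))  # 3.14

# $0 actually holds int data (the user wanted 42):
print(add_one_as_int(41))       # 42
print(add_one_as_double(41))    # 42.0 -- widened
```

Either guess is wrong for one of the two inputs; the argument above is only about which wrong answer is preferable, and how often each occurs.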
Given that we will always be wrong sometimes, the question is when do we
want to be wrong. In this case I advocate in favor of ints for 2 reasons:
1) Performance, as noted above. Integer computations are about 10x
faster than double computations.
2) Frequency of use. In my experience integral numbers are far more
common in databases than floating points (obviously this depends on the
data you're processing).
So 90% of the time we'll produce what the user wants and run 10x faster
given this assumption, and the other 10% we'll produce a number that
isn't exactly what the user wanted. If the user wants the double, he
can explicitly cast $0 or add 1.0 (instead of 1) to it.
4) Pigs are friendly when treated nicely. In the cases where the
user or udf didn't tell pig the type, it isn't an error if the type
of the datum doesn't match the operation. Again, using the example
above, if $0 turns out (at least in some cases) to be a chararray
which cannot be cast to int, then a null datum plus a warning will be
emitted rather than an error.
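A minimal sketch of the null-plus-warning behaviour described above, assuming a Python-like helper (this is not Pig's actual code, just an illustration of the semantics):

```python
import warnings

def cast_to_int_or_null(datum):
    # Sketch: when an untyped datum cannot be cast to the guessed type,
    # emit a warning and a null rather than failing the whole job.
    try:
        return int(datum)
    except (TypeError, ValueError):
        warnings.warn("could not cast %r to int; substituting null" % (datum,))
        return None  # stands in for Pig's null

# A chararray that happens to hold a number still casts fine...
print(cast_to_int_or_null("7"))      # 7
# ...but a non-numeric chararray yields null plus a warning, not an error.
print(cast_to_int_or_null("hello"))  # None
```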
This looks more like an incompatibility of the input data with the
script/udf, no?
For example, if the script declares column 1 as an integer, and it
turns out in the file to be a chararray, then either:
a) it is a schema error in the script - and it is useless to continue
the pipeline.
b) it is an anomaly in input data.
c) space for rent.
Different use cases might want to handle (b) differently - (a) is
universally something which should result in flagging the script as an
error. Not really sure how you will make the distinction between (a)
and (b) though ...
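One way the distinction could be drawn, purely as a sketch with an invented helper and not anything Pig actually implements, is to make a cast failure fatal only when the type came from a user-declared schema, and substitute a null when the type was merely guessed:

```python
def cast_datum(datum, to_type, declared):
    # Hypothetical helper separating case (a) from case (b) above.
    try:
        return to_type(datum)
    except (TypeError, ValueError):
        if declared:
            # (a) the script's declared schema is wrong: stop the pipeline.
            raise TypeError("schema error: %r is not %s"
                            % (datum, to_type.__name__))
        # (b) anomalous input under a guessed type: null it out.
        return None

print(cast_datum("12", int, declared=True))    # 12
print(cast_datum("oops", int, declared=False)) # None
```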
In case 4 here I'm not talking about the situation where the user gave
us a schema and it turns out to be wrong. That falls under case 1,
don't lie to the pig. I'm thinking here of situations where the user
doesn't tell us what the data is, or where the data differs from row
to row because of a union or just inconsistent data, which pig does allow.
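Handling data that varies row to row can be sketched as runtime dispatch on each datum's actual type rather than a single compile-time guess (again a hypothetical illustration in Python, not Pig's backend API):

```python
# Hypothetical per-row type adaptation for "$0 + 1" when rows from a
# union carry different runtime types.

def add_one(datum):
    # Dispatch on the runtime type of each datum, so int and double
    # rows in the same relation both work.
    if isinstance(datum, float):
        return datum + 1.0        # stays a double
    if isinstance(datum, int):
        return datum + 1          # stays an int
    raise TypeError("unexpected type: %r" % type(datum))

rows = [41, 2.5, 7]               # mixed types, as a union can produce
print([add_one(r) for r in rows]) # [42, 3.5, 8]
```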
Alan.