In Pig9, if you have a UDF that specifies its outputSchema and that output
schema is wrong, then you will most likely get an exception such as:
java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
at java.lang.Integer.compareTo(Integer.java:37)
Errors like this are rare; they didn't seem to come up in Pig8, but they do
in Pig9, and the opaque error messages can be hard to decipher.
In this case, a UDF declared that it was outputting a Long, but it was in
fact outputting an Integer. At some point Pig tried to cast the value over
and failed.
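
For illustration, a contrived UDF along these lines (the class and field
names here are made up) would set up exactly this kind of failure:

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.DataType;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.impl.logicalLayer.schema.Schema;

    public class BadLengthUdf extends EvalFunc<Integer> {
        @Override
        public Integer exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0) {
                return null;
            }
            String s = (String) input.get(0);
            // The value actually returned is an Integer...
            return s.length();
        }

        @Override
        public Schema outputSchema(Schema input) {
            // ...but the declared schema claims a long, so a downstream
            // operator that trusts the schema (an ORDER BY, a comparison,
            // etc.) can hit a ClassCastException at runtime.
            return new Schema(new Schema.FieldSchema("len", DataType.LONG));
        }
    }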
That said, I wonder if it might be possible to add a runtime check that
inspects, say, the first output of your EvalFunc and, if its type does not
match the declared outputSchema, gives you a warning (I don't think it
should fail, but it should at least warn you to aid in debugging). I don't
think this would be too hard, and it would add minimal overhead compared to
the run time of a job. We could optionally add a flag or something for a
"strict" mode with respect to schemas. A rough sketch of what I have in
mind follows.
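
Roughly, I'm imagining something like the sketch below (the class and
method names are invented, and the real hook would live wherever Pig
invokes the EvalFunc; the check only runs once per task):

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.pig.data.DataType;
    import org.apache.pig.impl.logicalLayer.schema.Schema;

    public class OutputSchemaChecker {
        private static final Log LOG = LogFactory.getLog(OutputSchemaChecker.class);
        private boolean checked = false;

        // Compare the runtime type of the first value a UDF emits against
        // the type it declared in its outputSchema, and warn on a mismatch.
        public void checkFirstOutput(Object result, Schema declared, String udfName) {
            if (checked || result == null || declared == null) {
                return;
            }
            checked = true;
            try {
                byte actual = DataType.findType(result);
                byte expected = declared.getField(0).type;
                if (actual != expected) {
                    LOG.warn("UDF " + udfName + " declared output type "
                            + DataType.findTypeName(expected)
                            + " but actually returned "
                            + DataType.findTypeName(actual)
                            + "; later casts may fail.");
                }
            } catch (Exception e) {
                // The check is best-effort; never fail the job because of it.
            }
        }
    }

In a "strict" mode the warning could be promoted to an error instead.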
Related to this, when jobs die in opaque ways, I wonder if there might
be a way to give a clearer sense of where in the pipeline they die. You
can check pig.alias and try to work it out from where in the map or
reduce it was, but that's tough. I know that pipelining and
optimizations could make this difficult, but having a clearer sense of
what's going on would help with debugging.
Thoughts?