My issue relates to data types in schemas and compile-time vs run-time type
checking.

http://pig.apache.org/docs/r0.14.0/basic.html#schemas says:

  we encourage you to use them whenever possible; type declarations result
  in better parse-time error checking and more efficient code execution.

In the course of trying to follow this sound advice I have stumbled upon
some puzzling behavior that ended up biting me. I'm hoping the Pig gods
can shed some light.

The following creates a relation containing a single tuple with one
chararray field whose value is 'abc':

  grunt> a = LOAD 'one-line-file.txt' AS (s:chararray);
  grunt> b = FOREACH a GENERATE 'abc' AS s:chararray;
  grunt> DESCRIBE b;
  b: {s: chararray}
  grunt> DUMP b;
  (abc)

Let's now project that field with a schema that declares a different type.
Note that I am not performing an explicit cast, just declaring the field
as a long in the AS clause.

  grunt> c = FOREACH b GENERATE s AS m:long;

I fully expected this to fail at compile time, since I am taking a
chararray and saying that it is now a long.

  grunt> DESCRIBE c;
  c: {m: long}

When it didn't fail at compile time, I expected it to either:
 1) fail at runtime
 2) perform a cast

Yet, I get

  grunt> DUMP c;
  (abc)

So, it didn't complain at either compile time or runtime.

If I then apply an explicit cast operation, the system thinks that c.m is
already a long and it doesn't modify the value.

  grunt> d = FOREACH c GENERATE (long)m AS n;
  grunt> DESCRIBE d;
  d: {n: long}
  grunt> DUMP d;
  (abc)


This looks like a bug to me. Seems to me that the 'conversion' from
chararray => long without an explicit cast should generate a compile-time
error. Otherwise, it should be interpreted as an implicit cast. In either
case, the value 'abc' should not be allowed to 'pass' as a long integer
value.

I ran into this issue because I am trying to auto-generate some Pig
code. I thought it would be best to fully specify schemas & types during
FOREACH/GENERATE projections because it would give me more compile-time
type safety. It now looks to me like this would be a mistake, because
erroneous types might make things even worse.
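
One possible workaround for generated code, offered only as a sketch: emit
an explicit cast on the source field every time the AS clause declares a
type, so the declared schema and the runtime value cannot disagree. The
relation name c2 is hypothetical, and I have not verified this against any
particular Pig release. As far as I understand Pig's cast semantics, a
chararray that cannot be parsed as a long should then come out as null
instead of passing through unchanged.

```pig
-- Same setup as above: one tuple with a single chararray field.
a = LOAD 'one-line-file.txt' AS (s:chararray);
b = FOREACH a GENERATE 'abc' AS s:chararray;

-- Pair the declared type with an explicit cast on the source field
-- (which is still typed chararray here), rather than relying on the
-- AS clause alone to change the type.
c2 = FOREACH b GENERATE (long)s AS m:long;

DESCRIBE c2;
DUMP c2;
```

The cast has to be applied while the field is still typed as chararray;
once a relation such as c already claims the field is a long, a later
(long) cast appears to be treated as a no-op, as the DUMP of d shows.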

Thoughts/advice appreciated.

Thanks for your great work.

Michael
