Hi all,
I'm new to Pig, so please forgive any mistakes in the description
below.
We are working on a MapReduce engine in C++ which also supports Pig key
types.
If I'm not mistaken, Pig key type classes are those extending
*PigNullableWritable*.
We have run into several issues regarding serialization/deserialization.
1. By default, NullableBag will write out a DefaultDataBag which inherits
the write/readFields methods of DefaultAbstractBag. We found the two are
inconsistent:
    public void readFields(DataInput in) throws IOException {
        long size = in.readLong();
        for (long i = 0; i < size; i++) {
            try {
                Object o = sedes.readDatum(in);
                add((Tuple)o);
            } catch (ExecException ee) {
                throw ee;
            }
        }
    }

    public void write(DataOutput out) throws IOException {
        sedes.writeDatum(out, this);
    }
When writing out, the first byte will be one of TINYBAG, SMALLBAG, or BAG,
followed by the size, which is a byte, short, or long accordingly.
Regardless of that format, the readFields method reads the size directly
as a long, without consuming the type byte first.
Since I'm new to Pig, I don't know how the bag is used across an MR job, but
this doesn't look right to me.
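
To make the mismatch concrete, here is a minimal standalone sketch (no Pig dependency). The marker values, the size thresholds, and the class and method names are assumptions made up for illustration; they are not Pig's actual constants or API. It only mimics the layout described above: a type byte followed by a size of matching width.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class BagHeaderSketch {
    // Hypothetical marker values, for illustration only (not Pig's constants).
    static final byte TINYBAG = 1;
    static final byte SMALLBAG = 2;
    static final byte BAG = 3;

    // Writer side: a type marker, then a size whose width depends on the marker.
    static byte[] encode(long size) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            if (size <= Byte.MAX_VALUE) {
                out.writeByte(TINYBAG);
                out.writeByte((int) size);
            } else if (size <= Short.MAX_VALUE) {
                out.writeByte(SMALLBAG);
                out.writeShort((int) size);
            } else {
                out.writeByte(BAG);
                out.writeLong(size);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reader that consumes the marker first and then reads the matching width.
    static long decodeHonoringMarker(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            byte marker = in.readByte();
            switch (marker) {
                case TINYBAG:  return in.readByte();
                case SMALLBAG: return in.readShort();
                case BAG:      return in.readLong();
                default: throw new IOException("unexpected bag marker: " + marker);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reader shaped like the quoted readFields: reads a long straight away.
    static long decodeAsLong(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            return in.readLong();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch, a bag of 3 tuples is encoded in 2 bytes, so the long-reading decoder runs out of input, while for a large bag it folds the marker byte into the size and returns a wrong number. Whether the real code paths ever meet like this in an MR job is the part I can't tell.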
2. NullableTuple may contain a generic WritableComparable, which we are not
able to support in C++. Hence, we want to throw an exception *early* if we
know a generic WritableComparable is in a tuple. "Early" here means at
init time or as soon as a user submits a Pig script. It is unacceptable to
wait until the map function is invoked before throwing.
Again, I don't know how a tuple is used across an MR job, so this worry may
be unnecessary.
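
Roughly, the kind of check we have in mind looks like the sketch below. It assumes the key schema is visible to us before any task starts, which is exactly what I'm unsure about. The FieldType enum, the Field class, and validate are hypothetical names for this example only, not Pig classes.

```java
import java.util.Arrays;
import java.util.List;

final class EarlyKeyTypeCheck {
    // Hypothetical type tags for this sketch; these are not Pig's DataType constants.
    enum FieldType { INT, LONG, CHARARRAY, TUPLE, BAG, GENERIC_WRITABLECOMPARABLE }

    // Hypothetical schema node: a name, a type, and nested fields for tuples and bags.
    static final class Field {
        final String name;
        final FieldType type;
        final List<Field> children;

        Field(String name, FieldType type, Field... children) {
            this.name = name;
            this.type = type;
            this.children = Arrays.asList(children);
        }
    }

    // Walks the key schema depth-first and throws on the first unsupported field,
    // so the failure happens at init time rather than inside the map function.
    static void validate(Field field, String path) {
        String here = path.isEmpty() ? field.name : path + "." + field.name;
        if (field.type == FieldType.GENERIC_WRITABLECOMPARABLE) {
            throw new UnsupportedOperationException(
                "generic WritableComparable is not supported in keys: " + here);
        }
        for (Field child : field.children) {
            validate(child, here);
        }
    }
}
```

Calling validate(keySchema, "") once per job would reject a key such as a tuple containing a generic WritableComparable field before any record is processed.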
Any ideas or suggestions would be appreciated.
Thanks,
Manu Zhang