Hi all,

I'm new to Pig, so please forgive any mistakes in the description
below.

We are working on a MapReduce engine in C++ which also supports Pig key
types.

If I'm not mistaken, Pig key type classes are those extending
*PigNullableWritable*.
We have run into several issues regarding serialization/deserialization.

1. By default, NullableBag writes out a DefaultDataBag, which inherits
the write/readFields methods of DefaultAbstractBag. We found that the two
are inconsistent:

    public void readFields(DataInput in) throws IOException {
        long size = in.readLong();

        for (long i = 0; i < size; i++) {
            try {
                Object o = sedes.readDatum(in);
                add((Tuple)o);
            } catch (ExecException ee) {
                throw ee;
            }
        }
    }

    public void write(DataOutput out) throws IOException {
        sedes.writeDatum(out, this);
    }

When writing out, the first byte will be one of TINYBAG, SMALLBAG, or BAG,
followed by the size, which is a byte, short, or long accordingly.
Regardless of that format, the readFields method reads the size directly
as a long.
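
To make the mismatch concrete, here is a minimal, Pig-free sketch of the two code paths as we understand them. The marker values, the 255/65535 size thresholds, and all class and method names below are hypothetical and chosen only for illustration; they are not Pig's actual constants or APIs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class BagFormatMismatch {
    // Hypothetical marker values for illustration; NOT Pig's real constants.
    static final byte TINYBAG = 1;
    static final byte SMALLBAG = 2;
    static final byte BAG = 3;

    // Writer side: a marker byte, then a size whose width depends on the marker.
    // The 255 / 65535 thresholds are assumptions.
    static byte[] writeBagHeader(long size) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            if (size <= 255) {
                out.writeByte(TINYBAG);
                out.writeByte((int) size);
            } else if (size <= 65535) {
                out.writeByte(SMALLBAG);
                out.writeShort((int) size);
            } else {
                out.writeByte(BAG);
                out.writeLong(size);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reader side as in the readFields quoted above: reads the size as a
    // long straight away, without consuming a marker byte first.
    static long readSizeAsLong(byte[] data) {
        try {
            return new DataInputStream(new ByteArrayInputStream(data)).readLong();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // A reader that matches the writer: consume the marker, then read a
    // size of the corresponding width.
    static long readSizeWithMarker(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            byte marker = in.readByte();
            switch (marker) {
                case TINYBAG:  return in.readUnsignedByte();
                case SMALLBAG: return in.readUnsignedShort();
                case BAG:      return in.readLong();
                default: throw new IOException("unexpected marker " + marker);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] big = writeBagHeader(70000);
        System.out.println("marker-aware size: " + readSizeWithMarker(big));
        System.out.println("size read as a bare long: " + readSizeAsLong(big));
    }
}
```

In this sketch, a header written for a small bag is only two bytes long, so reading it as a long hits end-of-stream, while a header written for a large bag yields a value shifted by the marker byte.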

Since I'm new to Pig, I don't know how the bag is used across a MR job but
this doesn't look right to me.

2. NullableTuple may contain a generic WritableComparable, which we are not
able to support in C++. Hence, we want to throw an exception *early* if we
know a generic WritableComparable is in a tuple. "Early" here means at
init time or as soon as a user submits a Pig script. It is unacceptable to
wait until the map function is invoked before throwing.
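
To illustrate the kind of early check we have in mind, here is a minimal sketch. It assumes the key's schema is available as a tree at init time, which we have not confirmed for Pig. FieldType, Field, and validate are hypothetical names made up for this example, not Pig APIs.

```java
import java.util.ArrayList;
import java.util.List;

final class EarlyKeyTypeCheck {
    // Hypothetical type tags; a real check would use Pig's own type codes.
    enum FieldType { INT, LONG, CHARARRAY, TUPLE, BAG, GENERIC_WRITABLECOMPARABLE }

    // Hypothetical schema node: a type plus nested fields (for tuples and bags).
    static final class Field {
        final FieldType type;
        final List<Field> children;

        Field(FieldType type, List<Field> children) {
            this.type = type;
            this.children = children;
        }

        static Field leaf(FieldType type) {
            return new Field(type, new ArrayList<>());
        }
    }

    // Walks the schema tree and fails as soon as an unsupported type is
    // found, so the error surfaces before any map task starts.
    static void validate(Field field) {
        if (field.type == FieldType.GENERIC_WRITABLECOMPARABLE) {
            throw new UnsupportedOperationException(
                "generic WritableComparable keys are not supported");
        }
        for (Field child : field.children) {
            validate(child);
        }
    }
}
```

Whether such a check is possible depends on the element types being known before the job runs, which is exactly the part we are unsure about.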

Again, I don't know how a tuple is used across a MR job so the worry may
not be necessary.

Any ideas or suggestions would be appreciated.

Thanks,
Manu Zhang
