Avoid serialization/deserialization costs for PigStorage data - Use custom Tuple
--------------------------------------------------------------------------------
                 Key: PIG-1474
                 URL: https://issues.apache.org/jira/browse/PIG-1474
             Project: Pig
          Issue Type: Improvement
    Affects Versions: 0.8.0
            Reporter: Thejas M Nair
            Assignee: Thejas M Nair
             Fix For: 0.8.0

Avoid sedes (serialization/deserialization) when possible for data loaded using PigStorage by implementing approach #4 proposed in http://wiki.apache.org/pig/AvoidingSedes .

The write() and readFields() methods of the tuple returned by TupleFactory are used to serialize data between Map and Reduce. By using a tuple that knows the serialization format of the loader, we avoid sedes at the Map/Reduce boundary and use the load function's serialized format between Map and Reduce.

To use a new custom tuple for this purpose, a custom TupleFactory that returns tuples of this type has to be specified using the property "pig.data.tuple.factory.name".

This approach will work only for a set of load functions in the query that share the same serialization format for maps and bags. If this approach proves to be very useful, it will build a case for a more extensible approach.
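
To make the idea concrete, here is a minimal, self-contained sketch of a tuple that keeps the loader's bytes and passes them through write() and readFields() unchanged, parsing fields only when one is requested. It is an illustration under stated assumptions, not the patch for this issue: LazyDelimitedTuple and SedesSketchDemo are hypothetical names, the class does not implement Pig's Tuple interface, fields are assumed to be tab-delimited UTF-8 text (PigStorage's default delimiter), and the length-prefixed wire layout is made up for the example. In Pig itself, such a tuple would have to be returned by a custom TupleFactory named through "pig.data.tuple.factory.name".

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Pig tuple; it does NOT implement org.apache.pig.data.Tuple.
final class LazyDelimitedTuple {
    private static final byte DELIMITER = '\t'; // assumed: PigStorage's default delimiter

    private byte[] raw;          // the record exactly as the loader read it
    private List<String> fields; // parsed lazily, only when a field is requested

    LazyDelimitedTuple() { this(new byte[0]); }

    LazyDelimitedTuple(byte[] raw) { this.raw = raw; }

    // Serialize by copying the loader's bytes; no field is parsed or re-encoded.
    void write(DataOutput out) throws IOException {
        out.writeInt(raw.length); // length prefix is an assumption of this sketch
        out.write(raw);
    }

    // Deserialize by reading the bytes back; parsing is still deferred.
    void readFields(DataInput in) throws IOException {
        raw = new byte[in.readInt()];
        in.readFully(raw);
        fields = null;
    }

    // Parse on first access by splitting the raw bytes on the delimiter.
    String get(int index) {
        if (fields == null) {
            fields = split(raw);
        }
        return fields.get(index);
    }

    boolean isParsed() { return fields != null; }

    private static List<String> split(byte[] data) {
        List<String> result = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= data.length; i++) {
            if (i == data.length || data[i] == DELIMITER) {
                result.add(new String(data, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        return result;
    }
}

// Hypothetical driver that imitates the Map -> Reduce hand-off in memory.
final class SedesSketchDemo {
    static LazyDelimitedTuple roundTrip(LazyDelimitedTuple mapSide) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            mapSide.write(new DataOutputStream(buffer));
            LazyDelimitedTuple reduceSide = new LazyDelimitedTuple();
            reduceSide.readFields(
                new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return reduceSide;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] line = "alice\t42\tNYC".getBytes(StandardCharsets.UTF_8);
        LazyDelimitedTuple reduceSide = roundTrip(new LazyDelimitedTuple(line));
        System.out.println(reduceSide.get(1)); // prints 42
    }
}
```

In this sketch the map side never parses the record at all, and the reduce side parses it only on the first get(). That is the saving the issue is after; the restriction to load functions that share one serialization format follows from the tuple having to know how to split the raw bytes.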