Avoid serialization/deserialization costs for PigStorage data - Use custom Tuple
--------------------------------------------------------------------------------

                 Key: PIG-1474
                 URL: https://issues.apache.org/jira/browse/PIG-1474
             Project: Pig
          Issue Type: Improvement
    Affects Versions: 0.8.0
            Reporter: Thejas M Nair
            Assignee: Thejas M Nair
             Fix For: 0.8.0


Avoid sedes (serialization/deserialization) when possible for data loaded using
PigStorage by implementing approach #4 proposed in
http://wiki.apache.org/pig/AvoidingSedes.

The write() and readFields() functions of the tuple returned by TupleFactory are
used to serialize data between Map and Reduce. By using a tuple that knows the
serialization format of the loader, we avoid sedes at the Map-Reduce boundary
and use the load function's serialized format between Map and Reduce.
To use a new custom tuple for this purpose, a custom TupleFactory that returns
tuples of this type has to be specified using the property
"pig.data.tuple.factory.name".
This approach will work only for a set of load functions in the query that
share the same serialization format for maps and bags. If this approach proves
to be very useful, it will build a case for a more extensible approach.
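
As a rough illustration of the idea, the following is a minimal,
self-contained sketch and not Pig's actual implementation. The class name
RawLineTuple, the tab delimiter, the length-prefixed layout and the
roundTrip() helper are all assumptions made for this example. Only the
write(DataOutput) / readFields(DataInput) method shapes follow the Hadoop
Writable convention that Pig tuples use. The tuple keeps the record bytes
exactly as the loader produced them, writes them out unchanged, and parses
fields only when one is requested.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical tuple that carries the loader's raw delimited bytes. */
class RawLineTuple {
    private byte[] raw = new byte[0]; // one record, exactly as the loader read it
    private String[] fields;          // parsed lazily, only when a field is requested

    RawLineTuple() {}

    RawLineTuple(byte[] raw) {
        this.raw = raw;
    }

    /** Map side: emit the loader's bytes unchanged, with no per-field encoding. */
    void write(DataOutput out) throws IOException {
        out.writeInt(raw.length); // length prefix is an assumption of this sketch
        out.write(raw);
    }

    /** Reduce side: read the bytes back without parsing any field. */
    void readFields(DataInput in) throws IOException {
        raw = new byte[in.readInt()];
        in.readFully(raw);
        fields = null; // drop any previously parsed state
    }

    /** Split on tab (assumed delimiter) the first time a field is accessed. */
    String get(int index) {
        if (fields == null) {
            fields = new String(raw, StandardCharsets.UTF_8).split("\t", -1);
        }
        return fields[index];
    }

    boolean isParsed() {
        return fields != null;
    }

    /** Simulates the Map-Reduce boundary with an in-memory buffer. */
    static RawLineTuple roundTrip(RawLineTuple t) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        t.write(new DataOutputStream(buf));
        RawLineTuple copy = new RawLineTuple();
        copy.readFields(new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray())));
        return copy;
    }

    public static void main(String[] args) throws IOException {
        byte[] line = "alice\t42\tnyc".getBytes(StandardCharsets.UTF_8);
        RawLineTuple copy = roundTrip(new RawLineTuple(line));
        System.out.println(copy.isParsed()); // false: nothing parsed in transit
        System.out.println(copy.get(1));     // 42
    }
}
```

In a real deployment, a tuple like this would be returned by the custom
TupleFactory named in "pig.data.tuple.factory.name". The sketch covers only
flat, tab-delimited records. Maps and bags would need the shared
serialization format mentioned above.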

