Avoid serialization/deserialization costs for PigStorage data - Use custom Map 
and Bag implementation
-----------------------------------------------------------------------------------------------------

                 Key: PIG-1473
                 URL: https://issues.apache.org/jira/browse/PIG-1473
             Project: Pig
          Issue Type: Improvement
    Affects Versions: 0.8.0
            Reporter: Thejas M Nair
             Fix For: 0.8.0


The cost of serialization/deserialization (sedes) can be very high, and 
avoiding it will improve performance.

Avoid sedes when possible by implementing approach #3 proposed in 
http://wiki.apache.org/pig/AvoidingSedes .

The load function uses subclasses of Map and DataBag that hold the serialized 
copy.  The load function delays deserialization of map and bag types until a 
member function of java.util.Map or DataBag is called.
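
As a rough illustration of the idea above, here is a minimal sketch of a map 
that keeps the serialized bytes and parses them only on first access. This is 
not Pig's actual implementation: the class name LazySerializedMap, the helper 
methods isDeserialized() and rawBytes(), and the simplified "[key#value,...]" 
text encoding are all assumptions made for the example. A lazy DataBag 
subclass could presumably follow the same pattern.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not a Pig class: a read-only Map that holds the
// serialized bytes and deserializes them only when a Map method is called.
final class LazySerializedMap extends AbstractMap<String, Object> {
    private final byte[] serialized;     // bytes exactly as the loader read them
    private Map<String, Object> parsed;  // stays null until first access

    LazySerializedMap(byte[] serialized) {
        this.serialized = serialized;
    }

    // True once the bytes have been parsed (helper for the example only).
    boolean isDeserialized() {
        return parsed != null;
    }

    // Raw bytes, so a store step could write the column back without parsing.
    byte[] rawBytes() {
        return serialized;
    }

    // Parses the assumed "[k1#v1,k2#v2]" encoding; input is assumed well formed.
    // Not thread-safe: concurrent first access may parse more than once.
    private Map<String, Object> materialize() {
        if (parsed == null) {
            Map<String, Object> entries = new LinkedHashMap<>();
            String text = new String(serialized, StandardCharsets.UTF_8).trim();
            String body = text.substring(1, text.length() - 1); // drop '[' and ']'
            if (!body.isEmpty()) {
                for (String pair : body.split(",")) {
                    String[] kv = pair.split("#", 2);
                    entries.put(kv[0], kv.length > 1 ? kv[1] : null);
                }
            }
            parsed = Collections.unmodifiableMap(entries);
        }
        return parsed;
    }

    // AbstractMap derives size(), containsKey(), etc. from entrySet(),
    // so every java.util.Map member function triggers deserialization here.
    @Override
    public Set<Map.Entry<String, Object>> entrySet() {
        return materialize().entrySet();
    }

    @Override
    public Object get(Object key) {
        return materialize().get(key);
    }
}
```

In this sketch a column that is only passed through never pays the parsing 
cost; the first call such as get("b") or size() triggers it. The map is kept 
read-only so that the raw bytes always stay consistent with the parsed view.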

Example of a query where this will help:
{CODE}
l = LOAD 'file1' AS (a : int, b : map [ ]);
f = FOREACH l GENERATE udf1(a), b;      
fil = FILTER f BY $0 > 5;
dump fil; -- Deserialization of column b can be delayed until here using this approach.

{CODE}

