Avoid serialization/deserialization costs for PigStorage data - Use custom Map and Bag implementation -----------------------------------------------------------------------------------------------------
Key: PIG-1473 URL: https://issues.apache.org/jira/browse/PIG-1473 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Thejas M Nair Fix For: 0.8.0 Cost of serialization/deserialization (sedes) can be very high and avoiding it will improve performance. Avoid sedes when possible by implementing approach #3 proposed in http://wiki.apache.org/pig/AvoidingSedes . The load function uses subclass of Map and DataBag which holds the serialized copy. LoadFunction delays deserialization of map and bag types until a member function of java.util.Map or DataBag is called. Example of query where this will help - {CODE} l = LOAD 'file1' AS (a : int, b : map [ ]); f = FOREACH l GENERATE udf1(a), b; fil = FILTER f BY $0 > 5; dump fil; -- Serialization of column b can be delayed until here using this approach . {CODE} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.