[ https://issues.apache.org/jira/browse/PIG-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thejas M Nair resolved PIG-1473.
--------------------------------

    Resolution: Won't Fix

Implementing lazy deserialization with this approach would introduce a backward-incompatible change in PigStorage, so I am closing this JIRA as Won't Fix.

In PigStorage, if deserialization fails, the value is treated as null, i.e. tuple.get(i) returns null. But if deserialization is delayed by returning a subclass of Map or DataBag that holds the serialized data, the tuple.get(i) call will return a non-null value even if the serialized format has a problem.

Although this approach is not being implemented in PigStorage for this reason, other load/store functions can potentially adopt it.

> Avoid serialization/deserialization costs for PigStorage data - Use custom
> Map and Bag implementation
> -----------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1473
>                 URL: https://issues.apache.org/jira/browse/PIG-1473
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>
> The cost of serialization/deserialization (sedes) can be very high, and avoiding
> it will improve performance.
> Avoid sedes when possible by implementing approach #3 proposed in
> http://wiki.apache.org/pig/AvoidingSedes .
> The load function uses subclasses of Map and DataBag which hold the serialized
> copy. The load function delays deserialization of map and bag types until a
> member function of java.util.Map or DataBag is called.
> Example of a query where this will help -
> {CODE}
> l = LOAD 'file1' AS (a : int, b : map [ ]);
> f = FOREACH l GENERATE udf1(a), b;
> fil = FILTER f BY $0 > 5;
> dump fil; -- Deserialization of column b can be delayed until here using this
> approach.
> {CODE}
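
The compatibility concern in the resolution comment can be illustrated with a small, self-contained Java sketch. It is not Pig code: the class LazyMapSketch, its methods loadEagerly and loadLazily, the nested LazyMap class, and the simplified "[key#value,key#value]" text format are all assumptions made for illustration. The sketch only shows why an eager loader can map a malformed field to null while a lazy wrapper hands back a non-null object and fails later, on first access.

```java
import java.util.AbstractMap;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; none of these names are Pig APIs.
class LazyMapSketch {

    // Parses a simplified "[k#v,k#v]" map literal; throws on malformed input.
    static Map<String, Object> parse(String text) {
        if (!text.startsWith("[") || !text.endsWith("]")) {
            throw new IllegalArgumentException("malformed map: " + text);
        }
        Map<String, Object> result = new HashMap<>();
        String body = text.substring(1, text.length() - 1);
        if (body.isEmpty()) {
            return result;
        }
        for (String entry : body.split(",")) {
            int sep = entry.indexOf('#');
            if (sep < 0) {
                throw new IllegalArgumentException("malformed entry: " + entry);
            }
            result.put(entry.substring(0, sep), entry.substring(sep + 1));
        }
        return result;
    }

    // Eager behaviour: a field that cannot be parsed becomes null at load time.
    static Map<String, Object> loadEagerly(String text) {
        try {
            return parse(text);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // Lazy behaviour: keep the serialized text and parse only on first access.
    static final class LazyMap extends AbstractMap<String, Object> {
        private final String serialized;
        private Map<String, Object> parsed; // stays null until first access

        LazyMap(String serialized) {
            this.serialized = serialized;
        }

        @Override
        public Set<Map.Entry<String, Object>> entrySet() {
            if (parsed == null) {
                parsed = parse(serialized); // a malformed field fails here, not at load time
            }
            return parsed.entrySet();
        }
    }

    // The lazy loader never returns null, even for malformed input.
    static Map<String, Object> loadLazily(String text) {
        return new LazyMap(text);
    }

    public static void main(String[] args) {
        String malformed = "[k1#v1,oops";

        // Eager path: the caller sees null for the bad field.
        System.out.println("eager is null: " + (loadEagerly(malformed) == null));

        // Lazy path: the caller sees a non-null map; the error appears on access.
        Map<String, Object> lazy = loadLazily(malformed);
        System.out.println("lazy is null: " + (lazy == null));
        try {
            lazy.get("k1");
        } catch (IllegalArgumentException e) {
            System.out.println("lazy access failed: " + e.getMessage());
        }
    }
}
```

A script that relies on a null check to skip bad records would behave differently under the lazy variant, which is the incompatibility described above. A load/store function with no such existing contract would not face that constraint.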