Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "AvoidingSedes" page has been changed by ThejasNair.
http://wiki.apache.org/pig/AvoidingSedes?action=diff&rev1=3&rev2=4

--------------------------------------------------

  
  
  == Delaying/Avoiding deserialization at runtime ==
- These approaches do not involve any changes to core pig code. Load 
functions, or serialization between map and reduce, can be changed separately 
to improve performance.
+ These approaches (except 5) do not involve major changes to core pig code. 
Load functions, or serialization between map and reduce, can be changed 
separately to improve performance.
   1. '''!LoadFunctions make use of the public interface 
!LoadPushDown.pushDownProjection.''' Don't deserialize columns that are not 
required. This should always improve performance. !PigStorage indirectly 
works this way: if a column is not used, the optimizer removes the casting 
(i.e. deserialization) of the column from the type-casting foreach statement 
that comes after the load.
   1. '''!LoadFunction returns a custom tuple, which deserializes fields only 
when tuple.get(i) is called.''' This can be useful if the first operator 
after load is a filter operator: the whole filter expression might not have to 
be evaluated, so not all columns might have to be deserialized. Assuming the 
first approach is already implemented, this approach is likely to add some 
overhead for queries where tuple.get(i) is called on all columns of all rows.
   1. '''!LoadFunction delays deserialization of map and bag types until a 
member function of java.util.Map or !DataBag is called.''' The load function 
uses subclasses of Map and !DataBag that hold the serialized copy. This helps 
delay the deserialization further. This can't be done for scalar types 
because the classes pig uses for them are final; even if that were not the 
case, we might not see much of a performance gain, because the cost of 
creating a copy of the serialized data might be high compared to the cost of 
deserialization. This will only delay deserialization up to the MR boundaries. 
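
The second approach (a tuple that deserializes a field only when get(i) is 
called) can be sketched roughly as below. This is a minimal illustration and 
not pig code: the class LazyTuple, its tab-delimited UTF-8 record layout, and 
the deserializedCount() helper are all assumptions made for the example, and 
the class does not implement pig's Tuple interface. It only shows how a 
short-circuited filter can leave most fields undeserialized.

```java
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical lazy tuple (an assumption for illustration, not pig's Tuple):
 * keeps the raw tab-delimited record and turns a field into a String only
 * the first time get(i) is called for it.
 */
class LazyTuple {
    private final byte[] raw;       // serialized record as read by the loader
    private final int[] starts;     // start offset of each field in raw
    private final int[] ends;       // end offset (exclusive) of each field
    private final String[] cache;   // deserialized values; null = not yet read
    private int deserializedCount = 0;

    LazyTuple(byte[] raw) {
        this.raw = raw;
        // Count fields with a cheap byte scan; no per-field objects yet.
        int n = 1;
        for (byte b : raw) {
            if (b == '\t') n++;
        }
        starts = new int[n];
        ends = new int[n];
        cache = new String[n];
        // Record field boundaries only.
        int field = 0;
        for (int p = 0; p < raw.length; p++) {
            if (raw[p] == '\t') {
                ends[field] = p;
                field++;
                starts[field] = p + 1;
            }
        }
        ends[field] = raw.length;
    }

    int size() {
        return starts.length;
    }

    /** Deserializes field i on first access and caches the result. */
    String get(int i) {
        if (cache[i] == null) {
            cache[i] = new String(raw, starts[i], ends[i] - starts[i],
                                  StandardCharsets.UTF_8);
            deserializedCount++;
        }
        return cache[i];
    }

    /** Number of fields deserialized so far (for demonstration only). */
    int deserializedCount() {
        return deserializedCount;
    }

    public static void main(String[] args) {
        LazyTuple t = new LazyTuple(
            "25\talice\tlong payload".getBytes(StandardCharsets.UTF_8));
        // Filter "age > 30 AND name == 'alice'": the first test fails, so
        // the && short-circuits and fields 1 and 2 are never deserialized.
        boolean pass = Integer.parseInt(t.get(0)) > 30
                       && t.get(1).equals("alice");
        System.out.println("pass=" + pass
            + ", fields deserialized=" + t.deserializedCount());
        // prints "pass=false, fields deserialized=1"
    }
}
```

The offset arrays and the null check in get(i) are the bookkeeping that 
becomes pure overhead when a query ends up reading every column of every row, 
which is the trade-off noted for this approach above.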
