Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "AvoidingSedes" page has been changed by ThejasNair.
http://wiki.apache.org/pig/AvoidingSedes

--------------------------------------------------

New page:
= Avoiding Serialization/De-serialization in pig
Serialization/De-serialization is expensive and avoiding it will improve 
performance.


= Delaying/Avoiding deserialization at runtime
These approaches do not involve any changes to core Pig code. Load functions, 
or serialization between map and reduce, can be changed separately to improve 
performance.
 1. '''!LoadFunctions make use of public interface 
!LoadPushDown.pushDownProjection.''' Don't deserialize columns that are not 
in the required field list. This should always improve performance. !PigStorage 
indirectly works this way: if a column is not used, the optimizer removes the 
casting (i.e. deserialization) of the column from the type-casting foreach 
statement which comes after the load.
 1. '''!LoadFunction returns a custom tuple, which deserializes fields only when 
tuple.get(i) is called.''' This can be useful if the first operator after 
load is a filter operator: the whole filter expression might not have to be 
evaluated, so some columns might never have to be deserialized. 
Assuming the first approach is already implemented, this approach is 
likely to add some overhead for queries where tuple.get(i) is called on 
all columns/rows.
 1. '''!LoadFunction delays deserialization of map and bag types until a member 
function of java.util.Map or !DataBag is called.''' The load function uses 
subclasses of Map and !DataBag which hold the serialized copy. This will help in 
delaying the deserialization further. This can't be done for scalar types 
because the classes Pig uses for them are final; even if that were not the case, 
we might not see much of a performance gain, because the cost of creating a 
copy of the serialized data might be high compared to the cost of 
deserialization. This will only delay deserialization up to the MR boundaries. 
{{{
-- Example of a query where this will help:
l = LOAD 'file1' AS (a: int, b: map[]);
-- Approach 2 will not help in delaying deserialization beyond this point.
f = FOREACH l GENERATE udf1(a), b;
fil = FILTER f BY $0 > 5;
-- Deserialization of column b can be delayed until here using this approach.
DUMP fil;
}}}
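
As a rough illustration of approach 3, the sketch below shows a map that holds on to the loader's serialized bytes and only parses them when a java.util.Map method is first called. It is not Pig code: the names `LazyMapSketch`, `LazyMap`, `decodeCount` and `rawBytes()`, as well as the toy `k1#v1,k2#v2` byte format, are assumptions made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these classes are part of Pig.
class LazyMapSketch {

    /** A Map that keeps the serialized bytes and parses them on first access. */
    static class LazyMap extends AbstractMap<String, Object> {
        private final byte[] serialized;      // raw bytes as read by the loader
        private Map<String, Object> decoded;  // stays null until a Map method is called
        int decodeCount = 0;                  // demo only: number of deserializations

        LazyMap(byte[] serialized) {
            this.serialized = serialized;
        }

        /** Parses the assumed toy format "k1#v1,k2#v2" (UTF-8) at most once. */
        private Map<String, Object> decode() {
            if (decoded == null) {
                decodeCount++;
                decoded = new LinkedHashMap<>();
                String text = new String(serialized, StandardCharsets.UTF_8);
                if (!text.isEmpty()) {
                    for (String entry : text.split(",")) {
                        String[] kv = entry.split("#", 2);
                        decoded.put(kv[0], kv.length > 1 ? kv[1] : null);
                    }
                }
            }
            return decoded;
        }

        // AbstractMap builds get(), size(), containsKey(), ... on top of entrySet(),
        // so any of those calls triggers decode().
        @Override
        public Set<Map.Entry<String, Object>> entrySet() {
            return decode().entrySet();
        }

        /** Lets a cooperating writer pass the bytes along without deserializing. */
        byte[] rawBytes() {
            return serialized;
        }
    }

    public static void main(String[] args) {
        LazyMap b = new LazyMap("name#alice,city#paris".getBytes(StandardCharsets.UTF_8));
        System.out.println(b.decodeCount);  // 0: nothing has been parsed yet
        System.out.println(b.get("city"));  // paris
        System.out.println(b.decodeCount);  // 1: parsed once, on first access
    }
}
```

In the query above, column b would travel through the FOREACH and FILTER as such an object and would only be parsed if something actually reads its entries.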
 1.#4 '''Set the property "pig.data.tuple.factory.name" to use a tuple that 
understands the serialization format used for bags and maps in approach 3, so 
that serialized data can be passed from the loader across MR boundaries in the 
serialization format of the load function.''' The write() and readFields() 
functions of the tuple returned by !TupleFactory are used to serialize data 
between Map and Reduce. To use a new custom tuple, you need to use a custom 
!TupleFactory that returns tuples of this type. But this approach will work only 
for a set of load functions in the query that share the same serialization 
format for maps and bags.
 1. '''Expose the load function's sedes functionality in a new interface and 
track lineage of columns.''' This will be the elegant and extensible way of 
doing what is proposed in approach 4. For each serialized column, if we know the 
deserialization function, we can delay deserialization across MR boundaries.
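
The first two approaches in the list above can be sketched together in a few lines of Java. This is only an illustration under simplifying assumptions: `LazyTupleSketch`, `LazyTuple` and `decodeCount` are hypothetical names, the class does not implement Pig's Tuple interface, every column is assumed to be a tab-delimited int, and pruned columns simply read back as null.

```java
// Hypothetical sketch; it does not implement Pig's Tuple or LoadPushDown interfaces.
class LazyTupleSketch {

    static class LazyTuple {
        private final String[] raw;      // undecoded text of each kept column
        private final Object[] decoded;  // cache of deserialized values
        private final boolean[] done;    // which columns were already handled
        int decodeCount = 0;             // demo only: number of fields deserialized

        /** Splits a tab-delimited line but does not parse any field yet. */
        LazyTuple(String line, boolean[] required) {
            String[] parts = line.split("\t", -1);
            raw = new String[parts.length];
            for (int i = 0; i < parts.length; i++) {
                // Approach 1: columns outside the required list are dropped, never parsed.
                raw[i] = (i < required.length && required[i]) ? parts[i] : null;
            }
            decoded = new Object[parts.length];
            done = new boolean[parts.length];
        }

        /** Approach 2: deserialize column i only when it is first requested. */
        Object get(int i) {
            if (!done[i]) {
                if (raw[i] != null) {
                    decoded[i] = Integer.valueOf(raw[i]);
                    decodeCount++;
                }
                done[i] = true;
            }
            return decoded[i];
        }
    }

    public static void main(String[] args) {
        // Column 1 is not required by the (imaginary) query, so it is pruned.
        LazyTuple t = new LazyTuple("3\t7\t9", new boolean[] {true, false, true});

        // A filter such as "$0 > 5 AND $2 > 0" short-circuits on the first test,
        // so column 2 is never deserialized for this row.
        boolean pass = (Integer) t.get(0) > 5 && (Integer) t.get(2) > 0;
        System.out.println(pass);           // false
        System.out.println(t.decodeCount);  // 1
    }
}
```

The per-field bookkeeping (the `done` array and the cache) is the kind of overhead that approach 2 adds when every column of every row ends up being read anyway.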
