Re: Rewire and multi-query load/store optimization

Alan Gates Tue, 16 Jun 2009 10:33:27 -0700

+1 on option one. The use of store->load was only to overcome atemporary problem in Pig. We've fixed the problem, so let's notpropagate it. We will need to document this very clearly (maybe evento the point of issuing warnings in the parser when we see this combo)so users understand that this is now a hinderance rather than a help.


Alan.


On Jun 12, 2009, at 2:19 PM, Santhosh Srinivasan wrote:

With the implementation of rewire as part of the optimizer

infrastructure, a bug was exposed in the load/store optimization inthe

multi-query feature. Below, I will articulate the bug and the
ramifications of a few possible solutions.

Load/store optimization in the multi-query feature?
---------------------------------------------------

If a script has an explicit store and a corresponding load which loads

the output of the store, the store-load combination can beoptimized. An

example will illustrate the concept.

Pre-conditions:

1. The store location and the load location should match
2. The store format and the load format should be compatible

{code}

A = load 'input';
B = group A by $0;
store B into 'output';
C = load 'output';
D = group C by $0;
store D into 'some_other_output';

{code}

In the script above, the output of the first store serves as input of

the second load (C). In addition, the store and load usePigStorage() as

the store/load mechanism. In the logical plan this combination by
splitting B into the store and D.

Bug
---

When the load in the store/load combination was removed, the innerplans

of the load's successors (in this case D), were not updated correctly.

As a result, the projections in the inner plans still heldreferences to

non-existing operators.

Consequence of the bug fix
---------------------------

During the map-reduce (M/R) compilation the split operator is compiled
into a store and a load. Prior to multi-query, for each M/R boundary
resulted in a temporary store using BinStorage. The subsequent load

could infer the type as BinStorage returns typed records, i.e., non-byte

array records.

With multi-query and the load/store optimization, the temporary

BinStorage data is not generated. Instead, the subsequent load usesthe

output of the previous store as its input. Here, the loader can get

typed or untyped records based on the loader. As a result, theoperators

in the map phase that rely on the type information (inferred from the
logical plan) will fail due to type mismatch.

Possible Solutions
------------------

Solution 1
==========
Switch the load/store optimization. Users were primarily storing
intermediate data within the same script to overcome Pig's limitation,
i.e., absence of the multi-query feature. Going forward, with

multi-query turned on, users who store intermediate data will notenjoy

all the benefits of the optimization.

Solution 2
==========
After the M/R compilation is completed, during the final pass of the

plan, fix the types of the projections to reflect typed/untypeddata. Inother words, if the loader is returning typed data then retain thetypes

else change the types to bytearray. In order to make this decision,

loaders should support an interface to indicate if the records aretyped

or untyped.


Thanks,
Santhosh

Re: Rewire and multi-query load/store optimization

Reply via email to