+1 on option one. The use of store->load was only to overcome a
temporary problem in Pig. We've fixed the problem, so let's not
propagate it. We will need to document this very clearly (maybe even
to the point of issuing warnings in the parser when we see this combo)
so users understand that this is now a hinderance rather than a help.
Alan.
On Jun 12, 2009, at 2:19 PM, Santhosh Srinivasan wrote:
With the implementation of rewire as part of the optimizer
infrastructure, a bug was exposed in the load/store optimization in
the
multi-query feature. Below, I will articulate the bug and the
ramifications of a few possible solutions.
Load/store optimization in the multi-query feature?
---------------------------------------------------
If a script has an explicit store and a corresponding load which loads
the output of the store, the store-load combination can be
optimized. An
example will illustrate the concept.
Pre-conditions:
1. The store location and the load location should match
2. The store format and the load format should be compatible
{code}
A = load 'input';
B = group A by $0;
store B into 'output';
C = load 'output';
D = group C by $0;
store D into 'some_other_output';
{code}
In the script above, the output of the first store serves as input of
the second load (C). In addition, the store and load use
PigStorage() as
the store/load mechanism. In the logical plan this combination by
splitting B into the store and D.
Bug
---
When the load in the store/load combination was removed, the inner
plans
of the load's successors (in this case D), were not updated correctly.
As a result, the projections in the inner plans still held
references to
non-existing operators.
Consequence of the bug fix
---------------------------
During the map-reduce (M/R) compilation the split operator is compiled
into a store and a load. Prior to multi-query, for each M/R boundary
resulted in a temporary store using BinStorage. The subsequent load
could infer the type as BinStorage returns typed records, i.e., non-
byte
array records.
With multi-query and the load/store optimization, the temporary
BinStorage data is not generated. Instead, the subsequent load uses
the
output of the previous store as its input. Here, the loader can get
typed or untyped records based on the loader. As a result, the
operators
in the map phase that rely on the type information (inferred from the
logical plan) will fail due to type mismatch.
Possible Solutions
------------------
Solution 1
==========
Switch the load/store optimization. Users were primarily storing
intermediate data within the same script to overcome Pig's limitation,
i.e., absence of the multi-query feature. Going forward, with
multi-query turned on, users who store intermediate data will not
enjoy
all the benefits of the optimization.
Solution 2
==========
After the M/R compilation is completed, during the final pass of the
plan, fix the types of the projections to reflect typed/untyped
data. In
other words, if the loader is returning typed data then retain the
types
else change the types to bytearray. In order to make this decision,
loaders should support an interface to indicate if the records are
typed
or untyped.
Thanks,
Santhosh