Ah I see, PIG-1324..
On Sun, Dec 4, 2011 at 8:15 AM, Dmitriy Ryaboy <[email protected]> wrote:
> flatten(lineitem) uses all the fields from lineitem, hence no pruning.
>
> On Fri, Dec 2, 2011 at 6:42 PM, Jie Li <[email protected]> wrote:
>> Sure. The two lines in bold are just dropping out non-necessary fields.
>> Without them Pig would not project, especially for the table lineitem.
>>
>> lineitem = load '$input/lineitem' USING PigStorage('|') as
>> (l_orderkey:long, l_partkey:long, l_suppkey:long, l_linenumber:long,
>> l_quantity:double, l_extendedprice:double, l_discount:double, l_tax:double,
>> l_returnflag:chararray, l_linestatus:chararray, l_shipdate:chararray,
>> l_commitdate:chararray, l_receiptdate:chararray,l_shippingstruct:chararray,
>> l_shipmode:chararray, l_comment:chararray);
>>
>> part = load '$input/part' USING PigStorage('|') as (p_partkey:long,
>> p_name:chararray, p_mfgr:chararray, p_brand:chararray, p_type:chararray,
>> p_size:long, p_container:chararray, p_retailprice:double,
>> p_comment:chararray);
>>
>> *lineitem = foreach lineitem generate l_partkey, l_quantity,
>> l_extendedprice ;*
>> part = FILTER part BY p_brand == 'Brand#23' AND p_container == 'MED BOX';
>> *part = foreach part generate p_partkey;*
>>
>> COG1 = COGROUP part by p_partkey, lineitem by l_partkey;
>> COG1 = filter COG1 by COUNT(part) > 0;
>> COG2 = FOREACH COG1 GENERATE COUNT(part) as count_part, FLATTEN(lineitem),
>> 0.2 * AVG(lineitem.l_quantity) as l_avg;
>>
>> COG3 = filter COG2 by l_quantity < l_avg;
>> COG = foreach COG3 generate (l_extendedprice * count_part) as l_sum;
>>
>> G1 = group COG ALL;
>>
>> result = foreach G1 generate SUM(COG.l_sum)/7.0;
>>
>> On Fri, Dec 2, 2011 at 9:16 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>
>>> Can you provide a script that shows projection not happening? We've
>>> observed the opposite (and use that fact extensively)
>>>
>>> D
>>>
>>> On Fri, Dec 2, 2011 at 4:05 PM, Jie Li <[email protected]> wrote:
>>> > Hi all,
>>> >
>>> > We just figured out Pig 0.9.1 doesn't drop those non-necessary fields
>>> > asap, which really affects the performance. Though
>>> > http://ofps.oreilly.com/titles/9781449302641/load_and_store_funcs.html#loadfunc_loaderpushdown
>>> > said that "As part of its optimizations Pig analyzes Pig Latin scripts and
>>> > determines what fields in an input it needs at each step in the script. It
>>> > uses this information to aggressively drop fields it no longer needs."
>>> >
>>> > We also found that Pig casts the data into the types defined in the
>>> > schema, which is usually unnecessary, as most of them will be soon dropped.
>>> >
>>> > To work around these, we have to manually drop those fields and remove
>>> > the types in the schema, which are really not interesting.
>>> >
>>> > Jie
