Well, see, that's the thing: sorting A by $0 is already O(n log n). Ahh, I see, my own example suffers from this problem.
I guess I'm wondering how 'limit' works in conjunction with UDFs... A practical application escapes me right now, but if I do

    C = foreach B {
        C1 = MyUdf(B.bag_on_b);
        C2 = limit C1 5;
    }

does it know to push the limit in this case?

On Thu, May 27, 2010 at 2:32 PM, Alan Gates <ga...@yahoo-inc.com> wrote:

> The default case is that UDFs that take bags (such as COUNT, etc.) are
> handed the entire bag at once. In the case where all UDFs in a foreach
> implement the algebraic interface and the expression itself is algebraic,
> then the combiner will be used, thus significantly limiting the size of
> the bag handed to the UDF. The accumulator does hand records to the UDF
> a few thousand at a time. Currently it has no way to turn off the flow
> of records.
>
> What you want might be accomplished by the LIMIT operator, which can be
> used inside a nested foreach. Something like:
>
> C = foreach B {
>     C1 = order A by $0;
>     C2 = limit C1 5;
>     generate myUDF(C2);
> }
>
> Alan.
>
> On May 26, 2010, at 11:59 AM, hc busy wrote:
>
>> Hey, guys, how are Bags passed to EvalFunc stored?
>>
>> I was looking at the Accumulator interface and it says that the reason
>> why this is needed for COUNT and SUM is because EvalFunc always gives
>> you the entire bag when the EvalFunc is run on a bag.
>>
>> I always thought that if I did COUNT(TABLE) or SUM(TABLE.FIELD), and the
>> code inside that does
>>
>>     for (Tuple entry : inputDataBag) {
>>         .... stuff
>>     }
>>
>> then that was an actual iterator that iterated on the bag sequentially
>> without necessarily having the entire bag in memory all at once?
>> Because it's an iterator, there's no way to do anything other than
>> stream through it.
>>
>> I'm looking at this because Accumulator has no way of telling Pig "I've
>> seen enough". It streams through the entire bag no matter what happens.
>> (Like, hypothetically speaking, if I was writing a "5th item of a sorted
>> bag" UDF, after I see the 5th of a 5 million entry bag, I want to stop
>> executing if possible.)
>>
>> Is there an easy way to make this happen?
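
For the "5th item of a sorted bag" case, here is a rough sketch in plain Java of what an accumulator-style function could do when it is fed records in batches and cannot stop the flow. It is not Pig's Accumulator interface: the class name KthSmallest and the accumulate/getValue methods are hypothetical stand-ins, and plain Integers stand in for Tuples. The idea is to keep only the k smallest values in a bounded max-heap, so every record is still visited, but the work is O(n log k) with O(k) memory instead of a full sort.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-in for an accumulator-style UDF. This is NOT Pig's API;
// it only mimics "records arrive in batches, no way to say stop".
public class KthSmallest {
    private final int k;
    // Max-heap holding the k smallest values seen so far; its head is the
    // largest of those, i.e. the current candidate for the k-th smallest.
    private final PriorityQueue<Integer> heap;

    public KthSmallest(int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be >= 1");
        }
        this.k = k;
        this.heap = new PriorityQueue<>(k, Collections.reverseOrder());
    }

    // Called once per batch of records (assumed unsorted).
    public void accumulate(List<Integer> batch) {
        for (int v : batch) {
            if (heap.size() < k) {
                heap.add(v);
            } else if (v < heap.peek()) {
                // v displaces the largest of the k values kept so far.
                heap.poll();
                heap.add(v);
            }
        }
    }

    // Returns the k-th smallest value seen, or null if fewer than k arrived.
    public Integer getValue() {
        return heap.size() < k ? null : heap.peek();
    }

    public static void main(String[] args) {
        KthSmallest acc = new KthSmallest(5);
        acc.accumulate(Arrays.asList(9, 3, 7, 1, 8));   // first batch
        acc.accumulate(Arrays.asList(2, 6, 4, 5, 10));  // second batch
        System.out.println(acc.getValue());             // prints 5
    }
}
```

This does not give the early exit asked about above, since an unsorted bag has to be read to the end anyway; it only avoids sorting or holding the whole bag.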