On Sep 5, 2012, at 6:30 PM, Prasanth J wrote:

> Ahh.. Now it makes more sense.
> 
> I think I have the solution. I was adding tuples to a List<Tuple> and then
> creating a DataBag from that list at the end. Instead, I should create the
> bag first and keep adding tuples to it. Is that correct?
Yes.

Alan.
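
For readers of the archive, here is a minimal sketch of the pattern confirmed
above: create the output bag first and add each projected tuple as it is
produced, so the bag can spill instead of holding everything in a list. It
is plain Java with no Pig dependency, and every name in it
(SpillableBagSketch, dropField, project, memLimit) is hypothetical; it is not
Pig's DataBag implementation. In an actual UDF the bag would presumably come
from BagFactory.getInstance().newDefaultBag() and be filled with
DataBag.add(Tuple).

```java
import java.io.*;
import java.util.*;

// Hypothetical stand-in for a spillable, append-only bag. NOT Pig's DataBag code.
class SpillableBagSketch {
    private final int memLimit;                         // max tuples buffered in memory
    private final List<long[]> buffer = new ArrayList<>();
    private final File spillFile;
    private long spilled = 0;                           // tuples already written to disk

    SpillableBagSketch(int memLimit) {
        this.memLimit = memLimit;
        File f;
        try {
            f = File.createTempFile("bag-sketch", ".spill");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        f.deleteOnExit();
        this.spillFile = f;
    }

    // Append-only: once the buffer reaches memLimit it is flushed to disk.
    // Flushed tuples are no longer addressable in memory, which is why
    // modifying them in place is not an option in this model.
    void add(long[] tuple) {
        buffer.add(tuple);
        if (buffer.size() >= memLimit) spill();
    }

    private void spill() {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(spillFile, true)))) {
            for (long[] t : buffer) {
                out.writeInt(t.length);
                for (long v : t) out.writeLong(v);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        spilled += buffer.size();
        buffer.clear();
    }

    int inMemoryCount() { return buffer.size(); }

    // Demo helper: reads spilled tuples first, then buffered ones.
    // It materializes everything, so it is only suitable for small examples.
    List<long[]> readAll() {
        List<long[]> result = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(spillFile)))) {
            for (long i = 0; i < spilled; i++) {
                long[] t = new long[in.readInt()];
                for (int j = 0; j < t.length; j++) t[j] = in.readLong();
                result.add(t);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        result.addAll(buffer);
        return result;
    }

    // Copy a tuple without the field at position idx (0-based).
    static long[] dropField(long[] t, int idx) {
        long[] out = new long[t.length - 1];
        System.arraycopy(t, 0, out, 0, idx);
        System.arraycopy(t, idx + 1, out, idx, t.length - idx - 1);
        return out;
    }

    // Stream the input once, adding each projected tuple straight to the bag
    // rather than collecting them in an intermediate list.
    static SpillableBagSketch project(Iterator<long[]> input, int idx, int memLimit) {
        SpillableBagSketch out = new SpillableBagSketch(memLimit);
        while (input.hasNext()) out.add(dropField(input.next(), idx));
        return out;
    }

    public static void main(String[] args) {
        List<long[]> input = Arrays.asList(
            new long[]{111, 222, 3, 121},
            new long[]{112, 223, 2, 131},
            new long[]{113, 224, 4, 141});
        SpillableBagSketch out = project(input.iterator(), 2, 2);
        for (long[] t : out.readAll()) System.out.println(Arrays.toString(t));
    }
}
```

With the sample bag from the original question and a buffer limit of two
tuples, the first two projected tuples are spilled to the temporary file and
the third stays in memory; reading the bag back yields (111,222,121),
(112,223,131), (113,224,141).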

> Thanks Alan. 
> 
> Thanks
> -- Prasanth
> 
> On Sep 5, 2012, at 9:24 PM, Alan Gates <ga...@hortonworks.com> wrote:
> 
>> You cannot modify a bag once it is written.  The implementation is written 
>> around the assumption that bags are immutable after they are written.  
>> 
>> Creating a new bag should not cause an OOM exception, as bags are built to
>> spill when they grow too large.  In fact it's this spilling feature that
>> makes in-place modification impossible.
>> 
>> Alan.
>> 
>> On Sep 5, 2012, at 6:08 PM, Prasanth J wrote:
>> 
>>> Hello devs
>>> 
>>> I have a specific case where I need to modify the contents of a DataBag
>>> (remove a field from each tuple), but I want to do it in place and do not
>>> want to create another DataBag with a new set of tuples.
>>> Say I have the following input bag for a UDF:
>>> 
>>> {(111,222,3,121), (112,223,2,131), (113,224,4,141)}
>>> 
>>> I want to iterate through this bag and generate an output bag, removing the
>>> 3rd field of each tuple in the bag, to get the following output:
>>> {(111,222,121), (112,223,131), (113,224,141)}
>>> 
>>> Since the number of tuples in this bag is expected to be large, I cannot
>>> create a new set of tuples and build a bag from them, as this would cause
>>> an OOM exception.
>>> 
>>> Also, I do not want to flatten this bag, as it will be passed to the
>>> DISTINCT operator for computing the distinct elements in the bag.
>>> As far as I can see from the javadocs for DataBag, there is no way to
>>> convert a bag on the fly. Is there any other way to solve this?
>>> 
>>> Thanks
>>> -- Prasanth
>>> 
>> 
> 
