pig-user  

Re: Large tuples

pi song
Tue, 10 Jun 2008 20:43:12 -0700

I think the current implementation follows Hadoop in that data can only be
partitioned horizontally (by record). It does no vertical partitioning
(splitting a single record) at all. Beyond the data bag issue, we would also
have to think about how to partition the processing itself, i.e. how each
logical operator would run on part of a tuple. I looked into matrix
processing in Pig a while ago, but it seemed difficult for exactly this reason.
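
As a rough illustration of the bag-spilling idea Alan describes below, here
is a toy Python sketch. It is not Pig's DataBag code: the class name
SpillableBag, the max_in_memory threshold and the use of pickle and temp
files are all hypothetical choices made for the example. It only shows why a
tuple holding a reference to such a bag stays small even when the bag is not.

```python
import pickle
import tempfile


class SpillableBag:
    """Toy bag (hypothetical, not Pig's DataBag): keeps a bounded number of
    items in memory and writes full batches to temporary files."""

    def __init__(self, max_in_memory=1000):
        self.max_in_memory = max_in_memory
        self._buffer = []        # items currently held in memory
        self._spill_files = []   # one temp file per spilled batch
        self._size = 0

    def add(self, item):
        self._buffer.append(item)
        self._size += 1
        if len(self._buffer) >= self.max_in_memory:
            self._spill()

    def _spill(self):
        # Write the in-memory batch to disk and release the memory.
        f = tempfile.TemporaryFile()
        for item in self._buffer:
            pickle.dump(item, f)
        self._spill_files.append(f)
        self._buffer = []

    def __len__(self):
        return self._size

    def __iter__(self):
        # Stream spilled batches back one item at a time, then the
        # items that are still in memory.
        for f in self._spill_files:
            f.seek(0)
            while True:
                try:
                    yield pickle.load(f)
                except EOFError:
                    break
        yield from self._buffer


# A grouped "tuple" holds only the key and a reference to the bag.
bag = SpillableBag(max_in_memory=3)
for i in range(10):
    bag.add(i)
group_tuple = ("some_key", bag)
```

Note that the sketch spills on a fixed item count. A real implementation
would more likely react to memory pressure, which is the part Alan says is
still fragile.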

On 6/11/08, Alan Gates <[EMAIL PROTECTED]> wrote:
>
> In general, Pig expects that its data structures can fit into memory. The
> one exception to this assumption is data bags, which are explicitly
> constructed to support spilling to disk when they do not fit in
> memory. So a tuple that contains a very large data bag (which is the case
> you give) can be handled, because the tuple itself only contains a
> reference to the data bag, and the data bag will spill to disk. The case
> that Pig cannot currently handle is when the rest of the tuple does not fit
> in memory. So if a tuple contains a very large map or string, or 1+ billion
> integers, or something like that, Pig will fail.
>
> All this said, the code that handles spilling bags and freeing memory does
> not work ideally yet (as you've seen from the discussions regarding the GC
> overhead bug), so Pig sometimes dies when it shouldn't.
>
> For a reference on spilling, see src/org/apache/pig/data/DataBag.java and
> the classes that extend it.
>
> Alan.
>
> Mridul Muralidharan wrote:
>
>> Hi,
>>
>>  How does Pig handle really large tuples?
>> Suppose that, after a group, the resulting alias has a small subset of
>> tuples (out of the many that were generated) that are really large:
>> in excess of a gigabyte as a ballpark figure, so that a single tuple is
>> spread across many DFS blocks.
>>
>> Does Pig handle this case? If yes, how (refs/RTFM would be great too)?
>>
>> Thanks,
>> Mridul
>>
>