[ https://issues.apache.org/jira/browse/PIG-96?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566637#action_12566637 ]

Benjamin Reed commented on PIG-96:
----------------------------------

The bags we are spilling need to be processed on a single machine. The really 
big bag that represents a relation is already in HDFS and spread across 
machines. (I would really like to use a different term for bags that represent 
a relation versus bags that represent a group of tuples inside another tuple, 
to avoid confusion in these kinds of discussions.) If the bag is being 
processed by an algebraic function, we have already applied disjoint-subset 
parallelism, so the only thing left is to spill to disk. Since the bag must be 
processed locally, we want to keep it local and not put it on HDFS. The spill 
is also extremely temporary in nature, since the bag will be processed locally 
and then thrown away. 
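The local-spill behavior described above can be sketched as a minimal spillable bag. This is a hypothetical simplification, not Pig's actual DataBag code: tuples (strings here) accumulate in memory, and once a crude size estimate crosses a threshold they are appended to a local temp file that is thrown away when the JVM exits, mirroring the "keep it local and temporary" argument.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of a locally spilling bag (not Pig's DataBag):
// tuples stay in memory until an estimated size threshold is crossed,
// then the in-memory portion is appended to a local temp file.
public class SpillableBag implements Iterable<String> {
    private final long threshold;                       // spill when estimate exceeds this
    private final List<String> memory = new ArrayList<>();
    private Path spillFile;                             // lazily created local temp file
    private long memorySize = 0;

    public SpillableBag(long thresholdBytes) {
        this.threshold = thresholdBytes;
    }

    public void add(String tuple) throws IOException {
        memory.add(tuple);
        memorySize += tuple.length();                   // crude size estimate
        if (memorySize > threshold) {
            spill();
        }
    }

    // Append the in-memory portion to a local temp file and clear it.
    private void spill() throws IOException {
        if (spillFile == null) {
            spillFile = Files.createTempFile("pig-bag-", ".spill");
            spillFile.toFile().deleteOnExit();          // spill is temporary by design
        }
        Files.write(spillFile, memory, StandardOpenOption.APPEND);
        memory.clear();
        memorySize = 0;
    }

    // Iterate spilled tuples first, then whatever is still in memory.
    @Override
    public Iterator<String> iterator() {
        List<String> all = new ArrayList<>();
        try {
            if (spillFile != null) {
                all.addAll(Files.readAllLines(spillFile));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        all.addAll(memory);
        return all.iterator();
    }

    public static void main(String[] args) throws IOException {
        SpillableBag bag = new SpillableBag(10);        // tiny threshold to force spills
        for (int i = 0; i < 5; i++) {
            bag.add("tuple" + i);
        }
        int count = 0;
        for (String t : bag) {
            count++;
        }
        System.out.println(count);                      // all tuples survive the spill
    }
}
```

Because the consumer reads the bag back sequentially on the same machine and then discards it, a plain local file suffices; writing through HDFS would add replication and network cost for data that never leaves the node.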

> It should be possible to spill big databags to HDFS
> ---------------------------------------------------
>
>                 Key: PIG-96
>                 URL: https://issues.apache.org/jira/browse/PIG-96
>             Project: Pig
>          Issue Type: Improvement
>          Components: data
>            Reporter: Pi Song
>
> Currently databags only get spilled to local disk, which costs two disk I/O 
> operations. If databags are too big, this is not efficient. 
> We should take advantage of HDFS: if a databag is too big (determined by 
> DataBag.getMemorySize() > a large threshold), let's spill it to HDFS, and 
> read from HDFS in parallel when the data is required.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
