[ https://issues.apache.org/jira/browse/PIG-96?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566200#action_12566200 ]
Alan Gates commented on PIG-96:
-------------------------------
DataBags are spilled only when they are too large for memory, so an individual
spill file isn't more than a few GB. All the spill files together could be
larger, so we could open one HDFS spill file and keep appending. But this
won't work in the sorted or distinct case. For the DefaultDataBag case we read
the various spill files back serially anyway, so whether they are on one disk
or many doesn't matter. The only case where writing to HDFS would help us here
is the case where the total bag exceeds the size of the local disk of the
machine.
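
For illustration only, here is a minimal sketch (not Pig's actual spill code) of the "open one HDFS spill file and keep appending" approach for the DefaultDataBag case. It assumes the standard Hadoop FileSystem API and Tuple's Writable-style write(DataOutput); the class and field names are hypothetical.

{code:java}
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.pig.data.Tuple;

// Sketch: append each in-memory batch of tuples to a single HDFS spill file.
// This only fits DefaultDataBag, where the order of spill chunks does not
// matter; sorted/distinct bags need separate files so they can be merged
// back together on read.
public class HdfsSpillSketch {
    private final FileSystem fs;
    private final Path spillFile;       // single spill file on HDFS (hypothetical)
    private FSDataOutputStream out;     // kept open so each spill appends

    public HdfsSpillSketch(Configuration conf, Path spillFile) throws IOException {
        this.fs = FileSystem.get(conf);
        this.spillFile = spillFile;
    }

    // Append one batch of tuples to the end of the spill file.
    public void spill(Iterator<Tuple> tuples) throws IOException {
        if (out == null) {
            out = fs.create(spillFile, true);
        }
        while (tuples.hasNext()) {
            // assumes Tuple's Writable-style write(DataOutput)
            tuples.next().write(out);
        }
        out.flush();
    }

    // Close the stream once the bag is fully spilled; read-back is serial,
    // so a single file is no worse than many local spill files.
    public void close() throws IOException {
        if (out != null) {
            out.close();
        }
    }
}
{code}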
> It should be possible to spill big databags to HDFS
> ---------------------------------------------------
>
> Key: PIG-96
> URL: https://issues.apache.org/jira/browse/PIG-96
> Project: Pig
> Issue Type: Improvement
> Components: data
> Reporter: Pi Song
>
> Currently databags only get spilled to local disk, which costs 2 disk I/O
> operations. If databags are too big, this is not efficient.
> We should take advantage of HDFS so if the databag is too big (determined by
> DataBag.getMemorySize() > a big threshold), let's spill it to HDFS. Also
> read from HDFS in parallel when data is required.
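
For illustration, a minimal sketch of the proposed decision point, assuming only the existing DataBag.getMemorySize(); the threshold and class names are hypothetical, since the issue only says "a big threshold".

{code:java}
import org.apache.pig.data.DataBag;

public class SpillTargetChooser {
    // Hypothetical configurable threshold, in bytes.
    private final long hdfsSpillThresholdBytes;

    public SpillTargetChooser(long hdfsSpillThresholdBytes) {
        this.hdfsSpillThresholdBytes = hdfsSpillThresholdBytes;
    }

    // True if the bag is big enough that spilling to HDFS is worth the extra
    // network hop; otherwise keep using local-disk spill files.
    public boolean shouldSpillToHdfs(DataBag bag) {
        return bag.getMemorySize() > hdfsSpillThresholdBytes;
    }
}
{code}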