[
https://issues.apache.org/jira/browse/PIG-96?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566557#action_12566557
]
Pi Song commented on PIG-96:
----------------------------
1) I would like to introduce rolling working paths, as in Hadoop, so that
spill files can be spread across multiple disks (this is only to address the
disk space issue). Parallelism can be considered later on.
2) My feeling is that a big bag is a very good candidate for disjoint-subset
parallelism. Let me dig deeper into the code and I will get back soon.
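Point 1 above could look roughly like the following. This is a hypothetical sketch, not actual Pig code: spill directories are cycled round-robin (similar in spirit to Hadoop's local directory allocation), so consecutive spill files land on different disks. All class and method names here are illustrative assumptions.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of "rolling working paths": each call to nextSpillDir()
// returns the next configured directory, cycling over all disks so that spill
// I/O and space usage are spread out. Not an actual Pig or Hadoop API.
public class RollingSpillDirs {
    private final List<Path> dirs;
    private final AtomicInteger next = new AtomicInteger(0);

    public RollingSpillDirs(List<Path> dirs) {
        if (dirs.isEmpty()) {
            throw new IllegalArgumentException("need at least one spill dir");
        }
        this.dirs = dirs;
    }

    /** Directory for the next spill file, chosen round-robin across disks. */
    public Path nextSpillDir() {
        int i = Math.floorMod(next.getAndIncrement(), dirs.size());
        return dirs.get(i);
    }

    public static void main(String[] args) {
        RollingSpillDirs r = new RollingSpillDirs(List.of(
                Paths.get("/disk1/tmp"), Paths.get("/disk2/tmp"), Paths.get("/disk3/tmp")));
        for (int n = 0; n < 4; n++) {
            System.out.println(r.nextSpillDir()); // cycles disk1, disk2, disk3, disk1
        }
    }
}
```

This only balances space and I/O; it does nothing for read parallelism, which is a separate question.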
> It should be possible to spill big databags to HDFS
> ---------------------------------------------------
>
> Key: PIG-96
> URL: https://issues.apache.org/jira/browse/PIG-96
> Project: Pig
> Issue Type: Improvement
> Components: data
> Reporter: Pi Song
>
> Currently databags only get spilled to local disk, which costs two disk I/O
> operations. If databags are too big, this is not efficient.
> We should take advantage of HDFS: if the databag is too big (determined by
> DataBag.getMemorySize() > a big threshold), let's spill it to HDFS. Also,
> read from HDFS in parallel when the data is required.
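The threshold check proposed in the issue could be sketched as below. This is a minimal illustration, assuming a size threshold constant and a two-way target choice; the names (chooseTarget, HDFS_SPILL_THRESHOLD) mirror the issue text but are not the actual Pig implementation, and DataBag.getMemorySize() would supply the byte count in practice.

```java
// Hypothetical sketch: once a bag's in-memory size crosses a large threshold,
// spill it to HDFS instead of local disk. Threshold value is an assumption.
public class SpillTarget {
    static final long HDFS_SPILL_THRESHOLD = 1L << 30; // assumed: 1 GB

    enum Target { LOCAL_DISK, HDFS }

    /** Decide where to spill, given the bag's in-memory size in bytes. */
    static Target chooseTarget(long memorySizeBytes) {
        return memorySizeBytes > HDFS_SPILL_THRESHOLD ? Target.HDFS : Target.LOCAL_DISK;
    }

    public static void main(String[] args) {
        System.out.println(chooseTarget(512L << 20)); // 512 MB bag -> LOCAL_DISK
        System.out.println(chooseTarget(4L << 30));   // 4 GB bag   -> HDFS
    }
}
```

The interesting part, reading the spilled bag back from HDFS in parallel, is not shown here and would depend on how the bag is partitioned into spill files.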
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.