[ https://issues.apache.org/jira/browse/PIG-96?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566200#action_12566200 ]
Alan Gates commented on PIG-96:
-------------------------------
DataBags are spilled only when they are too large for memory, so an individual
spill file isn't more than a few GB. All the spill files together could be
larger, so we could open one HDFS spill file and keep appending. But this
won't work in the sorted or distinct case. For the DefaultDataBag case we read
the various spill files back serially anyway, so whether they are on one disk
or many doesn't matter. The only case where writing to HDFS would help us here
is the case where the total bag exceeds the size of the local disk of the
machine.
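
For illustration only, here is a minimal sketch (not Pig's actual spill code) of the "open one HDFS spill file and keep appending" approach for the DefaultDataBag case. It assumes the standard Hadoop FileSystem API and Tuple's Writable-style write(DataOutput); the class and field names are hypothetical.

{code:java}
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.pig.data.Tuple;

// Sketch: append each in-memory batch of tuples to a single HDFS spill file.
// This only fits DefaultDataBag, where the order of spill chunks does not
// matter; sorted/distinct bags need separate files so they can be merged
// back together on read.
public class HdfsSpillSketch {
    private final FileSystem fs;
    private final Path spillFile;       // single spill file on HDFS (hypothetical)
    private FSDataOutputStream out;     // kept open so each spill appends

    public HdfsSpillSketch(Configuration conf, Path spillFile) throws IOException {
        this.fs = FileSystem.get(conf);
        this.spillFile = spillFile;
    }

    // Append one batch of tuples to the end of the spill file.
    public void spill(Iterator<Tuple> tuples) throws IOException {
        if (out == null) {
            out = fs.create(spillFile, true);
        }
        while (tuples.hasNext()) {
            // assumes Tuple's Writable-style write(DataOutput)
            tuples.next().write(out);
        }
        out.flush();
    }

    // Close the stream once the bag is fully spilled; read-back is serial,
    // so a single file is no worse than many local spill files.
    public void close() throws IOException {
        if (out != null) {
            out.close();
        }
    }
}
{code}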
> It should be possible to spill big databags to HDFS
> ---------------------------------------------------
>
> Key: PIG-96
> URL: https://issues.apache.org/jira/browse/PIG-96
> Project: Pig
> Issue Type: Improvement
> Components: data
> Reporter: Pi Song
>
> Currently databags only get spilled to local disk, which costs 2 disk I/O
> operations. If databags are too big, this is not efficient.
> We should take advantage of HDFS so if the databag is too big (determined by
> DataBag.getMemorySize() > a big threshold), let's spill it to HDFS. Also
> read from HDFS in parallel when data is required.
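
For illustration, a minimal sketch of the proposed decision point, assuming only the existing DataBag.getMemorySize(); the threshold and class names are hypothetical, since the issue only says "a big threshold".

{code:java}
import org.apache.pig.data.DataBag;

public class SpillTargetChooser {
    // Hypothetical configurable threshold, in bytes.
    private final long hdfsSpillThresholdBytes;

    public SpillTargetChooser(long hdfsSpillThresholdBytes) {
        this.hdfsSpillThresholdBytes = hdfsSpillThresholdBytes;
    }

    // True if the bag is big enough that spilling to HDFS is worth the extra
    // network hop; otherwise keep using local-disk spill files.
    public boolean shouldSpillToHdfs(DataBag bag) {
        return bag.getMemorySize() > hdfsSpillThresholdBytes;
    }
}
{code}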