[
https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772623#action_12772623
]
Thejas M Nair commented on PIG-1062:
------------------------------------
Even after the interface changes, Pig can compute the file size by adding up
the size of each split (from InputSplit.getLength()). The documentation of that
function in the interface does not make it clear whether this is the size on
disk, compressed or uncompressed, etc. Assuming it is the size on disk
(uncompressed), estimating the total memory it will require is a challenge: one
has to make assumptions about the compression ratio and the serialization method.
Using Tuple.getMemorySize() while sampling will give a more accurate estimate of
the memory the data will consume in the reducer.
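To make the comparison concrete, here is a rough sketch of the two estimates.
It does not use the real Hadoop or Pig classes: SizedSplit is a hypothetical
stand-in for InputSplit (only getLength() matters here), the sampled sizes
stand in for values returned by Tuple.getMemorySize(), and the expansion factor
and tuple count are assumptions the caller has to supply.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for Hadoop's InputSplit; only the length accessor
// is needed for this sketch.
interface SizedSplit {
    long getLength();
}

final class SplitMemoryEstimate {
    // Total input size as reported by the splits: the sum of their lengths.
    static long totalSplitBytes(List<? extends SizedSplit> splits) {
        long total = 0L;
        for (SizedSplit split : splits) {
            total += split.getLength();
        }
        return total;
    }

    // Estimate from split sizes alone. assumedExpansion is a guess that folds
    // together the compression ratio and the serialization overhead.
    static long estimateFromSplits(long totalBytes, double assumedExpansion) {
        return Math.round(totalBytes * assumedExpansion);
    }

    // Estimate from sampling: average in-memory size of the sampled tuples
    // multiplied by an (assumed) estimate of the total number of tuples.
    static long estimateFromSamples(long[] sampledTupleMemorySizes, long estimatedTupleCount) {
        if (sampledTupleMemorySizes.length == 0) {
            throw new IllegalArgumentException("need at least one sampled tuple");
        }
        long sum = 0L;
        for (long size : sampledTupleMemorySizes) {
            sum += size;
        }
        double average = (double) sum / sampledTupleMemorySizes.length;
        return Math.round(average * estimatedTupleCount);
    }

    public static void main(String[] args) {
        // Illustrative numbers only, not measurements.
        List<SizedSplit> splits = Arrays.<SizedSplit>asList(() -> 100L, () -> 250L);
        long onDisk = totalSplitBytes(splits);
        System.out.println("bytes reported by splits: " + onDisk);
        System.out.println("estimate from splits:     " + estimateFromSplits(onDisk, 2.0));
        System.out.println("estimate from samples:    "
                + estimateFromSamples(new long[] {40L, 60L}, 10L));
    }
}
```

The first estimate is only as good as the guessed expansion factor, while the
second is based on measured in-memory sizes and so avoids guessing the
compression ratio and serialization cost, at the price of needing a tuple count.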
> load-store-redesign branch: change SampleLoader and subclasses to work with
> new LoadFunc interface
> ---------------------------------------------------------------------------------------------------
>
> Key: PIG-1062
> URL: https://issues.apache.org/jira/browse/PIG-1062
> Project: Pig
> Issue Type: Sub-task
> Reporter: Thejas M Nair
> Assignee: Thejas M Nair
>
> This is part of the effort to implement new load store interfaces as laid out
> in http://wiki.apache.org/pig/LoadStoreRedesignProposal .
> PigStorage and BinStorage are now working.
> SampleLoader and its subclasses, RandomSampleLoader and PoissonSampleLoader,
> need to be changed to work with the new LoadFunc interface.
> Fixing SampleLoader and RandomSampleLoader will get order-by queries working.
> PoissonSampleLoader is used by skew join.