[jira] [Updated] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size

Koji Noguchi (JIRA) Wed, 20 Mar 2013 13:55:17 -0700

     [ 
https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Koji Noguchi updated PIG-3251:
------------------------------

    Attachment: pig-3251-trunk-v03.patch

bq. Makes sense, we shall move to the new approach for Hadoop 1.1.0+, use 
Bzip2TextInputFormat otherwise for backward compatibility.

Would something like this work? pig-3251-trunk-v03.patch uses 
PigTextInputFormat even for bzip2 input when TextInputFormat can split it. 
(I'll update the other FileInputLoadFunc implementations if this change looks 
OK.  Also, this works with the 'bz2' extension but not with 'bz' unless 
config is added.)
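
The 'bz2' versus 'bz' remark can be pictured with a small stand-alone sketch. 
It assumes codec lookup is driven purely by file-name suffix, which is how 
Hadoop's CompressionCodecFactory resolves codecs, and that only ".bz2" is 
registered for bzip2 by default. The class, the registry map, and the extra 
".bz" entry are hypothetical stand-ins, not code from the patch; the actual 
config needed is not specified in this comment.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of suffix-based codec lookup; not Pig or Hadoop code.
final class SuffixLookupSketch {

    /** Returns the codec name registered for the file's suffix, or null if none matches. */
    static String codecFor(Map<String, String> registry, String fileName) {
        for (Map.Entry<String, String> entry : registry.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null; // no codec: the file would be read as uncompressed text
    }

    public static void main(String[] args) {
        Map<String, String> registry = new HashMap<>();
        registry.put(".bz2", "bzip2"); // assumed default registration

        System.out.println(codecFor(registry, "part-00000.bz2")); // bzip2
        System.out.println(codecFor(registry, "part-00000.bz"));  // null

        // Stand-in for the "config is added" case: an extra suffix mapping.
        registry.put(".bz", "bzip2");
        System.out.println(codecFor(registry, "part-00000.bz"));  // bzip2
    }
}
```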


                
> Bzip2TextInputFormat requires double the memory of maximum record size
> ----------------------------------------------------------------------
>
>                 Key: PIG-3251
>                 URL: https://issues.apache.org/jira/browse/PIG-3251
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Koji Noguchi
>            Assignee: Koji Noguchi
>            Priority: Minor
>         Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch, 
> pig-3251-trunk-v03.patch
>
>
> While looking at a user's OOM heap dump, I noticed that Pig's 
> Bzip2TextInputFormat holds each record in memory twice: once in 
> Bzip2TextInputFormat.buffer (a ByteArrayOutputStream) 
> and once in the Text that is returned as the line.
> For example, with a single 160 MB record, the buffer was 268 MB and 
> the Text was 160 MB.  
> We can probably eliminate one of them.
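
The 268 MB figure is consistent with how java.io.ByteArrayOutputStream grows: 
its backing array doubles when it runs out of room, so the capacity can 
overshoot the data it holds. The sketch below only reproduces that 
arithmetic. It assumes the default initial capacity of 32 bytes and pure 
doubling, and it reads "160 MB" as 160,000,000 bytes; whether 
Bzip2TextInputFormat constructs its buffer with the default capacity is an 
assumption, and the class and method names are hypothetical.

```java
// Arithmetic sketch only; it does not allocate the memory it describes.
final class BufferGrowthSketch {

    /** Smallest capacity reached by repeated doubling that can hold requiredBytes. */
    static long capacityAfterDoubling(long initialCapacity, long requiredBytes) {
        long capacity = initialCapacity;
        while (capacity < requiredBytes) {
            capacity *= 2; // mirrors the doubling growth of the backing array
        }
        return capacity;
    }

    public static void main(String[] args) {
        long recordBytes = 160_000_000L;                           // one large record
        long bufferBytes = capacityAfterDoubling(32, recordBytes); // assumed default start

        System.out.println("buffer capacity:       " + bufferBytes);
        System.out.println("buffer plus Text copy: " + (bufferBytes + recordBytes));
    }
}
```

Under these assumptions the buffer capacity comes out at 268,435,456 bytes 
(2^28), which matches the size seen in the heap dump, and the copy into Text 
adds the record size on top of that.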

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
