[ 
https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated PIG-3251:
------------------------------

    Attachment: pig-3251-trunk-v02.patch

(1) 
Current status (before any patch)
||hadoop version   || PigTextInputFormat           || Bzip2TextInputFormat.java ||
| 0.20             | [i]  SLOW due to HADOOP-6109 | (iii) Needs EXTRA MEMORY. This Jira. |
| 0.23             | [ii] Good.                   | (iv) Needs EXTRA MEMORY. This Jira. |

(2) 
My initial patch (pig-3251-trunk-v01.patch) changes this to 
||hadoop version   || PigTextInputFormat           || Bzip2TextInputFormat.java ||
| 0.20             | [i]  SLOW due to HADOOP-6109 | (iii) Slow due to HADOOP-6109 |
| 0.23             | [ii] Good.                   | (iv) Good |

(3) 
If we can backport HADOOP-6109 to 0.20 and apply my pig-3251-trunk-v01.patch, 
all of the problems are solved.
||hadoop version   || PigTextInputFormat           || Bzip2TextInputFormat.java ||
| 0.20+HADOOP-6109 | [i]  Good                    | (iii) Good |
| 0.23             | [ii] Good.                   | (iv) Good |

However, I've seen a discussion about Pig continuing to support 0.20.2 users, 
so I guess we can't ask them to backport HADOOP-6109.


I think my remaining options are:
(a) Give up.  Wait until everyone upgrades to 0.23/2.0, or backport HADOOP-6109 
to Hadoop 1.2* and wait until Pig moves off 0.20.2/1.0.*. 
(b) Try to work around it without touching Hadoop code.

I think (a) is reasonable, but I tried (b).  This patch changes the status to 
the following.

(4) 
Patch (pig-3251-trunk-v02.patch) 
||hadoop version   || PigTextInputFormat           || Bzip2TextInputFormat.java ||
| 0.20             | [i]  SLOW due to HADOOP-6109 | (iii) Good |
| 0.23             | [ii] Good.                   | (iv) Good |


The penalty for not touching the Hadoop code is that my patch adds two 
unnecessary byte-array copies each time the Text is extended.  These happen 
infrequently because the size grows exponentially, so I hope the overall 
overhead is negligible.
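
As a rough sanity check on the "frequency is low" argument, here is a minimal 
sketch that only does the arithmetic (it allocates nothing and is not code 
from the patch).  The class and method names are hypothetical.  It assumes the 
buffer starts at 32 bytes, doubles whenever it is full, and that each growth 
step costs one necessary copy plus the two extra copies mentioned above.

```java
/** Hypothetical model of a doubling buffer; not taken from the Pig patch. */
final class GrowthCost {

    /** Outcome of growing a buffer by doubling until it can hold one record. */
    static final class Result {
        final int resizes;        // how many times the capacity doubled
        final long finalCapacity; // capacity once the record fits
        final long bytesCopied;   // worst-case bytes copied over all growth steps

        Result(int resizes, long finalCapacity, long bytesCopied) {
            this.resizes = resizes;
            this.finalCapacity = finalCapacity;
            this.bytesCopied = bytesCopied;
        }
    }

    /**
     * @param initialCapacity starting capacity in bytes (assumed, e.g. 32)
     * @param recordSize      size of the single record in bytes
     * @param copiesPerResize array copies per growth step (assumed: 1 necessary + extras)
     */
    static Result simulate(long initialCapacity, long recordSize, int copiesPerResize) {
        long capacity = initialCapacity;
        int resizes = 0;
        long copied = 0;
        while (capacity < recordSize) {
            // Worst case: the buffer is completely full each time it grows.
            copied += capacity * copiesPerResize;
            capacity *= 2;
            resizes++;
        }
        return new Result(resizes, capacity, copied);
    }

    public static void main(String[] args) {
        long record = 160_000_000L; // the 160MByte record from the issue description
        Result baseline = simulate(32, record, 1);
        Result patched = simulate(32, record, 3);
        System.out.println("resizes:           " + baseline.resizes);
        System.out.println("final capacity:    " + baseline.finalCapacity);
        System.out.println("copied (1x/grow):  " + baseline.bytesCopied);
        System.out.println("copied (3x/grow):  " + patched.bytesCopied);
    }
}
```

Under these assumptions a 160MByte record needs only 23 growth steps, and the 
final capacity is 268,435,456 bytes, which lines up with the 268MBytes buffer 
seen in the heap dump.  The total copied is a small constant multiple of the 
record size, not something that grows with the number of bytes appended.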

                
> Bzip2TextInputFormat requires double the memory of maximum record size
> ----------------------------------------------------------------------
>
>                 Key: PIG-3251
>                 URL: https://issues.apache.org/jira/browse/PIG-3251
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Koji Noguchi
>            Assignee: Koji Noguchi
>            Priority: Minor
>         Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch
>
>
> While looking at a user's OOM heap dump, I noticed that Pig's 
> Bzip2TextInputFormat consumes memory in both
> Bzip2TextInputFormat.buffer (ByteArrayOutputStream) 
> and the actual Text that is returned as the line.
> For example, with one 160MByte record, the buffer was 268MBytes and the 
> Text was 160MBytes.  
> We can probably eliminate one of them.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
