[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Koji Noguchi updated PIG-3251:
------------------------------
    Attachment: pig-3251-trunk-v02.patch

(1) Current status (before any patch)

||hadoop version || PigTextInputFormat || Bzip2TextInputFormat.java ||
| 0.20 | [i] SLOW due to HADOOP-6109 | (iii) Needs EXTRA MEMORY. This Jira. |
| 0.23 | [ii] Good. | (iv) Needs EXTRA MEMORY. This Jira. |

(2) My initial patch (pig-3251-trunk-v01.patch) changes this to

||hadoop version || PigTextInputFormat || Bzip2TextInputFormat.java ||
| 0.20 | [i] SLOW due to HADOOP-6109 | (iii) Slow due to HADOOP-6109 |
| 0.23 | [ii] Good. | (iv) Good |

(3) If we can backport HADOOP-6109 to 0.20 and apply my pig-3251-trunk-v01.patch, that solves all the problems.

||hadoop version || PigTextInputFormat || Bzip2TextInputFormat.java ||
| 0.20+HADOOP-6109 | [i] Good | (iii) Good |
| 0.23 | [ii] Good. | (iv) Good |

However, I've seen a discussion about pig supporting 0.20.2 users, so I guess we can't ask them to backport HADOOP-6109.

I think my remaining options are
(a) Give up. Wait till everyone upgrades to 0.23/2.0, or backport HADOOP-6109 to hadoop 1.2* and wait till pig moves off from 0.20.2/1.0.*.
(b) Try to work around it without touching hadoop code.

I think (a) is reasonable, but I tried (b). This patch changes the status as shown below.

(4) Patch (pig-3251-trunk-v02.patch)

||hadoop version || PigTextInputFormat || Bzip2TextInputFormat.java ||
| 0.20 | [i] SLOW due to HADOOP-6109 | (iii) Good |
| 0.23 | [ii] Good. | (iv) Good |

The penalty of not touching the hadoop code is that my patch adds two unnecessary byte array copies when extending the Text size. But the frequency is low due to exponentially increasing sizes, so I hope the overall overhead is negligible.
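
To illustrate the remark about exponentially increasing sizes, here is a minimal standalone sketch. It is not the code in pig-3251-trunk-v02.patch and does not use Hadoop's Text class; the class name GrowableLine, the 32-byte starting capacity, and the 4 KB chunk size are all assumptions made for the example. It only shows that when a byte array doubles each time it runs out of room, the number of reallocations grows with the logarithm of the record size, so a few extra copies per reallocation stay cheap relative to the record itself.

```java
import java.util.Arrays;

// Hypothetical stand-in for a growable line buffer; NOT Hadoop's Text.
public class GrowableLine {
    private byte[] bytes = new byte[32]; // assumed starting capacity
    private int length = 0;
    private int reallocations = 0;       // how often the array had to grow
    private long bytesCopied = 0;        // bytes moved because of growth

    // Append len bytes from src, doubling the backing array when it is full.
    public void append(byte[] src, int off, int len) {
        int needed = length + len;
        if (needed > bytes.length) {
            // Grow to at least double the old capacity (no overflow guard here).
            int newCapacity = Math.max(needed, bytes.length * 2);
            bytes = Arrays.copyOf(bytes, newCapacity);
            reallocations++;
            bytesCopied += length;
        }
        System.arraycopy(src, off, bytes, length, len);
        length += len;
    }

    public int length() { return length; }
    public int capacity() { return bytes.length; }
    public int reallocations() { return reallocations; }
    public long bytesCopied() { return bytesCopied; }

    public static void main(String[] args) {
        GrowableLine line = new GrowableLine();
        byte[] chunk = new byte[4096];
        // Build a 1 MiB record out of 4 KB chunks.
        for (int i = 0; i < 256; i++) {
            line.append(chunk, 0, chunk.length);
        }
        System.out.println("length        = " + line.length());
        System.out.println("reallocations = " + line.reallocations());
        System.out.println("bytes copied  = " + line.bytesCopied());
    }
}
```

With doubling, the total number of bytes copied during growth stays below the final record length, so even a small constant number of extra copies per growth step adds only a bounded multiple of the record size.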
> Bzip2TextInputFormat requires double the memory of maximum record size
> ----------------------------------------------------------------------
>
>                 Key: PIG-3251
>                 URL: https://issues.apache.org/jira/browse/PIG-3251
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Koji Noguchi
>            Assignee: Koji Noguchi
>            Priority: Minor
>         Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch
>
>
> While looking at user's OOM heap dump, noticed that pig's Bzip2TextInputFormat consumes memory at both
> Bzip2TextInputFormat.buffer (ByteArrayOutputStream)
> and actual Text that is returned as line.
> For example, when having one record with 160MBytes, buffer was 268MBytes and Text was 160MBytes.
> We can probably eliminate one of them.
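
The 268MBytes figure in the description is consistent with a buffer that grows by doubling, which is how java.io.ByteArrayOutputStream expands when bytes are written a few at a time. The sketch below is only arithmetic: the class and method names are made up for the example, the 32-byte starting capacity is the ByteArrayOutputStream default (whether Pig's buffer was created with that default is an assumption), and a 160 MB record is taken as 160 * 1024 * 1024 bytes.

```java
// Hypothetical helper that reproduces the capacity arithmetic only.
public class BufferCapacity {

    // Smallest capacity reachable by repeated doubling that holds requiredBytes.
    public static long capacityAfterDoubling(long initialCapacity, long requiredBytes) {
        if (initialCapacity <= 0) {
            throw new IllegalArgumentException("initialCapacity must be positive");
        }
        long capacity = initialCapacity;
        while (capacity < requiredBytes) {
            capacity *= 2;
        }
        return capacity;
    }

    public static void main(String[] args) {
        long record = 160L * 1024 * 1024; // one 160 MB record, as in the report
        long buffer = capacityAfterDoubling(32, record);
        // 2^27 is too small for the record, so doubling lands on 2^28 bytes.
        System.out.println("buffer capacity = " + buffer + " bytes");
        System.out.println("buffer + Text   = " + (buffer + record) + " bytes");
    }
}
```

Under these assumptions the buffer ends at 2^28 = 268,435,456 bytes, which matches the reported 268MBytes, and holding the buffer and the returned Text together costs well over twice the record size.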