[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Koji Noguchi updated PIG-3251:
------------------------------
    Attachment: pig-3251-trunk-v03.patch

bq. Makes sense, we shall move to the new approach for Hadoop 1.1.0+, use Bzip2TextInputFormat otherwise for backward compatibility.

Would something like this work? pig-3251-trunk-v03.patch uses PigTextInputFormat even for bzip if TextInputFormat can split them.

(I'll update the other FileInputLoadFunc if this change looks ok. Also, this works with 'bz2' extension but not for 'bz' unless config is added.)

> Bzip2TextInputFormat requires double the memory of maximum record size
> ----------------------------------------------------------------------
>
>                 Key: PIG-3251
>                 URL: https://issues.apache.org/jira/browse/PIG-3251
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Koji Noguchi
>            Assignee: Koji Noguchi
>            Priority: Minor
>         Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch, pig-3251-trunk-v03.patch
>
>
> While looking at user's OOM heap dump, noticed that pig's Bzip2TextInputFormat consumes memory at both
> Bzip2TextInputFormat.buffer (ByteArrayOutputStream)
> and actual Text that is returned as line.
> For example, when having one record with 160MBytes, buffer was 268MBytes and Text was 160MBytes.
> We can probably eliminate one of them.
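
The 268MBytes figure in the description is consistent with how java.io.ByteArrayOutputStream grows its backing array: in OpenJDK the array is doubled whenever it runs out of room. The sketch below only simulates that arithmetic and does not allocate anything. The class and method names are hypothetical, and it assumes the default initial capacity of 32 bytes and small incremental writes; the actual initial size used by Bzip2TextInputFormat.buffer is not stated in this issue.

```java
// Hypothetical illustration, not code from Pig or from any attached patch.
class BufferGrowth {

    // Returns the backing-array size reached by repeated doubling,
    // starting from initialCapacity, once bytesWritten bytes fit.
    // Assumption: writes are small, so every growth step is a plain doubling.
    static long grownCapacity(long initialCapacity, long bytesWritten) {
        long capacity = initialCapacity;
        while (capacity < bytesWritten) {
            capacity <<= 1;
        }
        return capacity;
    }

    public static void main(String[] args) {
        long record = 160L * 1000 * 1000;         // one 160MByte record
        long capacity = grownCapacity(32, record); // 32 = ByteArrayOutputStream default
        // prints "buffer capacity: 268435456 bytes"
        System.out.println("buffer capacity: " + capacity + " bytes");
        // The returned Text holds its own copy of the record on top of the buffer.
        // prints "buffer + Text copy: 428435456 bytes"
        System.out.println("buffer + Text copy: " + (capacity + record) + " bytes");
    }
}
```

Under these assumptions the buffer alone lands on 2^28 bytes, and the second copy held in Text brings the total to roughly 2.7 times the record size, which is why eliminating one of the two copies matters for very large records.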