[jira] [Commented] (HIVE-4248) Implement a memory manager for ORC
[ https://issues.apache.org/jira/browse/HIVE-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091872#comment-14091872 ]

Lefty Leverenz commented on HIVE-4248:

This added configuration parameter *hive.exec.orc.memory.pool* to HiveConf.java in 0.11.0. It's documented in the wiki here:
* [Configuration Properties -- hive.exec.orc.memory.pool | https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.exec.orc.memory.pool]

Implement a memory manager for ORC
Key: HIVE-4248
URL: https://issues.apache.org/jira/browse/HIVE-4248
Project: Hive
Issue Type: New Feature
Components: Serializers/Deserializers
Reporter: Owen O'Malley
Assignee: Owen O'Malley
Fix For: 0.11.0
Attachments: HIVE-4248.D9993.1.patch, HIVE-4248.D9993.2.patch, HIVE-4248.D9993.4.patch

With the large default stripe size (256MB) and dynamic partitions, it is quite easy for users to run out of memory when writing ORC files. We probably need a solution that keeps track of the total number of concurrent ORC writers and divides the available heap space between them.

-- This message was sent by Atlassian JIRA (v6.2#6252)
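The idea in the issue description (track the concurrent ORC writers and divide the available heap between them) can be sketched roughly as below. This is only an illustration and not the code from the attached patches: the class name `PoolSketch`, its methods, and the rule of scaling every writer by pool / total requested are assumptions made for the example. Only the 256MB stripe size comes from the description.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; this is not the MemoryManager class from the patch.
class PoolSketch {
    private final long poolBytes;                                 // memory shared by all writers
    private final Map<String, Long> requested = new HashMap<>();  // path -> requested stripe size
    private long totalRequested = 0;

    PoolSketch(long poolBytes) {
        this.poolBytes = poolBytes;
    }

    // Register a writer and the stripe size it would like to buffer.
    void addWriter(String path, long stripeBytes) {
        Long old = requested.put(path, stripeBytes);
        totalRequested += stripeBytes - (old == null ? 0 : old);
    }

    // Forget a writer once it has been closed.
    void removeWriter(String path) {
        Long old = requested.remove(path);
        if (old != null) {
            totalRequested -= old;
        }
    }

    // 1.0 while all requests fit in the pool, otherwise the fraction each writer may use.
    double allocationScale() {
        if (totalRequested <= poolBytes) {
            return 1.0;
        }
        return (double) poolBytes / totalRequested;
    }

    // Bytes a given writer may buffer before it should flush its stripe.
    long limitFor(String path) {
        Long stripeBytes = requested.get(path);
        if (stripeBytes == null) {
            throw new IllegalArgumentException("unknown writer: " + path);
        }
        return (long) (stripeBytes * allocationScale());
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        PoolSketch pool = new PoolSketch(512 * mb);
        for (int i = 0; i < 4; i++) {
            pool.addWriter("part-" + i, 256 * mb);
        }
        System.out.println("scale = " + pool.allocationScale());
        System.out.println("limit per writer (MB) = " + pool.limitFor("part-0") / mb);
    }
}
```

With this rule a writer never fails outright when the pool is oversubscribed; it is simply told to flush smaller stripes.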
[jira] [Commented] (HIVE-4248) Implement a memory manager for ORC
[ https://issues.apache.org/jira/browse/HIVE-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13637710#comment-13637710 ]

Hudson commented on HIVE-4248:

Integrated in Hive-trunk-h0.21 #2073 (See [https://builds.apache.org/job/Hive-trunk-h0.21/2073/])
HIVE-4248 : Implement a memory manager for ORC (Owen Omalley via Ashutosh Chauhan) (Revision 1470249)

Result = FAILURE
hashutosh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1470249
Files :
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcFile.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcOutputFormat.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/WriterImpl.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestFileDump.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestOrcFile.java

Implement a memory manager for ORC
Key: HIVE-4248
URL: https://issues.apache.org/jira/browse/HIVE-4248
Project: Hive
Issue Type: New Feature
Components: Serializers/Deserializers
Reporter: Owen O'Malley
Assignee: Owen O'Malley
Fix For: 0.12.0
Attachments: HIVE-4248.D9993.1.patch, HIVE-4248.D9993.2.patch, HIVE-4248.D9993.4.patch

With the large default stripe size (256MB) and dynamic partitions, it is quite easy for users to run out of memory when writing ORC files. We probably need a solution that keeps track of the total number of concurrent ORC writers and divides the available heap space between them.

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-4248) Implement a memory manager for ORC
[ https://issues.apache.org/jira/browse/HIVE-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13637468#comment-13637468 ]

Hudson commented on HIVE-4248:

Integrated in Hive-trunk-hadoop2 #168 (See [https://builds.apache.org/job/Hive-trunk-hadoop2/168/])
HIVE-4248 : Implement a memory manager for ORC (Owen Omalley via Ashutosh Chauhan) (Revision 1470249)

Result = FAILURE
hashutosh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1470249
Files :
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcFile.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcOutputFormat.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/WriterImpl.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestFileDump.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestOrcFile.java
[jira] [Commented] (HIVE-4248) Implement a memory manager for ORC
[ https://issues.apache.org/jira/browse/HIVE-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13636917#comment-13636917 ]

Phabricator commented on HIVE-4248:

ashutoshc has accepted the revision HIVE-4248 [jira] Implement a memory manager for ORC.

+1 will commit if tests pass.

REVISION DETAIL https://reviews.facebook.net/D9993
BRANCH h-4248
ARCANIST PROJECT hive
To: JIRA, ashutoshc, omalley
Cc: kevinwilfong
[jira] [Commented] (HIVE-4248) Implement a memory manager for ORC
[ https://issues.apache.org/jira/browse/HIVE-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626957#comment-13626957 ]

Owen O'Malley commented on HIVE-4248:

Kevin,

After thinking about it a bit more, how about if I ask the writers to re-check their memory relative to their allocation when the pool has shrunk by more than 10% from the last time they checked?

I ran a quick experiment where I had a pool of 1GB and an increasing set of 250MB writers. By only doing the check when the pool has changed by more than 10%, as 1000 writers were added it cut down the number of checks from 1000 to 49.

Does that sound reasonable?
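The experiment described in this comment can be approximated with a short simulation. This is a rough sketch, not the patch's logic: it assumes "the pool has shrunk by more than 10%" means the per-writer allocation scale (pool / total requested, capped at 1.0) has dropped below 90% of its value at the previous check, and the class and method names are made up for the example. Because those details are guesses, it is not expected to reproduce the reported figure of 49 exactly, only the same order of magnitude.

```java
// Hypothetical simulation of the "re-check only after a 10% change" rule.
class RecheckSketch {

    // Returns how many times all writers would be asked to re-check their memory
    // while `writers` writers of `writerMb` each are added to a pool of `poolMb`.
    static int countChecks(long poolMb, long writerMb, int writers) {
        double lastCheckedScale = 1.0;  // scale seen at the most recent re-check
        long requestedMb = 0;
        int checks = 0;
        for (int i = 0; i < writers; i++) {
            requestedMb += writerMb;
            double scale = Math.min(1.0, (double) poolMb / requestedMb);
            // Only trigger a re-check once the scale has dropped by more than 10%.
            if (scale < 0.9 * lastCheckedScale) {
                checks++;
                lastCheckedScale = scale;
            }
        }
        return checks;
    }

    public static void main(String[] args) {
        // 1GB pool, 250MB writers, 1000 writers added one at a time.
        System.out.println("checks with 10% rule: " + countChecks(1024, 250, 1000));
        System.out.println("checks without it:    1000");
    }
}
```

Since each re-check requires the scale to fall by a constant factor, the number of checks grows with the logarithm of the writer count rather than linearly, which addresses the concern that every added partition makes the checks slower.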
[jira] [Commented] (HIVE-4248) Implement a memory manager for ORC
[ https://issues.apache.org/jira/browse/HIVE-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626037#comment-13626037 ]

Phabricator commented on HIVE-4248:

kevinwilfong has commented on the revision HIVE-4248 [jira] Implement a memory manager for ORC.

This allows for cases where the memory used could exceed the amount of memory allocated by significant amounts. E.g. say totalMemoryPool = 256 MB = stripe size, and say we have a writer that writes 255 MB to a stripe. Then a second writer is created (e.g. a new dynamic partition value is encountered) and all new rows get written to this second writer. Then nothing will get written out until the second writer accumulates 128 MB of data in its stripe, using a total of 383 MB against the allocated 256 MB. In theory, with some terrible luck, these could be chained together to use significantly more memory (first writer writes 255 MB, second writes 127 MB, third writes 85 MB, etc.)

Could you loop through the stripes whenever a writer is added (shouldn't happen too frequently) and check if the estimated stripe size of any of these writers exceeds the value of stripeSize * memoryManager.getAllocationScale()? This should be doable by making a couple of methods public and storing a reference to the WriterImpl along with or instead of the Path.

Also (could be done in a follow-up), could there be an additional check to see what the total HeapMemoryUsage is? E.g. in the shouldBeFlushed method of GroupByOperator, every 1000 rows, it checks whether more than 90% of the total heap has been used, and if so it flushes the hash map. Something similar could be done for WriterImpl, and given the MemoryManager, it could even flush the largest stripe, rather than just the one that pushed it over the edge.

This would be particularly useful given that in the case of a map join followed by a map aggregation, the mapjoin is allowed to use 55% of the memory and the group by another 30%. If there was also a FileSinkOperator, allowing the ORC WriterImpl to use 50% could be too much.

INLINE COMMENTS
common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:490 could you add this to conf/hive-default.xml.template as well.

REVISION DETAIL https://reviews.facebook.net/D9993
To: JIRA, omalley
Cc: kevinwilfong
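The overshoot arithmetic in this review comment can be worked through in a few lines. The sketch below is hypothetical and only mirrors the example: it assumes the pool equals one stripe (256 MB), that the k-th writer's limit is stripe / k at the moment it is the newest writer, and that each writer stops 1 MB short of that limit just before the next writer appears. The names `OvershootSketch`, `worstCaseBufferedMb`, and `exceedsAllocation` are invented here and are not from the patch.

```java
// Hypothetical worked example of the chained overshoot and the proposed check.
class OvershootSketch {

    // Total MB buffered if writer k fills to just under stripe/k and then goes idle.
    static long worstCaseBufferedMb(long stripeMb, int writers) {
        long total = 0;
        for (int k = 1; k <= writers; k++) {
            long limitMb = (stripeMb + k - 1) / k;  // ceil(stripe / k)
            total += limitMb - 1;                   // stays just below its limit
        }
        return total;
    }

    // The proposed check: does a writer's estimated stripe exceed its scaled allocation?
    static boolean exceedsAllocation(long bufferedMb, long stripeMb, double scale) {
        return bufferedMb > stripeMb * scale;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            System.out.println(n + " writer(s): " + worstCaseBufferedMb(256, n) + " MB buffered");
        }
        // When the second writer is added the scale drops to 0.5, so the first
        // writer (255 MB buffered) is over its 128 MB share and could be flushed.
        System.out.println("flush first writer? " + exceedsAllocation(255, 256, 0.5));
    }
}
```

The per-writer terms are 255, 127, and 85 MB, matching the chain in the comment. The total grows roughly like the stripe size times a harmonic sum, which is why running the check whenever a writer is added bounds the overshoot.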
[jira] [Commented] (HIVE-4248) Implement a memory manager for ORC
[ https://issues.apache.org/jira/browse/HIVE-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626193#comment-13626193 ]

Phabricator commented on HIVE-4248:

omalley has commented on the revision HIVE-4248 [jira] Implement a memory manager for ORC.

I agree that it can overshoot, but it won't likely be by that much. Of course the normal case is that the dynamic partitions are distributed randomly, in which case the current version will do fine. Granted, if the data is already sorted by the dynamic partition, it will not do well.

Ok, I'll add a check when we add a new partition. I was just concerned that with each new partition addition, it will take longer and longer to do all of the checks.

REVISION DETAIL https://reviews.facebook.net/D9993
To: JIRA, omalley
Cc: kevinwilfong
[jira] [Commented] (HIVE-4248) Implement a memory manager for ORC
[ https://issues.apache.org/jira/browse/HIVE-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13624184#comment-13624184 ]

Phabricator commented on HIVE-4248:

omalley updated the revision HIVE-4248 [jira] Implement a memory manager for ORC.

removed other patch

Reviewers: JIRA
REVISION DETAIL https://reviews.facebook.net/D9993
CHANGE SINCE LAST DIFF https://reviews.facebook.net/D9993?vs=31311&id=31317#toc
AFFECTED FILES
metastore/src/java/org/apache/hadoop/hive/metastore/PartitionNameWhitelistPreEventListener.java
metastore/src/test/org/apache/hadoop/hive/metastore/TestPartitionNameWhitelistPreEventHook.java
To: JIRA, omalley
[jira] [Commented] (HIVE-4248) Implement a memory manager for ORC
[ https://issues.apache.org/jira/browse/HIVE-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616691#comment-13616691 ]

Owen O'Malley commented on HIVE-4248:

This may result in ORC files with smaller stripes, but that seems far better than letting users hit out-of-memory exceptions.