Gopal and Prasanth-

Thanks for the info, guys. My particular table has ~300 columns, so 
https://issues.apache.org/jira/browse/HIVE-7250 was not kicking in for me. I set 
hive.exec.orc.default.buffer.size to 32 KB and made sure 
hive.optimize.sort.dynamic.partition=true, and things are flying for me now.
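
In case it helps anyone searching later, the statements I ran look roughly like 
this (Hive CLI; the buffer size is in bytes, so 32 KB = 32768):

  set hive.exec.orc.default.buffer.size=32768;
  set hive.optimize.sort.dynamic.partition=true;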

Thanks!!

Sean

On Oct 14, 2014, at 12:16 PM, Prasanth Jayachandran 
<pjayachand...@hortonworks.com> wrote:


On Oct 14, 2014, at 10:34 AM, Gopal V <gop...@apache.org> wrote:

On 10/13/14, 10:53 PM, Sean McNamara wrote:

I’ve found a condition where the MemoryManager will wait too long before 
notifying writers to check their memory and flush.
...
This issue affects anyone who is writing a lot of columns, very large columns, 
or, worst of all, both. I have tested and confirmed this issue on Hive 0.12, 
0.13, and trunk.

Can you post the exact query? This OOM is on my list of already-fixed 
performance issues (HIVE-6455).

I have tested Hive 0.13 partitioned inserts with just "insert into table select 
*" for both 30 TB of data and 10,000 columns.

https://issues.apache.org/jira/browse/HIVE-7250 adaptively chooses the ORC 
compression buffer size based on the number of columns and available memory, to 
further reduce memory pressure. This code path kicks in only for tables with 
more than 1,000 columns.
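
For tables below that threshold, the compression buffer can also be pinned down 
per table rather than globally; assuming the standard ORC table property, that 
looks something like:

  create table wide_orc (...) stored as orc
  tblproperties ("orc.compress.size"="32768");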


This issue happens in Hive 0.12 and earlier, which keep too many ORC files open 
at the same time.

If you are on Hive 0.13 or later, setting the config option 
hive.optimize.sort.dynamic.partition=true should fix this issue.

This follows a path within the FileSinkOperator that keeps exactly one stripe 
open at any given time, so this always works correctly as long as the 
orc.stripe.size fits within memory.
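
If even a single stripe is too big for your container heap, the stripe size 
itself is configurable; as an illustration, 67108864 bytes (64 MB):

  set hive.exec.orc.default.stripe.size=67108864;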

The issue is on line 50 of MemoryManager: ROWS_BETWEEN_CHECKS = 5000;

For large or many columns it’s easy to hit GC issues or OOM before 5k rows are 
written.

I believe that rows-between-checks should be made a configuration parameter 
that can be passed
in on the JobConf.

5,000 rows is probably the wrong thing to check, for sure - but it is a sane 
default. Perhaps it could instead check every time a stride index is written 
(which is every 10,000 rows by default) or some fraction of that.

But that check produces bad ORC files and still doesn't fix the actual issue - 
this is merely postponing the inevitable.

Let me describe the errors I hit before we had the sort.dynamic.partition 
implementation.

At multi-terabyte scale, the next error you will hit is an HDFS 
LeaseExpiredException; then the system runs out of file handles, and after that 
it runs out of stack for DFSOutputStream threads.

Even if you don't go that far, the memory manager doesn't slice memory all the 
way down to a single row. The minimum size of a stripe is num-cols * 
compress-size; we can't shrink the stripe size below that.
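
To put rough numbers on that (assuming the default 256 KB compression buffer): a 
300-column table needs at least 300 * 256 KB, about 75 MB, per open writer, and 
with dozens of partitions being written at once that alone can exhaust a 
container heap. Dropping the buffer to 32 KB only lowers that floor to roughly 
9.4 MB per writer.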

The trouble is that with tiny stripes of less than 1 MB, the read path suffers 
heavily, split generation becomes incredibly expensive, and the inter-stripe 
padding becomes a significant fraction of the HDFS space used (up to 47% of the 
space will be padding).

So you can submit a patch that exposes this as a JobConf option to work around 
it, but it will generate sub-optimal ORC files.

The scalable & logically correct fix is already there in Hive, you have to make 
sure the config option is on.

FYI, the Hive plan we generate corresponds to an MRv2 example which combines 
LazyOutputFormat with MultipleOutputs to produce similar results.

Not sure if a similar option exists in Pig.

Cheers,
Gopal

