Shubham Chaurasia created HIVE-22221:
----------------------------------------

             Summary: Llap external client - Need to reduce 
LlapBaseInputFormat#getSplits() footprint  
                 Key: HIVE-22221
                 URL: https://issues.apache.org/jira/browse/HIVE-22221
             Project: Hive
          Issue Type: Bug
          Components: llap, UDF
            Reporter: Shubham Chaurasia
            Assignee: Shubham Chaurasia


When querying through the LLAP external client, LlapBaseInputFormat#getSplits() 
invokes the get_splits() UDTF (GenericUDTFGetSplits) under the hood.

GenericUDTFGetSplits returns LlapInputSplit objects, in which planBytes[] 
occupies around 90% of the split size.
Depending on data size/partitions and the plan, an LlapInputSplit can grow up 
to 1 MB, with planBytes[] being common to all the splits and occupying more 
than 850 KB. 
This sometimes causes an OOM on HS2, depending on the HS2 heap size.

This can be resolved by separating the common parts from the actual splits and 
reassembling them on the client side. 
We can also provide an option that lets the client opt out of automatic 
reassembly and take control of reassembling into its own hands.

Splits can be broken up like this:
1) schema split
2) plan split
3) actual split 1
4) actual split 2 ... and so on.
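
To illustrate the idea, here is a minimal sketch of the split/reassemble 
scheme. It is not Hive code: SplitPackingSketch, FullSplit, PackedSplits, 
pack() and reassemble() are hypothetical names used only for illustration, and 
the real LlapInputSplit carries more fields than the three byte arrays assumed 
here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these types are Hive classes.
final class SplitPackingSketch {

    /** Stand-in for a split as returned today: common parts repeated in each one. */
    static final class FullSplit {
        final byte[] schemaBytes;   // common to all splits
        final byte[] planBytes;     // common to all splits, the bulk of the size
        final byte[] fragmentBytes; // the only part that differs per split

        FullSplit(byte[] schemaBytes, byte[] planBytes, byte[] fragmentBytes) {
            this.schemaBytes = schemaBytes;
            this.planBytes = planBytes;
            this.fragmentBytes = fragmentBytes;
        }

        long size() {
            return (long) schemaBytes.length + planBytes.length + fragmentBytes.length;
        }
    }

    /** What would be sent instead: schema once, plan once, then the actual splits. */
    static final class PackedSplits {
        final byte[] schemaBytes;
        final byte[] planBytes;
        final List<byte[]> fragments;

        PackedSplits(byte[] schemaBytes, byte[] planBytes, List<byte[]> fragments) {
            this.schemaBytes = schemaBytes;
            this.planBytes = planBytes;
            this.fragments = fragments;
        }

        long size() {
            long total = (long) schemaBytes.length + planBytes.length;
            for (byte[] f : fragments) {
                total += f.length;
            }
            return total;
        }
    }

    /** Server side: pull the common parts out of the splits. */
    static PackedSplits pack(List<FullSplit> splits) {
        if (splits.isEmpty()) {
            throw new IllegalArgumentException("no splits to pack");
        }
        FullSplit first = splits.get(0);
        List<byte[]> fragments = new ArrayList<>(splits.size());
        for (FullSplit s : splits) {
            // The scheme only works if schema and plan really are identical everywhere.
            if (!Arrays.equals(s.schemaBytes, first.schemaBytes)
                    || !Arrays.equals(s.planBytes, first.planBytes)) {
                throw new IllegalArgumentException("schema/plan differ between splits");
            }
            fragments.add(s.fragmentBytes);
        }
        return new PackedSplits(first.schemaBytes, first.planBytes, fragments);
    }

    /** Client side: rebuild full splits; all of them share the same plan array. */
    static List<FullSplit> reassemble(PackedSplits packed) {
        List<FullSplit> out = new ArrayList<>(packed.fragments.size());
        for (byte[] f : packed.fragments) {
            out.add(new FullSplit(packed.schemaBytes, packed.planBytes, f));
        }
        return out;
    }

    /** Sum of the sizes of full splits when each one carries its own copy. */
    static long totalSize(List<FullSplit> splits) {
        long total = 0;
        for (FullSplit s : splits) {
            total += s.size();
        }
        return total;
    }
}
```

A client that does not want automatic reassembly would simply consume the 
packed form directly and skip the reassemble() step.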

This greatly reduces memory usage on the server side (in my case, from 5 GB 
for ~5000 splits to around 15 MB), and hence the amount of data transferred. 
It also eliminates the OOM on the HS2 side.

cc [~jdere] [~sankarh] [~thejas]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
