Eugene Koifman created HIVE-11683:
-------------------------------------
Summary: Hive Streaming may overload the metastore
Key: HIVE-11683
URL: https://issues.apache.org/jira/browse/HIVE-11683
Project: Hive
Issue Type: Bug
Components: HCatalog, Hive, Transactions
Affects Versions: 1.0.0
Reporter: Eugene Koifman
Assignee: Roshan Naik
HiveEndPoint represents a way to write to a specific partition transactionally.
Each HiveEndPoint creates TransactionBatch(es) and commits transactions.
Suppose you have 10 instances of the Storm Hive bolt using the Streaming API.
Each instance creates HiveEndPoints on demand when it sees an event for a
particular partition value.
If events are uniformly distributed with respect to partition values and the
table has 1000 partitions (for example, it is partitioned by CustomerId), each
of the 10 bolt instances may create 1000 HiveEndPoints, for 10,000
HiveEndPoints in total and thus more than 10,000 (actually 10K *
num_txn_per_batch) concurrent transactions.
This creates a huge amount of Metastore traffic.
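
A rough sketch of the arithmetic above, assuming the worst case in which every
bolt instance sees events for every partition. The function name and the value
txns_per_batch=10 are illustrative assumptions, not values taken from Hive or
Storm.

```python
def open_transactions(bolt_instances, partitions, txns_per_batch):
    """Worst case: every bolt instance opens an endpoint for every partition."""
    endpoints = bolt_instances * partitions
    # Each endpoint holds one TransactionBatch of txns_per_batch transactions.
    return endpoints, endpoints * txns_per_batch


# 10 bolts and 1000 partitions as in the description; batch size is assumed.
endpoints, txns = open_transactions(10, 1000, 10)
print(endpoints, txns)  # 10000 100000
```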
HIVE-11672 is investigating how some sort of "shuffle" phase can be added to
route events for a particular bucket to the same bolt instance.
The same idea should be explored to route events based on partition value.
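
One way to picture partition-based routing is a stable hash of the partition
value modulo the number of bolt instances. This is only a sketch under that
assumption, not the design proposed in HIVE-11672; route() and
endpoints_needed() are hypothetical helpers, and CRC32 is just one convenient
stable hash.

```python
import zlib


def route(partition_value, num_instances):
    """Pick a bolt instance from a stable hash of the partition value."""
    # crc32 is stable across processes, unlike Python's built-in str hash.
    return zlib.crc32(partition_value.encode("utf-8")) % num_instances


def endpoints_needed(partition_values, num_instances):
    """Count endpoints across all instances when events are routed by partition."""
    per_instance = [set() for _ in range(num_instances)]
    for value in partition_values:
        per_instance[route(value, num_instances)].add(value)
    return sum(len(values) for values in per_instance)


# 1000 hypothetical CustomerId partition values spread over 10 bolt instances.
customer_ids = ["customer-%d" % i for i in range(1000)]
print(endpoints_needed(customer_ids, 10))  # 1000
```

With this kind of routing each partition value is handled by exactly one
instance, so the total is bounded by the number of partitions (1000) rather
than instances * partitions (10,000). In Storm, a fields grouping on the
partition column gives this same-value-to-same-task behavior.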
cc [~alangates],[~sriharsha],[~rbains]
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)