Quanlong Huang created IMPALA-6729:
--------------------------------------
Summary: Provide startup option to disable block location cache
Key: IMPALA-6729
URL: https://issues.apache.org/jira/browse/IMPALA-6729
Project: IMPALA
Issue Type: New Feature
Components: Catalog
Reporter: Quanlong Huang
In HDFS, scheduling PlanFragments according to block locations can improve the
locality of queries. However, there are scenarios in which loading and keeping
the block locations brings no benefit and sometimes even becomes a burden.
{panel:title=Scenario 1}
In a Hadoop cluster with ~1000 nodes, the Impala cluster is deployed on only
tens of compute nodes (i.e. nodes with small disks but larger memory and
powerful CPUs). Data locality is poor since most of the blocks have no replicas
on the Impala nodes. Network bandwidth is 1 Gbit/s, so remote reads are
acceptable. Queries are only required to finish within 5 minutes.
Block location info is useless since the scheduler always comes up with the
same plan.
{panel}
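A minimal sketch of Scenario 1, under one reading of the claim: when no block replica lives on an Impala node, a locality-aware assignment produces the same result with or without block locations. This is a toy model, not Impala's actual scheduler; the function `assign_scan_ranges` and all host and block names are hypothetical.

```python
from itertools import cycle


def assign_scan_ranges(block_replicas, executors, use_locations=True):
    """Toy assignment: prefer an executor that holds a replica of the block,
    otherwise fall back to a remote read on the next executor (round-robin)."""
    remote = cycle(sorted(executors))
    plan = {}
    for block in sorted(block_replicas):
        replicas = block_replicas[block]
        # Executors that hold a replica of this block (empty if locations are ignored).
        local = sorted(set(replicas) & set(executors)) if use_locations else []
        plan[block] = local[0] if local else next(remote)
    return plan


# Hypothetical hosts: three Impala executors, replicas only on other DataNodes.
executors = {"impala-1", "impala-2", "impala-3"}
block_replicas = {
    "blk_1": ["dn-101", "dn-417", "dn-880"],
    "blk_2": ["dn-033", "dn-512", "dn-764"],
    "blk_3": ["dn-209", "dn-390", "dn-955"],
    "blk_4": ["dn-101", "dn-512", "dn-955"],
}

with_locations = assign_scan_ranges(block_replicas, executors, use_locations=True)
without_locations = assign_scan_ranges(block_replicas, executors, use_locations=False)
print(with_locations == without_locations)  # True
```

The two plans only diverge once some replica is co-located with an executor, which is rare in the deployment described above.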
{panel:title=Scenario 2}
load_catalog_in_background is set to false since there are several PB of data
in the Hive warehouse. If it is set to true, the Impala cluster cannot start
up: catalogd waits on loading block locations, which eventually fills its
memory and crashes it.
Accessing a Hive table containing >10,000 partitions for the first time gets
stuck for a long time. Sometimes it cannot even finish for some large tables.
Users are annoyed when they only want to describe the table or select a few
partitions of it.
Block location info is a burden here since loading it dominates the query
time. In the end, only a small portion of the block location info is used.
{panel}
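To give a rough sense of scale for Scenario 2, the sketch below counts how many block location entries a catalog would have to load per PiB of data. It assumes the HDFS defaults of 128 MiB blocks and 3 replicas and assumes every file fills its blocks completely; none of these figures come from the cluster described here, and a warehouse with many small files would have more blocks. The helper `block_location_entries` is hypothetical.

```python
PIB = 2**50
MIB = 2**20


def block_location_entries(data_bytes, block_size=128 * MIB, replication=3):
    """Return (blocks, replica locations) for data_bytes of fully packed blocks.

    block_size and replication are assumed HDFS defaults, not measured values.
    """
    blocks = -(-data_bytes // block_size)  # ceiling division
    return blocks, blocks * replication


blocks, locations = block_location_entries(1 * PIB)
print(blocks, locations)  # 8388608 25165824
```

Under these assumptions, every PiB adds roughly 25 million replica locations to load and keep in memory, before counting any per-file or per-partition metadata.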
IMPALA-3127 is related, but we hope it is possible to not keep the block
locations.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)