Quanlong Huang created IMPALA-6729:
--------------------------------------

             Summary: Provide startup option to disable block location cache
                 Key: IMPALA-6729
                 URL: https://issues.apache.org/jira/browse/IMPALA-6729
             Project: IMPALA
          Issue Type: New Feature
          Components: Catalog
            Reporter: Quanlong Huang


In HDFS, scheduling PlanFragments according to block locations can improve the 
locality of queries. However, there are scenarios in which loading and keeping 
the block locations brings no benefit and sometimes even becomes a burden.
{panel:title=Scenario 1}
In a Hadoop cluster with ~1000 nodes, the Impala cluster is deployed on only 
tens of computation nodes (i.e. nodes with small disks but larger memory and 
powerful CPUs). Data locality is poor since most of the blocks have no replicas 
on the Impala nodes. Network bandwidth is 1Gbit/s, so remote reads are 
acceptable. Queries are only required to finish within 5 minutes.
 
Block location info is useless since the scheduler always comes up with the 
same plan.
{panel}
{panel:title=Scenario 2}
load_catalog_in_background is set to false since there are several PB of data 
in the Hive warehouse. If it is set to true, the Impala cluster cannot start 
up: it waits for the block locations to load, which eventually fills the memory 
of catalogd and crashes it.
Accessing a Hive table containing >10,000 partitions for the first time hangs 
for a long time. For some large tables it sometimes cannot finish at all. 
Users are annoyed when they only want to describe the table or select a few 
partitions of it.
 
Block location info is a burden here since loading it dominates the query 
time. In the end, only a small portion of the block location info is used.
{panel}
IMPALA-3127 is related, but we hope it will be possible to not keep the block 
locations.


