[ 
https://issues.apache.org/jira/browse/IMPALA-6729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-6729:
-----------------------------------
    Labels: catalog-2024  (was: )

> Provide startup option to disable file and block location cache
> ---------------------------------------------------------------
>
>                 Key: IMPALA-6729
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6729
>             Project: IMPALA
>          Issue Type: New Feature
>          Components: Catalog
>            Reporter: Quanlong Huang
>            Priority: Major
>              Labels: catalog-2024
>         Attachments: Screen Shot 2018-05-04 at 12.12.21 PM.png
>
>
> In HDFS, scheduling PlanFragments according to block locations can improve 
> query locality. However, there are scenarios in which loading and keeping 
> the block locations brings no benefit and sometimes even becomes a burden.
> {panel:title=Scenario 1}
> In a Hadoop cluster with ~1000 nodes, the Impala cluster is deployed on only 
> tens of compute nodes (i.e. nodes with small disks but larger memory and 
> powerful CPUs). Data locality is poor since most blocks have no replica on 
> the Impala nodes. Network bandwidth is 1 Gbit/s, which is adequate for 
> remote reads. Queries are only required to finish within 5 minutes.
>  
> Block location info is useless since the scheduler always comes up with the 
> same plan.
> {panel}
> {panel:title=Scenario 2}
> load_catalog_in_background is set to false since there are several PB of 
> data in the Hive warehouse. If it is set to true, the Impala cluster cannot 
> start up: it waits for block locations to load, which eventually fills up 
> catalogd's memory and crashes it.
> Accessing a Hive table containing >10,000 partitions for the first time 
> stalls for a long time, and for some large tables it never finishes. Users 
> are annoyed when they only want to describe the table or select a few 
> partitions from it.
>  
> Block location info is a burden here since loading it dominates the query 
> time, and in the end only a small portion of it is actually used.
> {panel}
> {panel:title=Scenario 3}
> Many ETL pipelines ingest data into the Hive warehouse. Some tables are 
> updated by replacing the whole data set. Some partitioned tables are updated 
> by inserting new partitions.
> Ad hoc queries used to be served by Presto. To replace Presto with Impala, 
> we would have to add a REFRESH table step at the end of each pipeline, which 
> takes great effort (many code changes in the existing warehouse).
> IMPALA-4272 could solve this but has made no progress. If the file and block 
> location metadata cache could be disabled, things would be simple.
> {panel}
> IMPALA-3127 is related, but we hope it is possible to not keep the block 
> locations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
