[
https://issues.apache.org/jira/browse/IMPALA-6729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Quanlong Huang updated IMPALA-6729:
-----------------------------------
Labels: catalog-2024 (was: )
> Provide startup option to disable file and block location cache
> ---------------------------------------------------------------
>
> Key: IMPALA-6729
> URL: https://issues.apache.org/jira/browse/IMPALA-6729
> Project: IMPALA
> Issue Type: New Feature
> Components: Catalog
> Reporter: Quanlong Huang
> Priority: Major
> Labels: catalog-2024
> Attachments: Screen Shot 2018-05-04 at 12.12.21 PM.png
>
>
> In HDFS, scheduling PlanFragments according to block locations can improve
> the locality of queries. However, there are some scenarios in which loading
> and keeping the block locations brings no benefit and sometimes even becomes
> a burden.
> {panel:title=Scenario 1}
> In a Hadoop cluster with ~1000 nodes, the Impala cluster is deployed on only
> tens of computation nodes (i.e. with small disks but larger memory and
> powerful CPUs). Data locality is poor since most of the blocks have no
> replicas on the Impala nodes. Network bandwidth is 1Gbit/s, so remote reads
> are acceptable. Queries are only required to finish within 5 minutes.
>
> Block location info is useless since the scheduler always comes up with the
> same plan.
> {panel}
> {panel:title=Scenario 2}
> load_catalog_in_background is set to false since there are several PB of data
> in the Hive warehouse. If it is set to true, the Impala cluster cannot start
> up (it waits for the block locations to load, which eventually fills the
> memory of catalogd and crashes it).
> Accessing a Hive table containing >10,000 partitions for the first time
> hangs for a long time. Sometimes it cannot even finish for some large
> tables. Users are annoyed when they only want to describe the table or select
> a few partitions of it.
>
> Block location info is a burden here since loading it dominates the query
> time. In the end, only a small portion of the block location info is used.
> {panel}
> {panel:title=Scenario 3}
> There are many ETL pipelines ingesting data into the Hive warehouse. Some
> tables are updated by replacing the whole data set. Some partitioned tables
> are updated by inserting new partitions.
> Ad hoc queries have been served by Presto. To replace Presto with Impala, we
> would have to add a REFRESH table step at the end of each pipeline, which
> takes great effort (many code changes in the existing warehouse).
> IMPALA-4272 could solve this but has made no progress. If the file and block
> location metadata cache could be disabled, things would be simple.
> {panel}
> IMPALA-3127 is related. But we hope it's possible to not keep the block
> locations.