Michael Ho created IMPALA-8341:
----------------------------------
Summary: Data cache for remote reads
Key: IMPALA-8341
URL: https://issues.apache.org/jira/browse/IMPALA-8341
Project: IMPALA
Issue Type: Improvement
Components: Backend
Affects Versions: Impala 3.2.0
Reporter: Michael Ho
Assignee: Michael Ho
When running in a public cloud (e.g. AWS with S3) or in certain private cloud
settings (e.g. data stored in an object store), computation and storage are no
longer co-located. This breaks the typical pattern in which Impala query
fragment instances are scheduled where the data is located. In this setting,
the network bandwidth required of both the NICs and the top-of-rack switches
increases substantially, because the network traffic now includes the data
fetches in addition to the exchange traffic that shuffles intermediate results.
To mitigate the pressure on the network, one can build a storage-backed cache
on the compute nodes to cache the working set. With deterministic scan range
scheduling, a given scan range is always assigned to the same compute node, so
each node should hold a non-overlapping partition of the data set.
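As a rough illustration of why deterministic scheduling yields non-overlapping
cache contents, here is a minimal Python sketch. It is not Impala's scheduler:
it uses rendezvous (highest-random-weight) hashing as a stand-in technique, and
the function names, node names and S3 paths are all hypothetical.

```python
import hashlib
from collections import defaultdict


def _score(node, key):
    # Stable hash of (node, scan range); identical on every host and every run.
    digest = hashlib.sha256(f"{node}|{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def assign_node(nodes, filename, offset, length):
    """Pick the node for one scan range (hypothetical helper).

    Rendezvous hashing: the node with the highest score for this range wins.
    Sorting the nodes makes the result independent of the input order.
    """
    key = f"{filename}:{offset}:{length}"
    return max(sorted(nodes), key=lambda n: _score(n, key))


def partition_scan_ranges(nodes, scan_ranges):
    """Group scan ranges by assigned node; each range lands on exactly one node."""
    by_node = defaultdict(list)
    for filename, offset, length in scan_ranges:
        node = assign_node(nodes, filename, offset, length)
        by_node[node].append((filename, offset, length))
    return dict(by_node)


if __name__ == "__main__":
    MB = 1024 * 1024
    nodes = ["node-1", "node-2", "node-3"]  # assumed executor names
    ranges = [
        (f"s3://bucket/table/part-{i}.parq", off * 8 * MB, 8 * MB)
        for i in range(4)
        for off in range(4)
    ]
    for node, assigned in sorted(partition_scan_ranges(nodes, ranges).items()):
        print(node, len(assigned))
```

Because the assignment depends only on the scan range and the set of nodes,
repeated queries over the same data send each range to the same node, so that
node's cache is the only one that needs to hold it.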
A prototype of the cache was posted here: https://gerrit.cloudera.org/#/c/12683/