Hey Todd,

thanks for putting that together; that area certainly needs some work.
I think there are a number of good ideas in the proposal, and I also
think there are a number of ways to augment it to make it even better:
1. The size of the cache is obviously a result of the caching
granularity (or lack thereof), and switching to a bounded size with
more granularity seems like a good choice.
2. However, one of the biggest usability problems stems from the need
for manual metadata discovery (via Invalidate/Refresh). Introducing a
TTL for cached objects seems like a bad approach, because it creates a
conflict between metadata freshness and caching
effectiveness/performance. This isn't really something that users want
to have to trade off.
3. Removing the role of a central "metadata curator" effectively
closes the door on doing automated metadata discovery in the future
(which in turn seems like a prerequisite for services like automated
format conversion, i.e., anything that needs to react to the presence of
new data).

To address those concerns, I am proposing the following changes:
1. Keep the role of a central metadata curator (let's call it metad
just for the sake of the argument, so as not to confuse it with the
existing catalogd).
2. Partition the namespace in some way (for instance, into a fixed
number of topics, with each database assigned to a topic) and allow
coordinators to join and drop topics as needed in order to stay within
their cache size bounds.
3. The improvements you suggest for streamlining the code, avoiding
in-place updates, and making metadata retrieval more efficient still
apply here; they're just executed by a different process.
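
To make item 2 a bit more concrete, here is a rough sketch in Python of
how a coordinator might map databases to topics and join/drop topics to
stay within its cache bound. Everything in it is an assumption for the
sake of illustration, not existing Impala code: the topic count of 16,
the CRC32-based assignment, the least-recently-used drop policy, and the
names topic_for_database and CoordinatorCache are all made up.

```python
import zlib
from collections import OrderedDict

NUM_TOPICS = 16  # assumed fixed topic count; the real number is an open question


def topic_for_database(db_name: str, num_topics: int = NUM_TOPICS) -> int:
    """Map a database name to a topic id.

    CRC32 is used (rather than Python's built-in hash) so that every
    process computes the same topic for the same database name.
    """
    return zlib.crc32(db_name.encode("utf-8")) % num_topics


class CoordinatorCache:
    """Hypothetical coordinator-side view: joined topics under a size bound."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        # topic id -> metadata size in bytes, ordered from least to most
        # recently used.
        self.joined = OrderedDict()

    def used_bytes(self) -> int:
        return sum(self.joined.values())

    def touch(self, db_name: str, topic_bytes: int) -> int:
        """Join (or refresh) the topic that holds db_name.

        If the cache exceeds its bound afterwards, drop the least recently
        used topics, but never the one that was just requested.
        """
        topic = topic_for_database(db_name)
        self.joined.pop(topic, None)
        self.joined[topic] = topic_bytes
        while self.used_bytes() > self.max_bytes and len(self.joined) > 1:
            self.joined.popitem(last=False)
        return topic
```

A coordinator would call touch() whenever a query references a database;
the central curator (metad in the naming above) would then only need to
push updates for a topic to the coordinators currently joined to it.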

On Tue, May 22, 2018 at 9:27 PM, Todd Lipcon <[email protected]> wrote:
> Hey Impala devs,
>
> Over the past 3 weeks I have been investigating various issues with
> Impala's treatment of metadata. Based on data from a number of user
> deployments, and after discussing the issues with a number of Impala
> contributors and committers, I've come up with a proposal for a new design.
> I've also developed a prototype to show that the approach is workable and
> is likely to achieve its goals.
>
> Rather than describe the design in duplicate, I've written up a proposal
> document here:
> https://docs.google.com/document/d/1WcUQ7nC3fzLFtZLofzO6kvWdGHFaaqh97fC_PvqVGCk/edit?ts=5b04a6b8#
>
> Please take a look and provide any input, questions, or concerns.
>
> Additionally, if any users on this list have experienced metadata-related
> problems in the past and would be willing to assist in testing or
> contribute workloads, please feel free to respond to me either on or off
> list.
>
> Thanks
> Todd
