Hey Todd,

thanks for putting that together; that area certainly needs some work. I think there are a number of good ideas in the proposal, and I also think there are a number of ways to augment it to make it even better:

1. The size of the cache is obviously a result of the caching granularity (or lack thereof), and switching to a bounded size with more granularity seems like a good choice.

2. However, one of the biggest usability problems stems from the need for manual metadata discovery (via Invalidate/Refresh). Introducing a TTL for cached objects seems like a bad approach, because it creates a conflict between metadata freshness and caching effectiveness/performance. This isn't really something that users want to have to trade off.

3. Removing the role of a central "metadata curator" effectively closes the door on doing automated metadata discovery in the future (which in turn seems like a prerequisite for services like automated format conversion, i.e., anything that needs to react to the presence of new data).
To address those concerns, I am proposing the following changes:

1. Keep the role of a central metadata curator (let's call it metad just for the sake of the argument, so as not to confuse it with the existing catalogd).

2. Partition the namespace in some way (for instance, into a fixed number of topics, with each database assigned to a topic) and allow coordinators to join and drop topics as needed in order to stay within their cache size bounds.

3. The improvements you suggest for streamlining the code, avoiding in-place updates, and making metadata retrieval more efficient still apply here; they're just executed by a different process.

On Tue, May 22, 2018 at 9:27 PM, Todd Lipcon <[email protected]> wrote:

> Hey Impala devs,
>
> Over the past 3 weeks I have been investigating various issues with
> Impala's treatment of metadata. Based on data from a number of user
> deployments, and after discussing the issues with a number of Impala
> contributors and committers, I've come up with a proposal for a new design.
> I've also developed a prototype to show that the approach is workable and
> is likely to achieve its goals.
>
> Rather than describe the design in duplicate, I've written up a proposal
> document here:
> https://docs.google.com/document/d/1WcUQ7nC3fzLFtZLofzO6kvWdGHFaaqh97fC_PvqVGCk/edit?ts=5b04a6b8#
>
> Please take a look and provide any input, questions, or concerns.
>
> Additionally, if any users on this list have experienced metadata-related
> problems in the past and would be willing to assist in testing or
> contribute workloads, please feel free to respond to me either on or off
> list.
>
> Thanks
> Todd
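
P.S. To make the topic-partitioning idea (point 2 of the proposed changes) a bit more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not anything that exists in Impala: the names NUM_TOPICS, topic_for_database, and Coordinator are hypothetical, the CRC32-based assignment is just one possible stable mapping, and least-recently-used eviction is one possible policy for deciding which topic to drop.

```python
import zlib
from collections import OrderedDict

NUM_TOPICS = 16  # hypothetical fixed number of topics


def topic_for_database(db_name: str, num_topics: int = NUM_TOPICS) -> int:
    """Assign a database to a topic using a hash that is stable across processes."""
    return zlib.crc32(db_name.encode("utf-8")) % num_topics


class Coordinator:
    """Toy coordinator that joins and drops topics to stay within a cache budget."""

    def __init__(self, max_cached_bytes: int):
        self.max_cached_bytes = max_cached_bytes
        # topic id -> approximate size of its cached metadata, oldest use first
        self.joined = OrderedDict()

    def cached_bytes(self) -> int:
        return sum(self.joined.values())

    def touch(self, db_name: str, topic_sizes: dict) -> int:
        """Make sure the topic holding db_name is joined, dropping others if needed."""
        topic = topic_for_database(db_name)
        if topic in self.joined:
            # Already subscribed: just mark the topic as recently used.
            self.joined.move_to_end(topic)
            return topic
        # Join the topic (assumed sizes are supplied by the caller for this sketch).
        self.joined[topic] = topic_sizes.get(topic, 0)
        # Drop least recently used topics until under budget; keep the new one.
        while self.cached_bytes() > self.max_cached_bytes and len(self.joined) > 1:
            self.joined.popitem(last=False)
        return topic
```

In this sketch a coordinator only ever holds metadata for the topics its queries have touched recently, so the cache bound is enforced at topic granularity while the central curator remains the single source of updates for each topic.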
