On Thu, Jun 7, 2018 at 9:27 AM, Marcel Kornacker <[email protected]> wrote:

> Hey Todd,
>
> thanks for putting that together, that area certainly needs some work.
> I think there are a number of good ideas in the proposal, and I also
> think there are a number of ways to augment it to make it even better:
> 1. The size of the cache is obviously a result of the caching
> granularity (or lack thereof), and switching to a bounded size with
> more granularity seems like a good choice.
> 2. However, one of the biggest usability problems stems from the need
> for manual metadata discovery (via Invalidate/Refresh). Introducing a
> TTL for cached objects seems like a bad approach, because it creates a
> conflict between metadata freshness and caching
> effectiveness/performance. This isn't really something that users want
> to have to trade off.
>

Agreed. In the "future directions" part of the document I included one item
about removing the need for invalidate/refresh, perhaps by subscribing to
the push notification mechanisms provided by HMS/NN/S3/etc. With those
notification mechanisms, the cache TTL could be dialed way up (perhaps to
infinity), relying only on cache capacity for eviction.

In some cases the notification mechanisms provide enough information to
update the cache in place (i.e., the notification carries the new metadata),
whereas in others they only contain enough information to perform an
invalidation. Either way, from the user's perspective it achieves
near-real-time freshness.
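
To make that concrete, here's a minimal sketch of what applying such an
event could look like. MetadataCache and NotificationEvent are hypothetical
stand-ins for illustration, not existing Impala or HMS classes:

// Illustrative sketch only: MetadataCache and NotificationEvent are
// hypothetical stand-ins, not existing Impala or HMS classes.
interface MetadataCache {
  void put(String objectName, Object newMetadata);
  void invalidate(String objectName);
}

interface NotificationEvent {
  String objectName();          // e.g. "db.table" or a partition id
  boolean carriesNewMetadata(); // does the event include the new object?
  Object newMetadata();
}

class NotificationApplier {
  private final MetadataCache cache;

  NotificationApplier(MetadataCache cache) { this.cache = cache; }

  void onEvent(NotificationEvent event) {
    if (event.carriesNewMetadata()) {
      // Rich events (e.g. an ALTER that ships the new descriptor) can be
      // applied in place, keeping the entry warm.
      cache.put(event.objectName(), event.newMetadata());
    } else {
      // Otherwise we only learn that something changed; evict and let the
      // next query re-fetch on demand.
      cache.invalidate(event.objectName());
    }
  }
}

The point is just that rich events keep entries warm, while bare-bones
events degrade gracefully to an invalidate-and-refetch path.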

Note, though, that absent transactional coordination between Impala and the
source systems, we'll always be in the realm of "soft guarantees". That is
to say, the notification mechanisms are all asynchronous and could be
backlogged due to issues outside our control (e.g., momentary network
congestion between HMS and Impala, or unknown issues on the S3/SNS backend).
As you pointed out in the doc, "soft guarantees" are somewhat weak when you
come at it from the perspective of an application developer. So, even with
those mechanisms, we may need some explicit command like "sync metadata;" or
some such that acts like a memory barrier.

I think trying to bite all of that off in one go will increase the scope of
the project quite a bit, though. Did you have some easier mechanism in mind
to improve usability here that I'm missing?

> 3. Removing the role of a central "metadata curator" effectively
> closes the door on doing automated metadata discovery in the future
> (which in turn seems like a prerequisite for services like automated
> format conversion, ie, anything that needs to react to the presence of
> new data).
>

That's fair. Something like catalogd may still have its place: it could
centralize the subscription to the notification services and re-broadcast
the resulting events to the cluster as necessary. My thinking was that it
looks sufficiently different from the current catalogd, though, that it
would be a new component rather than an evolution of the existing
catalogd-statestore-impalad metadata propagation protocol.


> To address those concerns, I am proposing the following changes:
> 1. Keep the role of a central metadata curator (let's call it metad
> just for the sake of the argument, so as not to confuse it with the
> existing catalogd).
>

Per above, I think this could be useful.


> 2. Partition the namespace in some way (for instance, into a fixed
> number of topics, with each database assigned to a topic) and allow
> coordinators to join and drop topics as needed in order to stay within
> their cache size bounds.
>

I don't think partitioning at database granularity is sufficient. We've
found that even individual tables can cause metadata updates in excess of
1GB, even though those tables are infrequently accessed and, when they are
accessed, often only for a very small subset of their partitions. Often
these mega tables live in the same database as the rest of the tables
accessed by a workload, and asking users to manually partition them out
lets physical implementation concerns bleed into logical schema design
decisions. That's quite a hassle, especially when things like security
policies are enforced on a per-database basis, right?
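
To illustrate the granularity I have in mind, here's a minimal sketch of a
bounded, partition-level cache (Guava's cache is used purely for
illustration; the key/value shapes are invented, not actual Impala types).
With keys at partition granularity and a byte-based weigher, a 1GB mega
table only consumes cache space proportional to the partitions a workload
actually touches:

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.Weigher;

// Illustrative only: keys are "db/table/partition" strings and values are
// opaque serialized partition descriptors, not actual Impala types.
class PartitionLevelCache {
  // Weigh entries by serialized size so the bound is on memory, not count.
  private static final Weigher<String, byte[]> BY_BYTES =
      (key, value) -> key.length() + value.length;

  private final Cache<String, byte[]> cache = CacheBuilder.newBuilder()
      .maximumWeight(256L * 1024 * 1024)  // e.g. a 256MB metadata budget
      .weigher(BY_BYTES)
      .build();

  byte[] getIfPresent(String db, String table, String partition) {
    return cache.getIfPresent(db + "/" + table + "/" + partition);
  }

  void put(String db, String table, String partition, byte[] descriptor) {
    cache.put(db + "/" + table + "/" + partition, descriptor);
  }
}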

> 3. The improvements you suggest for streamlining the code, avoiding
> in-place updates, and making metadata retrieval more efficient still
> apply here, they're just executed by a different process.
>

I think a critical piece of the design, though, is that the incremental and
fine-grained metadata retrieval extends all the way down to the impalad,
and not just to the centralized coordinator. If we are pushing metadata at
database (or even table) granularity from the catalog to the impalads, we
won't reap the major benefits described here.
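
As a strawman for what "fine-grained all the way down" could mean at the RPC
level, something like the following request/response shapes (invented for
illustration, not an actual Impala interface) would let an impalad ask for
exactly the slices of a table its query needs:

import java.util.List;

// Invented for illustration only; not an actual Impala RPC definition.
class TableMetadataRequest {
  String dbName;
  String tableName;
  List<String> wantedPartitions;  // only the partitions the planner pruned in
  List<String> wantedColumns;     // only the columns the query references
  boolean wantStats;              // table/column stats can be requested separately
}

class TableMetadataResponse {
  long catalogVersion;                // lets the impalad detect staleness
  byte[] schemaForWantedColumns;      // serialized schema subset
  List<byte[]> partitionDescriptors;  // file/block info, one per wanted partition
}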

Here's another approach we could consider long term:
- keep a central daemon responsible for interacting with source systems
- impalads don't directly access source systems, but instead send metadata
fetch requests to the central daemon
- impalads maintain a small cache of their own, and the central daemon
maintains a larger cache. If an impalad requests some info that the central
daemon doesn't have, it fetches it on demand from the source.
- the central daemon propagates invalidations to impalads

In essence, it's like the current catalog design, but based on a
fine-grained "pull" design instead of a coarse-grained "push/broadcast"
design.
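
Concretely, the read path on the impalad side might look like the sketch
below; LocalCache, CentralMetadataService, and the key format are all
hypothetical:

// Sketch of the pull-based read path: local cache -> central daemon -> source.
// All of these interfaces are hypothetical and for illustration only.
interface LocalCache {
  byte[] getIfPresent(String key);
  void put(String key, byte[] value);
  void invalidate(String key);
}

interface CentralMetadataService {
  // RPC to the central daemon, which consults its own larger cache and, on a
  // miss there, fetches from HMS/NN/S3 on demand.
  byte[] fetch(String key);
}

class ImpaladMetadataProvider {
  private final LocalCache local;
  private final CentralMetadataService central;

  ImpaladMetadataProvider(LocalCache local, CentralMetadataService central) {
    this.local = local;
    this.central = central;
  }

  byte[] get(String key) {
    byte[] cached = local.getIfPresent(key);
    if (cached != null) return cached;
    // Cache miss: one extra hop to the central daemon rather than going
    // straight to the source system.
    byte[] fetched = central.fetch(key);
    local.put(key, fetched);
    return fetched;
  }

  // Invalidations pushed from the central daemon just evict; the next access
  // re-pulls on demand.
  void onInvalidation(String key) {
    local.invalidate(key);
  }
}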

This allows impalads to scale out without linearly increasing the amount of
load on source systems. The downsides, though, are:
- extra hop between impalad and source system on a cache miss
- extra complexity in failure cases (the central daemon becomes a SPOF, and
if we replicate it we have more complex consistency to worry about)
- the scalability benefits may only really matter on large clusters; small
clusters can often be served adequately by just one or two coordinator
impalads.

-Todd


> On Tue, May 22, 2018 at 9:27 PM, Todd Lipcon <[email protected]> wrote:
> > Hey Impala devs,
> >
> > Over the past 3 weeks I have been investigating various issues with
> > Impala's treatment of metadata. Based on data from a number of user
> > deployments, and after discussing the issues with a number of Impala
> > contributors and committers, I've come up with a proposal for a new
> > design.
> > I've also developed a prototype to show that the approach is workable and
> > is likely to achieve its goals.
> >
> > Rather than describe the design in duplicate, I've written up a proposal
> > document here:
> > https://docs.google.com/document/d/1WcUQ7nC3fzLFtZLofzO6kvWdGHFaaqh97fC_PvqVGCk/edit?ts=5b04a6b8#
> >
> > Please take a look and provide any input, questions, or concerns.
> >
> > Additionally, if any users on this list have experienced metadata-related
> > problems in the past and would be willing to assist in testing or
> > contribute workloads, please feel free to respond to me either on or off
> > list.
> >
> > Thanks
> > Todd
>



-- 
Todd Lipcon
Software Engineer, Cloudera
